Module 1

Using RStudio for basic statistics

Aims and objectives for Module 1

  1. Understand sampling and the simple principles of experimental design
  2. Read in a data file (*.csv)
  3. Run some summary statistics
  4. Understand the simple concepts of experimental design
  5. Design a simple experiment and test for a difference in two means
  6. Open a Project in R
  7. Become familiar with using R and RStudio for simple statistics
  8. Review some plots available in R (later in Module 6 we will provide code and examples)

This course is self-paced, free, and available 24 hours a day, 7 days a week. It is being developed in response to scientists’ needs. If you like it, please send us feedback. We are hoping to make it especially useful with feedback and suggestions from our users. We hope students, early career scientists, lecturers, and tutors will all find the materials useful.


Register for the course and answer a few questions

(Please take this simple survey)

To allow us to gather some data on the users of the course, we would appreciate it if you would fill out a short Pre-Knowledge Survey. At the end of the module we will use similar questions to allow you to see how your skills have improved.


Before you start Module 1 you will need to install R and RStudio to be able to complete the demonstrations and tasks. This only needs to be done once. For how to install R and RStudio, please refer to Module 0.

After you have installed the software you will be able to Visualise, Summarise and Analyse data freely. In Module 0 there are also guides to assist in reading in data, using Excel and basic tasks with R. These are considered to be the prerequisites for Module 1 and so will need to be completed before going any further.

When you are ready, try downloading the demo documents (provided as pdf) and work through the commands in R. Also look under technical asides (under the Support menu above) and make sure you have a clear idea of the measures of central tendency and measures of spread for data distributions. Make sure you can calculate the mean and the standard deviation of a small sample.
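
As a quick check of the last point, here is a minimal sketch of calculating a mean and a standard deviation by hand and comparing them with R's built-in functions. The five values are made up for illustration and are not from the course data.

```r
# Hypothetical small sample (e.g. five plant heights in cm); values are made up
x <- c(12, 15, 11, 14, 13)

n <- length(x)

# Mean: the sum of the values divided by the sample size
mean_by_hand <- sum(x) / n

# Sample standard deviation: square root of the sum of squared deviations
# from the mean, divided by (n - 1)
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

mean_by_hand
sd_by_hand

# Compare with R's built-in functions
mean(x)
sd(x)

# Other common measures of central tendency and spread
median(x)
range(x)
var(x)
```

The hand calculation and the built-in `mean()` and `sd()` should agree; note that `sd()` uses the n - 1 divisor.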

Open a Project in RStudio for this Module

Bringing data into R and some simple checks

Data can be saved in Excel, and then saved as a *.csv, which is easier to bring into R.
In Module 0 you learned how to bring in data as a *.csv file.
Here we look at the data as it comes into R and check it.

Each line in the data set should be an individual plot or experimental unit, and the columns are the variables.
In addition, we need to understand whether the data has all or some of the following:

  1. An identity (ID) column (this could be a pot number in a glasshouse study, a plot in a field study, a household ID in a survey, or the pen in a livestock study)
  2. Structural design columns – we mean here blocks or rows and columns
  3. Treatment design columns – we mean here the treatments we are interested in (e.g. crop variety)
  4. Our measurement variables – this could be yield, plant height, animal weight, household size etc.

This will become clearer when we have designed and run some experiments.

Use this demonstration to import data into R: Importing data into R (pdf)
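
The checks described in this section can be sketched as follows. This is only an illustration: the file is a small made-up data set written to a temporary location so the code runs anywhere, and the column names (`plot`, `block`, `variety`, `yield`) are assumptions chosen to match the four column types listed above.

```r
# Write a small made-up data set to a temporary *.csv so the example is
# self-contained. In practice you would give read.csv() the path to your own file.
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "plot,block,variety,yield",
  "1,1,A,4.2",
  "2,1,B,5.1",
  "3,2,A,3.9",
  "4,2,B,5.4"
), csv_file)

mydata <- read.csv(csv_file)

# Simple checks on the data as it comes into R
str(mydata)             # variable names and types
dim(mydata)             # number of rows (units) and columns (variables)
head(mydata)            # first rows of the data
colSums(is.na(mydata))  # number of missing values in each column

# Structural and treatment columns are usually better stored as factors
mydata$block <- factor(mydata$block)
mydata$variety <- factor(mydata$variety)
summary(mydata)
```

Here `plot` is the ID column, `block` is a structural design column, `variety` is a treatment design column and `yield` is the measurement variable.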

Properties of Distributions

In this section we will present some of the important properties of the distributions used in this module.

Normal Distribution

The Normal Distribution is the most important and widely used continuous probability distribution. Many natural phenomena are approximately normally distributed. A normal distribution is fully described by two parameters, its mean and its variance, so to model a population with it both the population mean and variance need to be known (or be calculable).
Credit: Wikipedia
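
To make the role of the mean and variance concrete, here is a small sketch using R's built-in `pnorm()` and `qnorm()` functions. The mean of 100 and standard deviation of 15 are assumed values chosen only for illustration.

```r
# Assumed illustrative parameters (not taken from the course data)
mu <- 100
sigma <- 15

# Proportion of the population within 1, 2 and 3 standard deviations of the mean
within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
within_2sd <- pnorm(mu + 2 * sigma, mu, sigma) - pnorm(mu - 2 * sigma, mu, sigma)
within_3sd <- pnorm(mu + 3 * sigma, mu, sigma) - pnorm(mu - 3 * sigma, mu, sigma)
round(c(within_1sd, within_2sd, within_3sd), 3)

# Value below which 97.5% of the population lies
upper_975 <- qnorm(0.975, mean = mu, sd = sigma)
upper_975
```

Whatever mean and standard deviation are chosen, roughly 68%, 95% and 99.7% of a normal population lies within 1, 2 and 3 standard deviations of the mean.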


The Student’s t-distribution is a continuous probability distribution that allows the modelling of normally distributed variables when the sample size is small and the population variance is unknown. As the sample size increases, the t-distribution approaches the Normal distribution more closely. When looking at the distribution with infinite degrees of freedom, we see that it matches the Normal distribution.

When talking about statistical analyses using the t-distribution we need to state the degrees of freedom of the distribution. This can generally be calculated as the sample size minus the number of parameters that we are estimating. For example, a one-sample t test estimates one mean, so it has n - 1 degrees of freedom.
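
The relationship between the t-distribution, the Normal distribution and the degrees of freedom can be sketched with `qt()` and `qnorm()`. The degrees of freedom and sample sizes below are arbitrary choices for illustration.

```r
# Two-sided 95% critical values of the t-distribution for several
# degrees of freedom, compared with the Normal distribution
df <- c(2, 5, 10, 30, 100)
t_crit <- qt(0.975, df = df)
z_crit <- qnorm(0.975)
data.frame(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))

# Degrees of freedom for a one-sample t test: one mean is estimated
n <- 10
df_one_sample <- n - 1

# Degrees of freedom for a two sample t test with equal variances:
# two means are estimated
n1 <- 8
n2 <- 8
df_two_sample <- n1 + n2 - 2

c(df_one_sample, df_two_sample)
```

The t critical values are always larger than the Normal value and shrink towards it as the degrees of freedom increase, which reflects the extra uncertainty from estimating the variance in small samples.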

Understanding Sampling

When we conduct an experiment we are looking to determine properties of the population of interest. You may think the theoretical ideal would be to run an experiment that studies all the units in the population of interest and then analyse the data collected. However, this is usually not feasible due to the size of the population of interest, equipment availability and cost. Also, knowledge of statistical theory and experimental design allows good inference to be obtained from a well designed, controlled experiment.

Sampling is the approach taken to select experimental units from a population, to produce a smaller group of units that is representative of the larger population. So the sample should have similar properties and proportions to the larger population. The statistics calculated from the sample can then be used to infer values for the parameters of the population. A sample mean is an example of a statistic and a population mean is an example of a parameter.
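
The difference between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the "population" of 10,000 plant heights is simulated, and its mean (50) and standard deviation (8) are assumed values.

```r
set.seed(1)  # make the random draws reproducible

# A made-up population of 10,000 plant heights (cm)
population <- rnorm(10000, mean = 50, sd = 8)
pop_mean <- mean(population)    # a parameter of the population

# A simple random sample of 30 units, drawn without replacement
my_sample <- sample(population, size = 30)
sample_mean <- mean(my_sample)  # a statistic calculated from the sample

c(parameter = pop_mean, statistic = sample_mean)

# Repeating the sampling many times shows how the statistic varies
sample_means <- replicate(1000, mean(sample(population, size = 30)))
mean(sample_means)  # close to the population mean
sd(sample_means)    # close to 8 / sqrt(30), the standard error of the mean
```

Each random sample gives a slightly different sample mean, but the sample means cluster around the population mean.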

It is very important that the sample is representative of the population of interest. Therefore, once we have our research question, we must consider the design of the experiment and whether any factors need to be accounted for.

Principles of Experimental Design

Let’s think about the three principles (3 R’s) of experimental design:

Randomisation, Replication and Reduction of Experimental error.

Here are some notes to help explain these principles: Design Notes

Here is a presentation as a pdf: Principles of experimental design

We also suggest watching the screencast below, and the video in Module 2 on Developing a research question for your study.

Let’s now consider our Research Question

In your work you will have arrived at a point where there is a need to run an experiment. It is vital to be able to describe the aim of the experiment to a lay person or a high school student in such a way that you can clearly express the research question. Can you do this?
Watch this short screencast on the Principles of Experimental Design. We will develop these ideas further in Modules 2 and 3.

Research questions

Look at this PowerPoint: Developing Research Questions

Thinking about experimental units. Do you understand what an experimental unit is?

In the case of an experiment in the field, the experimental unit is the smallest unit to which you apply the treatment. If we plant different varieties in different plots, the plots are considered independent. In experiments with two levels of plot size (such as split plot designs), we still consider the experimental unit to be the smallest unit that can be considered independent (at the subplot level), but we also have main plots in such a design.
Pseudo-replication must be avoided because pseudo-replicates are NOT replicates. With “pseudo-replication” and “technical replication”, multiple samples are measured within an experimental unit, and these measurements are not independent.
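
One common way to handle several measurements per experimental unit is to average them to one value per unit before analysis. The sketch below assumes a made-up data set with 4 plots and 3 plants measured in each plot; the column names are hypothetical.

```r
# Made-up data: 4 plots (experimental units), 3 plants measured per plot
plants <- data.frame(
  plot = rep(1:4, each = 3),
  variety = rep(c("A", "B", "A", "B"), each = 3),
  height = c(50, 52, 51, 60, 61, 59, 48, 49, 50, 62, 63, 61)
)

nrow(plants)  # 12 measurements, but only 4 independent experimental units

# Average the within-plot measurements to give one value per plot
plot_means <- aggregate(height ~ plot + variety, data = plants, FUN = mean)
plot_means
nrow(plot_means)  # 4 rows, one per experimental unit
```

Treating the 12 plant measurements as 12 replicates would overstate the replication; the analysis should be based on the 4 plot values.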

Thinking about exploratory analysis in R

Exploratory Analysis (Presentation as pdf)

Thinking about developing hypotheses for your statistical test

These slides show how developing a hypothesis is a first step in thinking about testing differences using statistical tests.

Hypothesis Testing (as pdf)


A simple t test

What type of t test are we undertaking? A two sample t test is the simplest design we can use.

We assume the plots are independent and we assume the treatments have been randomly allocated to the plots of land. The plots are adjacent within the field.
Are the t test assumptions satisfied? You must have randomised these treatments. In this situation there should be no environmental gradient (such as a soil gradient) in the field.
Let’s undertake a t test.

Watch this screencast on the two sample t test

Now try and run this in R for yourself, following this demonstration

T Test demonstration (pdf)

Watch the screencast and then look at the two files of R code. We are introducing one of the simplest experiments. We first explain the process and then we will show the R code so you can try it yourself.

Here are two files (these are two pdfs to look at, and then you can try to run the code for yourself)

The first shows the R code in RStudio to randomise a simple two treatment experiment

The second shows the R code and output as run in RStudio to analyse the simple two treatment experiment
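
The two steps can be sketched roughly as follows. This is not the code from the pdfs: the treatment names, the 12 plots and the simulated yields are all assumptions made for illustration. Note also that `t.test()` uses the Welch (unequal variance) test by default, so `var.equal = TRUE` is set here to get the classical two sample t test.

```r
set.seed(42)  # make the randomisation and simulated data reproducible

# Step 1: randomise two treatments to 12 plots, 6 plots each
treatments <- rep(c("Control", "Fertiliser"), each = 6)
design <- data.frame(
  plot = 1:12,
  treatment = factor(sample(treatments))  # random order of the treatments
)
table(design$treatment)

# Step 2: analyse the experiment.
# Simulated yields stand in for the real measurements (assumed means and SD)
design$yield <- rnorm(
  12,
  mean = ifelse(design$treatment == "Fertiliser", 5.5, 5.0),
  sd = 0.4
)

# Two sample t test, assuming equal variances in the two groups
result <- t.test(yield ~ treatment, data = design, var.equal = TRUE)
result

result$parameter  # degrees of freedom: 12 plots - 2 estimated means = 10
```

With real data, the `yield` column would be replaced by the measurements recorded on each plot, in the randomised order produced in step 1.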

Simple summary statistics

We have a named data frame, which we have generally imported into R

head(mydata)   ## will print the top 6 rows of data in your data frame

tail(mydata)   ## will print the last 6 rows of data

We can also refer to variables or factors within a data frame


The dollar sign and the variable name after the dataset name allow us to refer to one variable within a data set

We may like to get the summary for our numeric values
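
As an illustration of the dollar-sign notation and of numeric summaries, here is a minimal sketch. The data frame `mydata` below is made up so the code runs on its own, and the column names `variety` and `yield` are hypothetical.

```r
# Small made-up data frame standing in for an imported data set
mydata <- data.frame(
  plot = 1:6,
  variety = c("A", "B", "A", "B", "A", "B"),
  yield = c(4.2, 5.1, 3.9, 5.4, 4.0, 5.0)
)

names(mydata)       # the variables in the data frame
mydata$yield        # one variable, referred to as dataset$variable
mean(mydata$yield)  # use it like any other numeric vector

summary(mydata$yield)  # minimum, quartiles, median, mean and maximum
sd(mydata$yield)       # standard deviation

# Mean yield for each variety
group_means <- tapply(mydata$yield, mydata$variety, mean)
group_means
```

Calling `summary(mydata)` on the whole data frame gives a summary of every column at once.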



We will discuss factors in Module 2

Plotting in R a short introduction

There are several basic plots in R. Take a tour through this presentation to see what you can achieve in R. Download this file to look at some aspects of plotting. Later, in Module 6, we will support you in using the plotting packages in R. The most used are library(lattice) and library(ggplot2)
Plotting Demonstration (pdf)
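
A few of the basic plots can be tried with base R alone, as in the sketch below. The data frame is made up for illustration and the column names are hypothetical.

```r
# Made-up data: two varieties with plant height and yield
mydata <- data.frame(
  variety = rep(c("A", "B"), each = 5),
  height = c(50, 52, 51, 49, 53, 60, 61, 59, 62, 58),
  yield = c(4.2, 4.5, 4.3, 4.0, 4.6, 5.1, 5.3, 5.0, 5.4, 4.9)
)

# Histogram: the distribution of one numeric variable
hist(mydata$yield, main = "Histogram of yield", xlab = "Yield")

# Boxplot: compare a numeric variable between groups
boxplot(yield ~ variety, data = mydata, xlab = "Variety", ylab = "Yield")

# Scatterplot: the relationship between two numeric variables
plot(mydata$height, mydata$yield, xlab = "Plant height", ylab = "Yield")
```

In RStudio the plots appear in the Plots pane, where they can be exported as an image or pdf.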


Module 1 Knowledge Test (5 multiple choice questions)

Take the post module test

Click here to take a short test

