Using RStudio for basic statistics
Aims and objectives for module 1
- Understand sampling and the simple principles of experimental design
- Read in a data file (*.csv)
- Run some summary statistics
- Understand the simple concepts of experimental design
- Design a simple experiment and test for a difference in two means
- Open a Project in R
- Become familiar with using R and RStudio for simple statistics, and run a t test
- Review some plots available in R (later in Module 6 we will provide code and examples)
This course is self-paced, free, and available 24 hours a day, 7 days a week. It is being developed in response to scientists’ needs. If you like it, please send us feedback; we hope to make it especially useful with feedback and suggestions from our users. We hope students, early career scientists, lecturers, and tutors will all find the materials useful. TIP: If you haven’t looked at Module 0, try to look at it first to get you started with loading R on your computer.
Contents
- Register for the course and answer a few questions
- Introduction
- How to open a Project in R for this module
- Properties of distributions
- Research Questions and developing hypotheses
- Understanding Sampling
- Principles of Experimental Design
- Thinking about your experimental units
- Thinking about developing hypothesis tests for your statistical test
- A simple study – a simple t test
- Producing some summary statistics
- Simple plotting in R
Register for the course and answer a few questions
(Please take this simple survey) Also please become a member, and we will email you from time to time.
To allow us to gather some data on the users of the course we would appreciate it if you would fill out a short Pre-Knowledge Survey. At the end of the module we will use similar questions to allow you to see how your skills have improved.
Introduction
Before you start Module 1 you will need to install R and RStudio to be able to complete the demonstrations and tasks – This only needs to be done once. For how to install R and RStudio please refer to Module 0.
After you have installed the software you will be able to Visualise, Summarise and Analyse data freely. In Module 0 there are also guides to assist in reading in data, using Excel and basic tasks with R. These are considered to be the prerequisites for Module 1 and so will need to be completed before going any further.
When you are ready, try downloading the demo documents (provided as pdf) and working through the commands in R. Also look under technical asides (under the Support menu above) and make sure you have a clear idea of the measures of central tendency and measures of spread for data distributions. Make sure you can calculate the mean and the standard deviation of a small sample.
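As a quick check of these measures, here is a minimal sketch in R. The five values are made up for illustration; the hand calculation is shown next to the built-in functions so you can see that they agree.

```r
# Hypothetical small sample (made-up values, for illustration only)
x <- c(4, 6, 8, 10, 12)

# Measures of central tendency
mean_x   <- mean(x)     # arithmetic mean
median_x <- median(x)   # middle value

# Measures of spread, computed by hand
n     <- length(x)
var_x <- sum((x - mean_x)^2) / (n - 1)  # sample variance
sd_x  <- sqrt(var_x)                    # sample standard deviation

# The built-in functions should agree with the hand calculation
all.equal(var_x, var(x))
all.equal(sd_x, sd(x))
```

Note that the sample variance divides by n - 1 rather than n, which is what var() and sd() do in R.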
How to open a Project in RStudio for this Module
Bringing data into R and some simple checks
Data can be saved in Excel, and then saved as a *.csv which is easier to bring into R.
In Module 0 you learned how to bring in data as a *.csv file.
Here we look at the data as it comes into R and check it.
Each line in the data set should be an individual plot or experimental unit, and the columns are the variables.
In addition we need to understand whether the data has all or some of the following:
- An identity (ID) column (this could be the pot number in a glasshouse study, a plot in a field study, a household ID in a survey, or the pen in a livestock study)
- Structural design columns – we mean here blocks or rows and columns
- Treatment design columns – we mean here the treatments we are interested in (e.g. crop variety)
- Our measurement variables – This could be yield, plant height, animal weight, household size etc
This will become clearer when we have designed and run some experiments.
Use this demonstration to import data into R: Importing data into R (pdf)
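The checks described above can be sketched as follows. This is only an illustration: the column names (plot, block, variety, yield) and the values are made up, and the file is written to a temporary location so that the example runs on its own. In practice you would point read.csv() at your own *.csv file.

```r
# Write a small made-up data set to a temporary *.csv so the example is self-contained
example_file <- tempfile(fileext = ".csv")
example_data <- data.frame(
  plot    = 1:6,                              # identity (ID) column
  block   = c(1, 1, 2, 2, 3, 3),              # structural design column
  variety = c("A", "B", "A", "B", "A", "B"),  # treatment design column
  yield   = c(2.1, 2.8, 2.4, 3.0, 1.9, 2.7)   # measurement variable
)
write.csv(example_data, example_file, row.names = FALSE)

# Bring the data into R
mydata <- read.csv(example_file)

# Simple checks
str(mydata)             # column names and types
head(mydata)            # first rows
dim(mydata)             # number of rows (units) and columns (variables)
colSums(is.na(mydata))  # number of missing values in each column
```

Each row is one experimental unit, so dim(mydata) should report as many rows as you have plots.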
Properties of Distributions
In this section we will present some of the important properties of the distributions used in this module. A distribution describes the shape of the data; drawn as a frequency plot, it shows where we expect to find the data.
Normal Distribution
The Normal Distribution is the most important and widely used continuous probability distribution. Many natural phenomena are approximately normally distributed. The normal distribution is used to model populations, and it is described by two parameters, the population mean and the population variance, which need to be known (or calculated).
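To get a feel for the Normal distribution in R, here is a short sketch using the standard Normal (mean 0, standard deviation 1). It uses only base R functions; the choice of the standard Normal is ours, for illustration.

```r
# Height of the standard Normal curve at its centre
dnorm(0)

# Proportion of a Normal population within 1 and 2 standard deviations of the mean
within_1sd <- pnorm(1) - pnorm(-1)   # about 0.68
within_2sd <- pnorm(2) - pnorm(-2)   # about 0.95

# Draw the bell-shaped curve
curve(dnorm(x), from = -4, to = 4,
      xlab = "Standard deviations from the mean", ylab = "Density")
```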
T-Distribution
The Student’s t-distribution is a continuous probability distribution that allows the modelling of normally distributed variables with small sample sizes and unknown population variance. As the sample size increases, the t-distribution approaches the Normal Distribution more closely. With infinite degrees of freedom it coincides with the Normal distribution, and with large degrees of freedom it approximates it very closely.
When talking about statistical analyses using the t-distribution we need to state the degrees of freedom of the distribution. This can generally be calculated as the sample size minus the number of parameters that we are estimating.
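A small sketch of how the degrees of freedom change the t-distribution. The sample sizes are made up, and the degrees-of-freedom calculation shown is for a two-sample t test with equal variances, where two means are estimated.

```r
# Values cutting off the upper 2.5% of each distribution
t_5   <- qt(0.975, df = 5)     # small sample: heavier tails, larger value
t_30  <- qt(0.975, df = 30)
t_Inf <- qt(0.975, df = Inf)   # infinite degrees of freedom
z     <- qnorm(0.975)          # Normal distribution

c(t_5 = t_5, t_30 = t_30, t_Inf = t_Inf, z = z)

# Degrees of freedom for a two-sample t test with equal variances:
# total sample size minus the two means being estimated
n1 <- 10
n2 <- 10
df_two_sample <- n1 + n2 - 2
```

The values shrink towards the Normal value as the degrees of freedom grow, which is why small experiments need a larger difference before it is declared significant.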
Understanding Sampling
When we conduct an experiment we are looking to determine properties of the population of interest. You may think the theoretical ideal would be to study all the units in the population of interest and then analyse the data collected. However, this is usually not feasible due to the size of the population of interest, equipment availability and cost. In addition, statistical theory and good experimental design allow sound inferences to be obtained from a well designed, controlled experiment.
Sampling is the approach taken to select experimental units from a population, to produce a smaller group of units that is representative of the larger population. So the sample should have similar properties and proportions to the larger population. The statistics calculated from the sample can then be used to infer values for the parameters of the population. A sample mean is an example of a statistic and a population mean is an example of a parameter.
It is very important that the sample is representative of the population of interest. Therefore, once we have our research question we must consider the design of the experiment and whether any factors need to be accounted for.
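The difference between a parameter and a statistic can be sketched with simulated data. Everything here is made up: we pretend the population is 10,000 plant heights, which in a real study we would never be able to measure in full.

```r
set.seed(1)  # only so that the example gives the same numbers each time

# A simulated population of 10,000 plant heights (cm)
population <- rnorm(10000, mean = 50, sd = 8)

# Simple random sample of 30 units, drawn without replacement
my_sample <- sample(population, size = 30)

mean(population)  # the parameter (normally unknown)
mean(my_sample)   # the statistic we use to estimate it

# Standard error of the sample mean
se <- sd(my_sample) / sqrt(length(my_sample))
se
```

Re-running the sample() line without the seed gives a different sample mean each time; the standard error describes how much these sample means vary.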
Principles of Experimental Design
Let’s think about the three principles (3 R’s) of experimental design:
Randomisation, Replication and Reduction of Experimental error.
Here are some notes to help explain these principles: Design Notes
Here is a presentation as a pdf: Principles of experimental design
We also suggest watching the screencast below, and the video in Module 2 on Developing a research question for your study.
Let’s now consider our Research Question
In your work you will have arrived at a point where there is a need to run an experiment. It is vital to be able to describe the aim of the experiment to a lay person or a high school student in such a way that you can clearly express the research question. Can you do this?
Watch this short screencast on the Principles of Experimental Design. We will develop these ideas further in Modules 2 and 3.
Research questions
Look at this PowerPoint: Developing Research Questions
Thinking about experimental units. Do you understand what an experimental unit is?
In the case of an experiment in the field, the experimental unit is the smallest unit to which you apply the treatment. If we plant different varieties in different plots the plots are considered independent. In experiments with two levels of plot size (such as split plot designs) we still consider the experimental unit to be the smallest unit that can be considered independent, at the subplot level, but we also have main plots in such a design.
Pseudo-replication must be avoided because pseudo-replicates are NOT replicates. In “pseudo-replication” and “technical replication”, multiple samples are measured within one experimental unit, so these measurements are not independent.
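One common way to handle several measurements taken within the same experimental unit is to reduce them to a single value per unit before analysis. The sketch below assumes a made-up glasshouse study with 4 pots and 3 leaves measured per pot; the names (leaves, pot, leaf_n) are hypothetical.

```r
# Made-up data: 4 pots (experimental units), 3 leaves measured in each pot
leaves <- data.frame(
  pot       = rep(1:4, each = 3),
  treatment = rep(c("control", "fertiliser"), each = 6),
  leaf_n    = c(2.1, 2.3, 2.2, 2.0, 1.9, 2.1, 2.8, 2.9, 3.0, 2.6, 2.7, 2.8)
)

nrow(leaves)  # 12 measurements, but NOT 12 independent replicates

# One value per pot: the mean of the leaves measured within it
pots <- aggregate(leaf_n ~ pot + treatment, data = leaves, FUN = mean)
pots

nrow(pots)             # 4 experimental units
table(pots$treatment)  # 2 true replicates per treatment
```

Analysing the 12 leaf values as if they were independent would overstate the replication; the analysis should be based on the 4 pots.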
Thinking about exploratory analysis in R
Exploratory Analysis (Presentation as pdf)
Thinking about developing hypothesis tests for your statistical test
These slides show how developing a hypothesis is a first step in thinking about testing differences using statistical tests.
A simple t test
What type of t test are we undertaking? A two sample t test is the simplest design we can use.
We assume the plots are independent and that the treatments have been randomly allocated to the plots of land. The plots are adjacent within the field.
Are the t test assumptions satisfied? You must have randomised these treatments. In this situation there should be no environmental gradient (such as a soil gradient) in the field.
Let’s undertake a t test. Here is the pdf on t tests T tests (P
Watch this screencast on the two sample t test
Now try and run this in R for yourself, following this demonstration
Watch the screencast and then look at the two files of R code. We are introducing one of the simplest experiments. We first explain the process and then show the R code so you can try it yourself.
Here are two files (these are two pdfs to look at, and then you can try to run the code for yourself)
The first shows the R code in RStudio to randomise a simple two treatment experiment
Ttest_randomisation
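If you cannot open the pdf, the idea of the randomisation can be sketched as below. This is not the code from the course file; it assumes 10 plots and two hypothetical treatments, A and B, with 5 plots each.

```r
set.seed(2015)  # fix the seed only so that the example is reproducible

plots      <- 1:10
treatments <- rep(c("A", "B"), each = 5)  # 5 replicates of each treatment

# sample() shuffles the treatment labels into a random order across the plots
layout <- data.frame(plot = plots, treatment = sample(treatments))
layout

# Check that each treatment still appears 5 times
table(layout$treatment)
```

In a real experiment you would leave out set.seed(), or record the seed you used, so that the allocation is genuinely random but can still be documented.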
The second shows the R code and output as run in RStudio to analyse the simple two treatment experiment
Ttest_analysis
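As a rough guide to what such an analysis looks like, here is a minimal sketch of a two-sample t test. It is not the code from the course file: the yields are made up, and we assume equal variances in the two treatments, which is the classical form of the test.

```r
# Made-up yields (t/ha) for 5 plots of each of two hypothetical treatments
trial <- data.frame(
  treatment = rep(c("A", "B"), each = 5),
  yield     = c(4.1, 3.8, 4.4, 4.0, 4.2, 4.9, 5.1, 4.6, 5.0, 4.8)
)

# Treatment means
tapply(trial$yield, trial$treatment, mean)

# Two-sample t test assuming equal variances.
# Leaving out var.equal = TRUE gives Welch's test, which is R's default.
result <- t.test(yield ~ treatment, data = trial, var.equal = TRUE)
result

result$parameter  # degrees of freedom: 5 + 5 - 2 = 8
result$p.value    # probability of a difference this large if the means were equal
```

The null hypothesis is that the two treatment means are equal; a small p-value is evidence against it.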
Simple summary statistics
We start with a named data frame (here called mydata), which we have generally imported into R
head(mydata) ## will print the top 6 rows of data in your data frame
tail(mydata) ## will print the last 6 rows of data
We can also refer to variables or factors within a dataframe
mydata$variable1
The dollar sign and the variable name after the dataset name allows us to refer to one variable within a data set
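We may like to get the summary for our numeric values
summary(mydata) ## base R: minimum, quartiles, median, mean and maximum for each numeric column
library(psych) ## the psych package must be installed first
describe(mydata) ## from psych: n, mean, sd, median, range and more for each column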
We will discuss factors in Module 2
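To try the commands above without your own data, here is a self-contained sketch. The data frame and its columns (treatment, height) are made up and simply stand in for the data you would import.

```r
# Made-up data frame standing in for your imported data
mydata <- data.frame(
  treatment = rep(c("A", "B"), each = 4),
  height    = c(12.1, 11.8, 12.6, 12.3, 13.4, 13.9, 13.1, 13.6)
)

head(mydata, 3)         # first 3 rows
summary(mydata$height)  # five-number summary plus the mean
mean(mydata$height)     # overall mean
sd(mydata$height)       # overall standard deviation

# Mean height for each treatment
group_means <- tapply(mydata$height, mydata$treatment, mean)
group_means
```

tapply() applies a function (here mean) to one variable separately for each level of another, which is a quick way to compare treatments.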
Plotting in R a short introduction
There are several basic plots in R. Take a tour through this presentation to see what you can achieve in R. Download this file to look at some aspects of plotting. Later in Module 6 we will be supporting you to use the plotting packages in R. The most used are library(lattice) and library(ggplot2).
Plotting Demonstration (pdf)
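Before moving on to the plotting packages, here are three basic plots that need only base R. The data are simulated for illustration, and the column names (treatment, height) are our own.

```r
set.seed(3)  # only so that the example is reproducible

# Simulated heights (cm) for 10 plants in each of two hypothetical treatments
mydata <- data.frame(
  treatment = factor(rep(c("A", "B"), each = 10)),
  height    = c(rnorm(10, mean = 12, sd = 1), rnorm(10, mean = 14, sd = 1))
)

# Histogram: the shape of the distribution of all heights
h <- hist(mydata$height, main = "Plant height", xlab = "Height (cm)")

# Boxplot: compare the two treatments side by side
boxplot(height ~ treatment, data = mydata,
        xlab = "Treatment", ylab = "Height (cm)")

# Strip chart: show every individual observation
stripchart(height ~ treatment, data = mydata,
           vertical = TRUE, method = "jitter", pch = 16,
           xlab = "Treatment", ylab = "Height (cm)")
```

With small samples it is often worth plotting the individual points, as in the strip chart, rather than relying only on a summary such as the boxplot.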
Module 1 Knowledge Test (5 multiple choice questions)
Take the post module test