Basic definitions

Types of data Data may be categorical or continuous. Categorical data may be nominal (unordered categories) or ordinal (ordered categories), and numerical data may be discrete or continuous. Count data allow us to look at categorical by categorical associations.

Central tendency of data The mean, mode and median each describe the central tendency of a dataset.

Deviation from a mean is the difference between an individual value and its sample mean. The average of the deviations (d̄) from the mean equals zero.

\bar{d} = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x}) = 0
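As a quick check of this property, here is a minimal Python sketch. The five sample values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from statistics import mean

# Hypothetical sample values, chosen only for illustration.
values = [4.0, 7.0, 9.0, 10.0, 15.0]

x_bar = mean(values)                      # sample mean
deviations = [x - x_bar for x in values]  # each value minus the sample mean
mean_deviation = sum(deviations) / len(deviations)

print(x_bar)           # 9.0
print(deviations)      # [-5.0, -2.0, 0.0, 1.0, 6.0]
print(mean_deviation)  # 0.0
```

The negative and positive deviations cancel, which is why the deviations are squared before being used to measure spread.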

Measures of spread of the data The range, the variance, the IQR and the standard deviation are measures of the spread, or variability, in a dataset.

Mean to get the mean, also known as the average, add all the values and divide by the number of values in the sample.

Mode the most frequent value. It is the value that appears most often. It might make sense in a question as to the most popular make of tractor. It is rarely used for measurement data.

Median is the middle number in a sorted list of data. One of its properties is that it is not strongly affected by outliers. The median is also the second quartile (Q2): 50% of the ordered data is smaller than Q2.

Minimum the smallest value in the sample.

Maximum the largest value.

Inference is the use of data typically from an experiment (evidence) to infer something about a population.

Interquartile range IQR = Quartile 3 – Quartile 1.

IQR the interquartile range, Quartile 3 – Quartile 1 (Q3 - Q1); see the definition of quartiles.

Outlier an extremely small or extremely large value that is distant from, or does not seem to fit in with, the other values. Some software will identify outliers, but whether a value is actually unusual is to some extent arbitrary. Arithmetic rules can be used to flag outliers, but they are not necessarily helpful.

Quantiles the division of a frequency distribution of data into equal quantities, known generically as quantiles. We distinguish deciles (dividing into 10), quartiles (dividing into 4) and percentiles (dividing into 100).

Quartile the division of the data, once in order, into 4 sections, each consisting of a quarter of the data. The quartiles are Q1, median and Q3.

Range is the maximum minus the minimum.

Deviation from the mean is the amount an individual value is above or below a mean.

Residual the difference between an individual value and its fitted value (eg the estimate of a particular mean in ANOVA or a fitted line in Regression).

Sums of Squares the deviations from the mean, squared and then added up. It is the top part (also known as the numerator) of the variance formula.

Sample variance (s²) the sum of squares (the squared deviations of the individual values from the mean, added up) divided by n-1.

s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \ldots + (x_n-\bar{x})^2}{n-1}

s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2
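The variance formula can be followed step by step in a short Python sketch. The sample values are hypothetical, and the standard library function statistics.variance is used only as a cross-check of the hand calculation.

```python
from statistics import mean, variance

# Hypothetical sample values, chosen only for illustration.
values = [4.0, 7.0, 9.0, 10.0, 15.0]

n = len(values)
x_bar = mean(values)

# Sum of squares: squared deviations from the mean, added up (the numerator).
sum_of_squares = sum((x - x_bar) ** 2 for x in values)

# Sample variance: sum of squares divided by the degrees of freedom, n - 1.
s2 = sum_of_squares / (n - 1)

print(sum_of_squares)    # 66.0
print(s2)                # 16.5
print(variance(values))  # 16.5, the library result agrees
```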

Degrees of Freedom (df) This represents the amount of independent information in a group of values. Think about this example: you have 5 values (x1, x2, x3, x4 and x5) and you know the sample mean (x̄). If one value is taken away, it can be worked out from the other 4 values and the mean, so only 4 of the values are free to vary. So df = n-1.

Standard Deviation the standard deviation (sd) is the square root of the variance. The variance is in squared units (eg if the response variable is yield, the variance is in units of yield²), so taking the square root returns the measure to the original units.

s = \sqrt{s^2}

Standard error of the mean (SEM) It is the standard deviation of the sampling distribution of the mean.

\mbox{standard error of mean} = \frac{s}{\sqrt{n}}
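A minimal Python sketch of the standard deviation and the standard error of the mean, using a hypothetical sample and only the standard library:

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample values, chosen only for illustration.
values = [4.0, 7.0, 9.0, 10.0, 15.0]

n = len(values)
s = stdev(values)   # sample standard deviation, the square root of s^2
sem = s / sqrt(n)   # standard error of the mean, s / sqrt(n)

print(round(s, 2))    # 4.06
print(round(sem, 2))  # 1.82
```

The standard error is smaller than the standard deviation because it describes the variability of the sample mean rather than of the individual values.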

Parameters values for a population, such as the population mean μ and population variance σ². They are usually denoted by Greek letters.

Statistics values calculated from a sample, such as the sample mean, the sample variance and the sample size. Those that estimate parameters are denoted by Roman letters, such as s² for σ².

Statistical significance assesses the probability that a result in a sample could be due to sampling error alone. A result is said to be statistically significant if it would be unlikely to occur by chance alone when the null hypothesis is true.

Notation using Greek letters is used for the population. Eg the population mean (μ).

Notation using Roman letters is used for the statistics. Eg the sample mean (x̄).

Sample mean x̄ (we pronounce this as ’x bar’). We use this notation to denote a sample mean.

Normal distribution a symmetric, bell-shaped distribution defined by two parameters, the mean (μ) and the standard deviation (σ). It is written in short form as X ∼ N(μ, σ²). The properties of the standard normal distribution enable us to use its characteristics for working with sample data.

Standard Normal is a normal distribution with a mean of zero and a variance of 1. In short form, a standardised variable is written as coming from the distribution Z ∼ N(0, 1). Remember both σ² and σ are 1 for this distribution (as the square root of 1 is 1).
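To illustrate how a value from any normal distribution relates to the standard normal, the sketch below standardises one value with z = (x - μ) / σ. The population mean, standard deviation and the value x are all hypothetical; NormalDist comes from the Python standard library.

```python
from statistics import NormalDist

# Hypothetical population parameters and observed value, for illustration only.
mu, sigma = 50.0, 10.0
x = 65.0

# Standardise: how many standard deviations x lies from the mean.
z = (x - mu) / sigma

# Probability of a value below z under the standard normal N(0, 1) ...
p_below_standard = NormalDist(mu=0.0, sigma=1.0).cdf(z)

# ... equals the probability of a value below x under N(mu, sigma^2).
p_below_original = NormalDist(mu=mu, sigma=sigma).cdf(x)

print(z)                           # 1.5
print(round(p_below_standard, 3))  # 0.933
print(round(p_below_original, 3))  # 0.933
```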

Observational studies Observations are simply recorded and can be summarised with descriptive statistics. There is no formal experimental design, so cause and effect cannot be established.

Experimental studies Studies that have been designed according to the principles of experimental design, typically with two or more groups, and with the intent to use statistical inference in finding out cause and effect.

Administrative data Data on a business or government database that is collected as part of a customer or client database, typically used for mailing, accounting or other purposes. It is not set up for doing a controlled experiment. It typically has a lot of missing data and may not have clear definitions or validation. It is used in some studies, but caution is required as it may not be fit for purpose.

Frame a list, map, or conceptual specification of the people or other units comprising the survey population from which respondents can be selected. Examples include a telephone or city directory, or a list of members of a particular association or group.

The five number summary consists of the minimum and maximum of the data and the Q1, median (the same as Q2) and Q3 values.
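The five number summary, range and IQR can be sketched in Python as follows. The data are hypothetical. Software packages use several different conventions for calculating quartiles, so the "inclusive" method chosen here is an assumption and other tools may give slightly different Q1 and Q3 values.

```python
from statistics import quantiles

# Hypothetical sample values, chosen only for illustration.
values = [4.0, 7.0, 9.0, 10.0, 15.0]

# Cut points that divide the ordered data into 4 equal parts.
# method="inclusive" is one of several quartile conventions (an assumption here).
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

five_number = (min(values), q1, q2, q3, max(values))
data_range = max(values) - min(values)  # maximum minus minimum
iqr = q3 - q1                           # Quartile 3 minus Quartile 1

print(five_number)  # (4.0, 7.0, 9.0, 10.0, 15.0)
print(data_range)   # 11.0
print(iqr)          # 3.0
```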

Statistical Testing

Steps to conduct a statistical test

  1. Choose a suitable test
  2. Check that your data meet the assumptions of the test
  3. State the null and alternate hypotheses
  4. Calculate the test statistic by hand or, preferably, with software
  5. Compare your calculated test statistic with the critical value at a given level of significance (typically 0.05) and decide whether it falls in the rejection region under the null hypothesis. Equivalently, obtain the p value for your test statistic: if p < 0.05, reject the null hypothesis in favour of the alternate hypothesis; otherwise, do not reject the null hypothesis.
  6. State your conclusion for the test
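The steps above can be sketched for a one-sample t test. Everything specific here is an assumption for illustration: the sample values, the hypothesised mean of 5, and the two-sided 5% significance level. The critical value 2.776 is the tabulated t value for 4 degrees of freedom at that level; in practice software would supply it, along with a p value.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample and hypothesised population mean, for illustration only.
values = [4.0, 7.0, 9.0, 10.0, 15.0]
mu_0 = 5.0            # H0: population mean = 5; Ha: population mean != 5

# Tabulated critical value: two-sided test, alpha = 0.05, df = n - 1 = 4.
T_CRITICAL = 2.776

n = len(values)
se = stdev(values) / sqrt(n)          # standard error of the mean
t_calc = (mean(values) - mu_0) / se   # calculated test statistic

# Reject H0 only if the calculated value falls in the rejection region.
reject_null = abs(t_calc) > T_CRITICAL

print(round(t_calc, 2))  # 2.2
print(reject_null)       # False
```

Here the calculated t value is smaller than the critical value, so the conclusion is that there is not enough evidence to reject the null hypothesis.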

Confidence Interval (CI) of a sample mean To calculate a confidence interval you need the sample mean from a sample of size n, the standard error of the mean, and the t value (at a given significance level) for the degrees of freedom of that sample. Since it is an interval estimate there is a lower bound and an upper bound on the interval. We calculate a 95% confidence interval by:

\bar{x} \ \pm \ t_{(df, {0.025})} \ {se}

The confidence interval gives a lower and an upper bound on plausible values for the true population value. We typically use a 95 percent confidence interval. To interpret it: if we repeated the experiment many times and calculated a 95% confidence interval each time, about 95 out of 100 (19 in 20) of those intervals would contain the true mean.
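A minimal Python sketch of the 95% confidence interval calculation for a hypothetical sample of 5 values. The t value 2.776 is the tabulated value for 4 degrees of freedom with 0.025 in each tail; it is hard-coded here as an assumption, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample values, chosen only for illustration.
values = [4.0, 7.0, 9.0, 10.0, 15.0]

# Tabulated t value for df = n - 1 = 4 with 0.025 in each tail.
T_VALUE = 2.776

n = len(values)
x_bar = mean(values)
se = stdev(values) / sqrt(n)       # standard error of the mean

margin_of_error = T_VALUE * se     # amount added to and subtracted from the mean
lower = x_bar - margin_of_error
upper = x_bar + margin_of_error

print(round(lower, 2), round(upper, 2))  # 3.96 14.04
```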

Coefficient of Variation The Coefficient of Variation (CV) is the ratio of the standard deviation to the mean, generally expressed as a percentage. As a measure of dispersion it is dimensionless, so it can be used to compare studies conducted in different units.

CV = \Bigl(\frac{s}{\bar{x}}\Bigr) \times 100
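As a small illustration, the CV for a hypothetical sample can be calculated in Python as:

```python
from statistics import mean, stdev

# Hypothetical sample values, chosen only for illustration.
values = [4.0, 7.0, 9.0, 10.0, 15.0]

# Standard deviation as a percentage of the mean.
cv_percent = stdev(values) / mean(values) * 100

print(round(cv_percent, 1))  # 45.1
```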

Experimental units The individual units of an experiment to which the experimental treatments are applied. The total number of experimental units = replicates × treatments in a simple balanced one-way ANOVA.

Research Question a one or two sentence description, in lay terms, of the question you are asking in your experiment.

Statistical question Converting the research question into statistical hypotheses, carefully defining the null and alternate hypotheses.

Significance level The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis H0, if it is in fact true.

P value The probability, calculated assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. It allows us to make a decision about our hypothesis. A high p-value tells us that the data are consistent with H0, and a low p-value is evidence against H0, suggesting that a difference exists.

Null hypothesis H0 in a statistical test, the hypothesis which states there is no difference between specified populations, any observed difference being due to sampling or experimental error.

Alternate hypothesis Ha is the hypothesis that sample observations are influenced by some non-random cause, so that there are differences between the levels of the factors of interest. In ANOVA it states that at least one mean is different.

Margin of Error In a confidence interval, the amount added to and subtracted from the sample statistic to give the upper and lower bounds is called the margin of error.

t tests and ANOVA

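t distribution The sampling distribution of the t statistic, used in place of the normal distribution when the population standard deviation is estimated from a small sample; see the diagram in Module 1.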

t statistic the test statistic of a t-test. A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution if the null hypothesis is true.

t test There are three types of t test: one sample, two sample and paired t tests. A t-test is a statistical test about one or two population means. A two-sample t-test examines whether the means of two populations differ and is commonly used when the variances of the two normal distributions are unknown and the experiment uses a small sample size.

Pooled variance (s_p²) is a single estimate of variance obtained by combining the variances of two samples. It is used if the two samples are assumed to have equal variances.
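The pooled variance weights each sample variance by its degrees of freedom: s_p² = ((n1 - 1) s1² + (n2 - 1) s2²) / (n1 + n2 - 2). This formula is not given in the text above; it is the usual textbook form. A minimal Python sketch with two hypothetical samples:

```python
from statistics import variance

# Two hypothetical samples, assumed to come from populations with equal variance.
sample_1 = [4.0, 7.0, 9.0, 10.0, 15.0]
sample_2 = [6.0, 8.0, 10.0, 12.0]

n1, n2 = len(sample_1), len(sample_2)
s2_1 = variance(sample_1)  # sample variance of the first sample
s2_2 = variance(sample_2)  # sample variance of the second sample

# Weight each variance by its degrees of freedom, then divide by the total df.
pooled_variance = ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / (n1 + n2 - 2)

print(round(pooled_variance, 2))  # 12.29
```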

ANOVA stands for ANalysis Of VAriance. We use this method to partition the variance of an experiment among its various sources and to test for differences between them. With one factor it is called a one-way ANOVA. ANOVA is a collection of statistical models, and their associated procedures, used to analyse the differences among group means (such as the “variation” among and between groups), developed by the statistician and evolutionary biologist Ronald Fisher.

ANOVA model This is an additive model and can be written as:

x_{ij} = {\mu} + t_{j} + \epsilon_{ij}

The individual value is made up of the overall experimental mean plus a treatment effect (tj) and a residual value (εij).

The ANOVA table is made up of columns for the source of variation (SOURCE), the degrees of freedom (df), the Sums of Squares (SS), the Mean Square (MS) and the F ratio (the F test). In Excel (one-way ANOVA) we get the following output.


F statistic An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.

F Calculated An F statistic is a value you get when you run an ANOVA test. The F value is calculated using the formula:

F = \frac{\mbox{Treatment MS}}{\mbox{Error MS}}
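To show where the Treatment MS and Error MS come from, here is a minimal Python sketch of a balanced one-way ANOVA calculated by hand. The three treatments and their values are hypothetical, with 3 replicates each, so there are 3 × 3 = 9 experimental units.

```python
from statistics import mean

# Hypothetical data: 3 treatments (levels of one factor), 3 replicates each.
groups = {
    "A": [10.0, 12.0, 14.0],
    "B": [15.0, 17.0, 19.0],
    "C": [20.0, 22.0, 24.0],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)

# Treatment SS: squared deviations of treatment means from the grand mean,
# weighted by the number of replicates in each treatment.
ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Error (residual) SS: squared deviations of each value from its treatment mean.
ss_error = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

df_treatment = len(groups) - 1            # treatments - 1
df_error = len(all_values) - len(groups)  # experimental units - treatments

ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error            # the Mean Square Error
f_calc = ms_treatment / ms_error

print(ss_treatment, df_treatment, ms_treatment)  # 150.0 2 75.0
print(ss_error, df_error, ms_error)              # 24.0 6 4.0
print(f_calc)                                    # 18.75
```

The calculated F would then be compared with the critical F value for 2 and 6 degrees of freedom, or converted to a p value by software.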

Blocking in a randomized block design, there is only one primary factor under consideration in the experiment. Similar test subjects are grouped into blocks. Blocking provides a way to take out nuisance variation among the experimental units. Block designs help maintain internal validity by reducing the possibility that observed effects are due to a confounding source of variation. Blocking is a technique for dealing with nuisance factors, and is used in field trials, glasshouse trials and animal studies.

Factor A factor is the set of treatments of interest. For example if we have a one-way or one-factor experiment it would be looking at a number of crop varieties. The specific treatments within a Factor are called levels. A two Factor experiment would look at two factors (eg. Factor 1 Fertilizer and Factor 2 Variety).

Level The various treatments within a factor are called levels. In a variety experiment this would be the Variety names (A,B,C,D).

Mean Square Error The Mean Square Error (MSE) is a variance. This is also known as the Residual Variance. It is like a pooled variance for the experiment.

Some advanced definitions

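Orthogonal contrast A contrast is a comparison among treatment means that asks a one degree of freedom question within the treatment structure. Contrasts are orthogonal when they are independent of one another, which for a balanced design means the products of their coefficients sum to zero.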
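As an illustration, the sketch below checks two hypothetical contrasts among three treatments (A, B, C) in a balanced design. The coefficients are assumptions chosen for the example: the first compares A with B, the second compares the average of A and B with C.

```python
# Hypothetical contrast coefficients for treatments A, B and C (balanced design).
contrast_1 = [1, -1, 0]   # A versus B
contrast_2 = [1, 1, -2]   # average of A and B versus C

# Each contrast's coefficients must sum to zero.
sums_to_zero = sum(contrast_1) == 0 and sum(contrast_2) == 0

# Two contrasts are orthogonal when the products of their coefficients sum to zero.
sum_of_products = sum(a * b for a, b in zip(contrast_1, contrast_2))
is_orthogonal = sum_of_products == 0

print(sums_to_zero)   # True
print(is_orthogonal)  # True
```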

Matrix notation for a statistical model See Module 7 on matrix notation and the use of matrix algebra to specify the design matrix in models.

Fixed Effect An effect specified as fixed, used for comparing the means of particular treatments.

Random Effect An effect specified as random, so the treatment is considered a random sample from a population of treatments (eg varieties), and we are interested in variances per se. 

Mixed models Mixed models have some parts specified as fixed and others as random. They are more complex to specify.

Crossed treatment structure A factorial design in which factor A is crossed with factor B, so that all combinations of the levels of A and B occur. Eg Nitrogen by Phosphorus.

Nested treatment structure A design in which factor A is nested within factor B, so that each level of A occurs within only one level of B. Eg Variety within Maturity, when looking at varieties grouped by maturity.