## This Module will assist you in understanding the technique of linear regression.

## Objectives of module 3

At the end of this module you will be familiar with the technique of regression. You should understand the research questions to which it is relevant, the assumptions for the analysis, the analysis and plotting the regression line. We also cover methods and measures to assess the fit and describe the coefficients with reliability. Also we will plot the fitted lines and confidence intervals. Finally you will have the skills to summarise the analysis for a report.

## Contents

- Terminology to be familiar with
- Introduction to simple linear regression
- The regression model and ANOVA for regression
- Use of the regression equation
- Describing the coefficients for the line
- The coefficients and their standard errors

## Terminology you will need to be familiar with:

Response variable, independent variable, regression coefficients, slope, intercept, R squared, residual analysis, standard errors of coefficients and confidence intervals of coefficients, model fit.

Look at the FAQ or Technical Section for more information.

In Module 2 you learnt about Analysis of Variation (ANOVA), where the data being modelled had (mainly) categorical explanatory variables and a quantitative response variable. When looking at data which has quantitative explanatory variables as well a quantitative response variable we use a technique called **Regression**. The most common form of regression used is Linear Regression, due to the simplicity of the fitting of a line. Nonlinear Regression is generally harder to fit to a data set, and so transformations are made to return to the linear model. Nonlinear Regression will be discussed in a future module.

## Visualising quantitative variables

In this module the following data set on Phosphorus levels available to plants will be used. A good idea when working with data is first to create a table of metadata, identifying the role of each variable and its type. The variables in this data set are:

Name | Description | Role | Type |
---|---|---|---|

Soil | Soil Sample | Index |
Nominal |

InP | Inorganic phosphorous | Explanatory |
Continuous |

OrgP | Organic Phosphorous | Explanatory |
Continuous |

AvailP | estimated plant available P | Response |
Continuous |

After identifying the variable roles, the next task is to read in the data set into RStudio. For this we will be using the `XLConnect`

library that allows you to import data directly from an excel worksheet. The commands to perform this operation are:

^{?}View Code RSPLUS

library(XLConnect) Phwb < - loadWorkbook("Phosphorous.xlsx") Phosphorous<-readWorksheet(Phwb,sheet=1) |

Here the required library is loaded using the command `library(XLConnect)`

, then the workbook is loaded using the command `loadWorkbook`

where `Phosphorous.xlsx`

is the name of the workbook and `Phwb`

is the data frame used to store the whole workbook. `XLConnect`

allows R to read parts of the workbook to extract the data set. Sheet 1 of the workbook contains the data set that will be used in the module and to read this in use the command `readWorksheet`

.

At this point we should load the other libraries that we will be using for the module, in the following way

^{?}View Code RSPLUS

Having loaded this data set we can now go on to visualise the interactions of the variables with each other. The simplest way to do this is to plot a scatter plot matrix of all the variables. To do this we use the command `pairs(Phosphorous[2:4],upper.panel=NULL)`

. The function pairs simply produces a scatter plot for each pair of variables, and the flag upper.panel=NULL means not to plot the lower plots (as they do not have AvalP as the response variable). We are limiting the data set to the second through fourth variables as the id variable has no use here. The plot that is produced is:

The scatterplot matrix is easy to read, with the column variable being the independent (explanatory) variable and the row variable being the dependent (response) variable. Looking at the above plot we can see a clear positive relationship between AvailP and InP along with a relationship between InP and OrgP. The scatter plot of AvailP versus OrgP looks to be fairly spread out and there appears not to be a relationship between these variables.

## Motivation for linear regression

So, having identified that there may be a relationship between AvailP and InP (we will come back to InP versus OrgP later) can we actually quantify this relationship. The simplest method to determine if there is a linear relationship between two variables is to calculate the correlation between the variables. As we are looking for linear models in this module, we will use the Pearson’s Correlation coefficient. The formula to calculate this coefficient is:

\begin{equation*}

r=\dfrac{1}{n-1}\sum{\left(\dfrac{x_j-\bar{x}}{s_x}\right)\left(\dfrac{y_j-\bar{y}}{s_y}\right)},

\end{equation*}

where \(n\) is the number of data points, \(x_i\) is the values of the explanatory variable, \(y_i\) is the value of the response variable, \(s_i\) is the sample standard deviation of the variable \(i\) and \(\bar{i}\) is the sample mean of the variable \(i\). In R, we can calculate a correlation matrix of the correlations between the pairs of variables.

^{?}View Code RSPLUS

cor(Phosphorous[2:4],method="pearson") ## InP OrgP AvailP ## InP 1.0000000 0.3989231 0.7200866 ## OrgP 0.3989231 1.0000000 0.2118376 ## AvailP 0.7200866 0.2118376 1.0000000 |

We can present this visually using the command `corrplot.mixed(cor(Phosphorous[2:4],method="pearson"))`

, and get the image below:

As we can see by the large circle that there is a good correlation between InP and AvailP at 0.72, and a weaker correlation between InP and OrgP at 0.4.

To bring some statistical rigour to our conclusion we can perform a hypothesis test of the Pearson correlation coefficient \(r \) to see that it is significantly different from zero. Here our hypotheses are:

** H_{0}:**The correlation coefficient between variable 1 and variable 2 is equal to zero, \(r=0 \)

**The correlation coefficient between variable 1 and variable 2 is not equal to zero, \(r\neq 0\)**

*H*_{1}:The test statistic used for this test is calculated by

\begin{equation*}

t_{n-2}=\frac{r-0}{\sqrt{\frac{1-r^2}{n-2}}}

\end{equation*}

where \(t_{n-2} \) has Student’s t-distribution with degrees of freedom \(n-2 \). As we have 17 observations in our data set we will have 15 degrees of freedom.

We can get R to perform the calculations for the hypothesis tests using the function `cor.test(variable 1,variable 2, alternative="two.sided",method="pearson")`

. First we will look at the correlation between InP and AvailP. The results are shown below.

^{?}View Code RSPLUS

cor.test(Phosphorous$InP,Phosphorous$AvailP,alternative="two.sided",method="pearson") ## ## Pearson's product-moment correlation ## ## data: Phosphorous$InP and Phosphorous$AvailP ## t = 4.0192, df = 15, p-value = 0.001115 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.3661783 0.8920037 ## sample estimates: ## cor ## 0.7200866 |

The *p*-value obtained is \(p=0.001115\), (\(t_{15}=4.0192\)), which shows strong evidence to reject the null hypothesis. Therefore we can conclude that the correlation coefficient between InP and AvailP is significantly different to zero, suggesting there is a significant linear relationship present. From the results, the 95% confidence interval for the coefficient is between 0.3661783 and 0.8920037, which does not contain zero and therefore supports our results.

When we go on to test the correlation between InP and OrgP we obtain the following results

^{?}View Code RSPLUS

cor.test(Phosphorous$InP,Phosphorous$OrgP,alternative = "two.sided",method="pearson") ## ## Pearson's product-moment correlation ## ## data: Phosphorous$InP and Phosphorous$OrgP ## t = 1.6849, df = 15, p-value = 0.1127 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## -0.1011081 0.7380533 ## sample estimates: ## cor ## 0.3989231 |

Here we obtain a *p*-value of \(p=0.1127\), (\(t_{15}=1.6849\)), which shows very weak evidence to reject the null hypothesis at the 5% significance level. Therefore we conclude that the correlation between InP and OrgP is not significantly different from zero. Again the 95% confidence interval confirms our results as it contains zero. This conclusion is a bit of a mouthful, but is stating that we do not have evidence to support the alternative hypothesis that the correlation is significantly different from zero, while equally saying that it may not be observed to be zero.

Finally we will perform the hypothesis test on the correlation of OrgP and AvailP for completeness.

^{?}View Code RSPLUS

cor.test(Phosphorous$OrgP,Phosphorous$AvailP,alternative="two.sided",method="pearson") ## ## Pearson's product-moment correlation ## ## data: Phosphorous$OrgP and Phosphorous$AvailP ## t = 0.8395, df = 15, p-value = 0.4144 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## -0.2992794 0.6284903 ## sample estimates: ## cor ## 0.2118376 |

The *p*-value we obtain from this test is \(p=0.4144\), (\(t_{15}=0.8395\)), which shows very little evidence to reject the null hypothesis. Similarly to the previous test we conclude that the correlation between OrgP and AvailP is not significantly different from zero.

### Just because it has a high correlation…

One common mistake people make while looking at correlation is to use variables which in context would not be expected to be related. The old phrase “Correlations does not imply causation” is used to describe situations such as these. So to demonstrate this point we will use an example from the entertaining site Spurious Correlations (we prefer the old version of this site as it gives the data set used). Here we will be looking at the correlation between the Number of people who died by becoming tangled in their bedsheets in the USA and Total revenue generated by skiing facilities in the USA. When we visualise the data we get the following plot

As we can see there is a strong linear relationship here. When we calculated the Pearson Correlation Coefficient we obtain a value of 0.9697241. This suggests a strong correlation between these variables, but in context of the real world what does a relationship between these variables actually mean? We intuitively know that there is no direct link between the money a skiing facility makes and people dying in bedsheets. So as with ANOVA we have to be careful in the variables we are using for our model.

## Introduction to Linear Regression

From the analysis above we have shown that InP and AvailP are linearly correlated to each other. However while this is an important property, it does allow us to describe relationship between the variable further than them being linearly related. We would ideally like to create a model that will allow us to estimate the AvailP value given a value for InP. So we need to perform a linear regression.

Regression analysis is a statistical process used in estimating relationships among variables. In most cases, regression analysis is used to help understand how a general value of the dependent variable changes when a single independent variable is varied (all other independent variables must remain fixed). To establish a meaningful relationship, or model of the data, it is important that only one independent variable is varied at a time. This relationship or model developed is a function of the independent variables and is called the regression function.

Regression analysis provides the tools to effectively identify which independent variables are related to the dependent variables and to help describe the forms of these relationships. These tools are are used widely in prediction and forecasting.

Click here and watch this Youtube on linear regression and using R.

This powerpoint gives you an overview of linear regression Presentation on Regression

Returning to our example we can draw a regression line on our scatterplot using the command `xyplot(AvailP~InP,data=Phosphorous,type=c("p","r"))`

It is the aim of our regression analysis to find the equation that describes the line shown.

^{?}View Code RSPLUS

model1=lm(AvailP~InP,data=Phosphorous) summary(model1) ## ## Call: ## lm(formula = AvailP ~ InP, data = Phosphorous) ## ## Residuals: ## Min 1Q Median 3Q Max ## -27.0563 -3.0611 0.9389 5.0380 18.0165 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 62.5694 4.4519 14.055 4.85e-10 *** ## InP 1.2291 0.3058 4.019 0.00111 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.92 on 15 degrees of freedom ## Multiple R-squared: 0.5185, Adjusted R-squared: 0.4864 ## F-statistic: 16.15 on 1 and 15 DF, p-value: 0.001115 anova(model1) ## Analysis of Variance Table ## ## Response: AvailP ## Df Sum Sq Mean Sq F value Pr(>F) ## InP 1 2295.2 2295.23 16.154 0.001115 ** ## Residuals 15 2131.2 142.08 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |

### Regression Models

A regression model as described above is a function which estimates a value of the dependent variable. If we set all dependent variables as Y, all independent variables as X, and any unknown variables \(\beta\), then the regression model gives the expectation value of Y, given X, written \(E(X|Y)\). This maps Y to a function of X and \(\beta\), and in general, is of the form:

\begin{equation} \label{commoneq}

E(Y|X=x) \approx f(X, \beta, \epsilon)

\end{equation}

where $$ E(Y|X=x) = \textbf{Y} = \left\{ \begin{array}{c}

y_1 \\

y_2 \\

\vdots \\

y_n

\end{array} \right \}, $$

$$ \textbf{X} = \left \{ \begin{array}{ccc}

x_{11} & \cdots & x_{1p} \\

x_{21} & \cdots & x_{2p}\\

\vdots & \ddots & \vdots \\

x_{n1} & \cdots & x_{np}

\end{array} \right \} , \hspace{5pt} \beta = \left \{ \begin{array}{c}

\beta_1 \\

\beta_2 \\

\vdots \\

\beta_p

\end{array} \right \} $$

- \(y_i\) is referred to as the response variable, measured variable, or dependent variable.
- \(x_{i1}, x_{i2}, \dots , x_{ip}\) are reffered to as the regressors, explanatory variables, input variables, predictor variables, or independent variables.
- \(\beta\) is a parameter vector. It can also be referred to as regression coefficients.
- \(\epsilon_i\) is called the error term, disturbance term, or noise. This variable captures all other factors which influence the dependent variable \(y_i\) other than the regressors \(x_i\). The relationship between the error term and the regressors, for example whether they are correlated, is a crucial step in formulating a linear regression model, as it determines the appropriate regression model to use.

In order to complete the regression analysis, the general form of this function f must be found. Often previous knowledge about the relationship between the variables will make this clear. If no prior knowledge is known, then different models can be tested to see which one is most appropriate. Some of the more common regression models, along with the assumptions needed to use them which will be discussed are Linear Regression, Logistic Regression, and Ordinary Least Squares.

## The regression model and ANOVA for regression explained

## Describing the coefficients for the line

How we get the coefficients out of the analysis.

How can we use these coefficients

Use the equation

What is extrapolation, how far can we go in using this equation

## The coefficients and their standard errors

Estimate of Confidence interval for the fitted values

Prediction interval for a new value

## An introduction to multiple regression

This method computes the least squares fit linear function of two or more independent variables.

We estimate an intercept and slopes for each independent variable

Convert to maths \(\beta_o\) , \(\beta_1\) and up to \(\beta_n\)

The assumptions for multiple regression

^{?}View Code RSPLUS

model2=lm(AvailP~InP*OrgP,data=Phosphorous) summary(model2) ## ## Call: ## lm(formula = AvailP ~ InP * OrgP, data = Phosphorous) ## ## Residuals: ## Min 1Q Median 3Q Max ## -20.042 -6.353 2.443 5.180 15.687 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 45.92323 12.23633 3.753 0.00241 ** ## InP 5.30367 1.73357 3.059 0.00913 ** ## OrgP 0.32783 0.28562 1.148 0.27174 ## InP:OrgP -0.08309 0.03536 -2.350 0.03522 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 10.65 on 13 degrees of freedom ## Multiple R-squared: 0.6668, Adjusted R-squared: 0.59 ## F-statistic: 8.673 on 3 and 13 DF, p-value: 0.002022 anova(model2) ## Analysis of Variance Table ## ## Response: AvailP ## Df Sum Sq Mean Sq F value Pr(>F) ## InP 1 2295.23 2295.23 20.2328 0.0005994 *** ## OrgP 1 29.95 29.95 0.2640 0.6160271 ## InP:OrgP 1 626.55 626.55 5.5231 0.0352174 * ## Residuals 13 1474.74 113.44 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |

## Methods for undertaking multiple regression

Strategy for examining variables to include in multiple regression

- Look at the distributions of each variable, check if it is normal
- Consider if you need to transform a variable
- Check for correlation among variables (collinearity)

Methods

- Enter all variables in the model
- Stepwise entry of variables
- Backwards elimination of variables
- Comparing two models in R

## Examining model fit and regression diagnostics

- Plot the fitted values versus the residuals, look for any patterns
- QQ plot of fitted v residuals
- Adjusted R squared (probably not very useful)
- Calculating AIC
- Influential values, such as Cooks distance, Leverage

## Leave a Reply