Regression analysis is a statistical process used in estimating relationships among variables. In most cases, regression analysis is used to help understand how a general value of the dependent variable changes when a single independent variable is varied (all other independent variables must remain fixed). To establish a meaningful relationship, or model of the data, it is important that only one independent variable is varied at a time. This relationship or model found is a function of the independent variables and is called the regression function. Using this function, regression analysis estimates the conditional expectation of the dependent variable given the independent variables.

Regression analysis provides the tools to effectively identify which independent variables are related to the dependent variables and to help describe the forms of these relationships. These tools are are used widely in prediction and forecasting. For the purpose of this module, you can imagine regression analysis to refer to a technique to estimate the value of dependent variables which are continuous. (For revision on variable types, continuous vs discrete, independent vs dependent etc, see module 0).

A regression model as described above is a function which estimates a value of the dependent variable. If we set all dependent variables as Y, all independent variables as X, and any unknown variables \(\beta\), then the regression model gives the expectation value of Y, given X, written \(E(X|Y)\). This maps Y to a function of X and \(\beta\), and in general, is of the form:

\begin{equation} \label{commoneq}

E(Y|X=x) \approx f(X, \beta, \epsilon)

\end{equation}

where $$ E(Y|X=x) = \textbf{Y} = \left\{ \begin{array}{c}

y_1 \\

y_2 \\

\vdots \\

y_n

\end{array} \right \}, $$

$$ \textbf{X} = \left \{ \begin{array}{c}

x_1^T \\

x_2^T \\

\vdots \\

x_n^T

\end{array}\right \} = \left \{ \begin{array}{ccc}

x_{11} & \cdots & x_{1p} \\

x_{21} & \cdots & x_{2p}\\

\vdots & \ddots & \vdots \\

x_{n1} & \cdots & x_{np}

\end{array} \right \} , \hspace{5pt} \beta = \left \{ \begin{array}{c}

\beta_1 \\

\beta_2 \\

\vdots \\

\beta_p

\end{array} \right \} $$

- \(y_i\) is referred to as the response variable, measured variable, or dependent variable.
- \(x_{i1}, x_{i2}, \dots , x_{ip}\) are reffered to as the regressors, explanatory variables, input variables, predictor variables, or independent variables.
- \(\beta\) is a parameter vector. It can also be referred to as regression coefficients.
- \(\epsilon_i\) is called the error term, disturbance term, or noise. This variable captures all other factors which influence the dependent variable \(y_i\) other than the regressors \(x_i\). The relationship between the error term and the regressors, for example whether they are correlated, is a crucial step in formulating a linear regression model, as it determines the appropriate regression model to use.

In order to complete the regression analysis, the general form of this function f must be found. Often previous knowledge about the relationship between the variables will make this clear. If no prior knowledge is known, then different models can be tested to see which one is most appropriate. Some of the more common regression models, along with the assumptions needed to use them which will be discussed are Linear Regression, Logistic Regression, and Ordinary Least Squares.

### Linear Regression

Linear regression is one of the most studied types of regression analysis due to its wide range of applicability to practical applications. Linear regression models are generally used for predicting values of y given values of x, however they can also be used to determine the strength of the relationship between x and y, and which values of x, or which independent variables should be left out of the model.

Suppose we have a data set of \(\{y_i, x_{i1}, x_{i2}, \dots x_{ip} \}^{n}_{i=1} \) of n statistical units, where \(y_i\) is the dependent variable and \(x_i’s\) are the dependent variables. A linear regression model means the relationship between \(y_i\), and the p-vector of the regressors \(x_i\) is linear. This relationship is modeled through a disturbance term, or error variable \(\epsilon_i\). This model takes the form of:

$$

y = \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i \Rightarrow Y = x_i^T\beta_i + \epsilon_i, \hspace{15pt} i = 1, \dots , n

$$

This leads to the equations of the form

$$

E(Y|X=x) = X\beta + \epsilon, \hspace{10pt} \text{Var}(Y|X=x) = \sigma^2 \\

$$

where \(Y\), \(X\) and \(\epsilon\) are given in equation \ref{commoneq}.

In order for data to be successfully modelled by a linear regression, certain assumptions must be made.

**Linearity**. This assumes that the variables are linearly related to each other.**Independence of Error**. This assumes that the errors (difference between measured dependent variable and the expected value of the regression function corresponding to the same independent variables) associated with each measurement or data point of the dependent variable are not correlated and random**Constant variance (homoscedasticity)**. This implies the measured dependent variables have the same variance in their errors, regardless of which independent variables were chosen.**Variability of the residuals fit a normal distribution**.

If there is only one independent variable then the process is called \textbf{simple linear regression}, while if there exist more than 1 independent variable then its referred to as **multiple linear regression**.

### Logistic Regression

Logistic Regression is regression model where the dependent variable is categorical. Logistic regression usually refers to the case when the response variable is binary, that is, can only take on two values, 0 or 1. Estimating models when the response variable can take on more than two values are referred to as multinomial logistic regression, or if multiple categories are ordered, then its referred to as ordinal logistic regression. Logistic models are used to predict a binary response dependent on one or more explanatory, or in this case, predictor variables. They measure a relationship by estimating the probabilities of the response variable being either of the two binary values using a logistic function. Logistic regression is similar to linear regression, as they both models estimate a conditional probability using a regression function, however the underlying assumptions highly differ in the models. The conditional distribution is no longer calculated as a familiar Gaussian (normal) distribution like a linear model, but on a Bernoulli distribution as the response variable is binary. Also, the predicted values are probabilities, and thus constricted to values within the range of [0,1].

A study by Santiago López, showed that agricultural expansion in an area due to human settlement and population pressure can be described through multivariate logistic models. The model showed that the probability that a segment of forest in the area would be replaced by agricultural land use increased with an increase in human population of the area., A figure displaying the outcome is seen in Figure \ref{Log}.

\begin{figure}[!h]

\centering

\includegraphics[scale = 0.8]{Log}

\caption{Distance of Land converted to agriculture (x) vs population density per square km (y).} \label{Log}

\end{figure}

\FloatBarrier

\subsection{Ordinary Least Squares Estimation}

The Least squares estimation is used to determine the coefficients ($\beta_i$) in the regression model. This technique aims to find the best estimate by minimizing the sum of the squared errors ($\epsilon_i^T * \epsilon = \epsilon_i^2$). This estimate gives an estimate for $\beta = \left \{ \begin{array}{c}

\beta_1 \\

\beta_2 \\

\vdots \\

\beta_p

\end{array} \right \} $, which we will call $\hat{\beta}$. For those interested in the derivation of the formula, we achieve the result by minimizing the following:

\begin{align*}

\sum{(y_i)}_{i = 1}^{n} &= \sum{(x_i^T \beta_i + \epsilon_i)}_{i = 1}^{i = n} = X \beta + \epsilon \Rightarrow \epsilon = Y – X \beta \nonumber \\

\sum{\epsilon^2} &= \epsilon^T \epsilon = (Y-X\beta)^T (Y-X\beta) \hspace{15pt} \text{when expanding this out, we get:} \\

Y^TY &- 2 \beta X^TY + \beta^TX^TX\beta \hspace{15pt} \text{by differentiating and setting to zero, we find} \hspace{5pt} \hat{\beta}: \\

X^TX\hat{\beta} &= X^TY, \hspace{15pt} \text{Provided $X^TX$ is invertible,} \\

\hat{\beta} &= (X^TX)^{-1} X^TY \\

X\hat{\beta} &= X(X^TX)^{-1} X^TY \\

&= HY \hspace{15pt} \text{Where H is the hat matrix, and} \hspace{5pt} H = X(X^TX)^{-1} X^T \\

\end{align*}

Now usually this H matrix is not computed by hand as its an n by n matrix. However its useful for theoretical manipulations. \\

\begin{itemize}

\item[\textbf{Predicted Values:}] \begin{align*} \hat{Y} = HY = X\hat{\beta}. \end{align*} \\

\item[\textbf{Residuals:}] \begin{align*} \hat{\epsilon} = Y-X \beta = Y- \hat{Y} = (1-H)Y \end{align*} \\

\item[\textbf{Residual Sum of Squares:}] \begin{align*} \hat{\epsilon}^T\hat{\epsilon} = Y^T(1-H)(1-H)Y = Y^T(1-H)Y \end{align*}

\end{itemize}

This method is he best possible method to find the coefficients $\beta_i$ when the errors $\epsilon_i$ are uncorrelated and have equal variance

\section{Utilising R for Regression}

\begin{itemize} \item[1.] There are many different ways to import and veiw data in R. If one wishes to load data from a spreadsheet into R, firstly ensure the spreadsheet is is saved as a .csv file. Lets use the attached file called CropYield.csv \\

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{import}

\end{figure}

\FloatBarrier

\item[2.] In the top right panel, click Import Dataset from Text File under the environment tab. Then find CropYield.csv. If needed adjust the options until the dataset looks correct, then click Import. Once this data set has been loaded, rainfall.csv will appear in your workspace on the left. In the top left panel, click CropYield.csv.

\end{itemize}

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 0.4]{import2}

\end{figure}

\FloatBarrier

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 0.3]{import3}

\end{figure}

\FloatBarrier

If you wish to manually type in the data, firstly define a new variable (can be anything, in the case below its called ‘randomvariables’), and type the data in brackets separated by commas as shown below, with the letter ‘c` in front. Note typing ‘c` tells R to create this as a set of numbers. \\

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{random}

\end{figure}

\FloatBarrier

\subsection{Simple Linear Regression}

Following are some of the functions you will need to be familiar with in R in order to calculate regression.

Firstly we must plot the data sets of interest. Lets say we had a dataset with two variables called ‘Height’ and ‘Forest”, where ‘Height’ was the dependent or response variable, to plot would type the code: \\

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{plot1}

\end{figure}

\FloatBarrier

Also note that plot(Height, Forest) also performs the same plot. After observing your model, determine if the trend seems to be approximately linear, and without any extreme outliers. Also check that the variability seems constant as Height (the response or the dependent variable) increases or decreases. If these conditions seem to visually hold, we can now set the model as a linear model (lm) and save it under a name, lets call it model1. So you would type: \\

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{model1}

\end{figure}

\FloatBarrier

Now that the model has been fitted, we can obtain a summary of the model to observe the output which describes relationships between the variables.

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{sum}

\end{figure}

\FloatBarrier

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 0.6]{linebest}

\caption{A set of data points fitted with a linear Regression model.}

\end{figure}

\FloatBarrier

\subsection{Multiple Linear Regression}

When using more than one independent variable, the same process is used as described above, however when fitting the model, slightly different syntax is used. Suppose we have three independent variables, $x_1$, $x_2$, and $x_3$. then to create a linear model called, lets say $\text{Model}_1$, the the following command would be used:

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{model2}

\FloatBarrier

\end{figure}

After the model has an appropriate fit, to see the models output, the summary command can again be used.

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{m3}

\FloatBarrier

\end{figure}

\subsection{Interpretation of Output}

When using R, understanding the output from the functions you use is perhaps the most important skill to aquire. To begin with, lets observe the common `summary’ command. The summary() command is a quick way to get the usual univariate summary information. Lets use the summary command on the above CropYield.csv file. Firstly, we begin by plotting the data to observe if a linear fit is appropriate.

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{plot}

\FloatBarrier

\end{figure}

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 0.6]{plotCr}

\FloatBarrier

\end{figure}

The data looks closely correlated enough to begin with a liner model. Lets call this model Crop.lm.

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{crop5}

\FloatBarrier

\end{figure}

When this command is used, the following data is produced:

\begin{itemize} \item[1.] Variables which were used to create the model, are shown under the heading “Call”. \\ \item[2.] A summary of the residuals in all the quartiles are displayed, along with the minimum and maximum values. A breif check is necessary to see that the residuals are normally distrubuted. That is, that the mean is close to zero and the quartiles either side of the mean, Q1 and Q3 have roughly the same magnitude. \\ \item[3.] Under the heading “Coefficients”, the estimate for the intercept and slope are given along with their respective standard errors. The residual standard error is an estimate of the parameter $\sigma$. The assumption in an ordinary least squares approximation (which is what R usues as defult for linear regression) is that the residuals are individually described by a Gaussian (normal) distribution with mean 0 and standard deviation $\sigma$. The $\sigma$ relates to the constant variance assumption; each residual has the same variance and that variance is equal to $ \sigma^2$. The $R^2$ coefficient is a statistical measure of how well the regression line approximates the real data points. The adjusted $R^2$ is the same thing as $R^2$, but adjusted for the complexity of the model, i.e. the number of parameters in the model. If we have a model with a single parameter, it will have a certain $R^2$. If we add another parameter to this model, the $R^2$ of the new model has to increase, even if the added parameter has no statistical power. The adjusted $R^2$ tries to account for this, by including information on the number of parameters in the model. The t-value shown for both the slope and intercept is the estimate provided for that co-efficient divided by the standard error. The p-value is the probability of achieving a value of t as larger or larger than the one shown if the null hypothesis were true. The null hypothesis is that all the coefficients are 0. Usually, this will be very small. If the p-value is small (Usually small means less than 0.05) it demonstrates your model is sufficient and it is now logical to add a regression line to see how closely it fits our data: \\ \end{itemize}

\begin{center} abline(mod1) \end{center}

\subsection{ANOVA for Regression}

ANOVA, short for Analysis of Variance, provides a statistical test of whether or not the means of several groups are equal.

Suppose in R our x-values (explanatory variables) are within the variable Xdata, and our y-values (response variables) are given by the variable Ydata. To obtain the ANOVA values for the data, along with other useful functions useful for analysis, the following steps are taken: \\

\noindent \textbf{ANOVA table:}

\begin{itemize} \item Fit the data to a model. If its a linear model: SomeModel = lm(Xdata $\sim$ Xdata)

\item Now to produce the ANOVA table: anova(SomeModel) \end{itemize}

\noindent \textbf{Fitted Values}

\begin{itemize} \item Using the same example as above, so obtain fitted values of the model: SomeFit = fitted(SomeModel)

\item This produces an array (similar to a list) of data called `SomeFit’ containing the fitted values of SomeModel. \end{itemize}

\noindent \textbf{Residuals}

\begin{itemize} \item to obtain the residuals of the model: ModelRes = resid(SomeModel)

\item This produces an array called `ModelRes’ containing all the data of the residuals from SomeModel. \end{itemize}

\noindent \textbf{Hypothesis Testing}

\begin{itemize} \item The test statistic should already be found from the data produced on the ANOVA table.

\item Suppose you want to observe the F-quartile given by F(u,v,w) where u,v, and w are some numbers of interest. Then the command is qf(u,w,v)

\item to observe the t-quartile given by (a,b,c) where a,b, and c are some numbers of interst, the command is qt(a,b,c) \end{itemize}

\noindent \textbf{P-values}

\begin{itemize} \item To obtain the p-value for the F-quartile of k, with degrees of freedom j and l, where k is some number and j and l are numbers representing the degrees of freedom, type pf(k,j,l) \end{itemize}

\noindent \textbf{Normal Q-Q plot }

\begin{itemize}

\item Suppose we need the Normal Probability plot for the standardized residuals of the data SomeModel.

\item We create a new variable to hold the studentized residuals: Model1-Stud = rstandard(SomeModel1)

\item To make the plot, type: qqnorm(Model1-Stud)

\item And finally to make the line appear: qqline(Model1-Stud)

\end{itemize}

\subsection{Slopes and Confidence Intervals for slopes}

\subsubsection{Goodness of Fit / What to watch out for when using R squared values.}

The most common measure of goodness of fit is the R squared value. This is referred to as the co-efficent of determination, or percentage of variance explained. It is formally defined as:

\begin{align*} R^2 = 1 – \frac{\sum{(\hat{y}_i – y_i)^2}}{\sum{(y_i – \bar{y}^2})} \end{align*}

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 1]{Rsquaredhelp}

\caption{Variation in the response y ($\bar{y}$) when x is known shown by dotted arrows while variation in y when x is unknown, shown by solid arrows.} \label{Rhelp}

\end{figure}

\FloatBarrier

Figure \ref{Rhelp} shows the construction of the R squared value in an intuitive manor. If you want to predict y, and you don’t know the value of x, then your prediction ranges over the possible values of y. However if you do know the value of x, then your prediction depends on the regression fit, or the accuracy of the regression function. The stronger the relationship between x and y, the more accurate your prediction will be. Therefore, for a perfect prediction, the ratio between the smaller arrow and the larger arrow in figure \ref{Rhelp} will be zero, (the smaller arrow will be a dot of length zero) and the R squared value will be 1.

It is very important to not put too much weight on the reliability of your R squared value. Figure \ref{Rwr} is a perfect example on why the R squared value can be misleading. In the picture, all four cases the mean of the x-values is 9, and they all have a variance of 11. The mean of the y-values is 7.5, and all have an identical variance of 4.125. They all also have an R squared value of 0.816 when plotted with the regression function y = 3 + 0.5x. This is why observing a scatter plot of the data when comping it to an R squared value is imperative.

\FloatBarrier

\begin{figure}[!h]

\centering

\includegraphics[scale = 0.5]{Rsqwrong}

\caption{Linear regression models, each with an $R^2$ value of 0.816, but each data set showing different trends.} \label{Rwr}

\end{figure}

\FloatBarrier

On the other hand, depending on the application, an R squared value need not be too high to show evidence of some correlation. In biological and social sciences,

variables tend to be more weakly correlated, so a low R squared value might be acceptable. While in physics and engineering, the data comes from closely controlled experiments, so a higher R squared value is expected and a value of 0.7 may be considered very poor. Knowledge in the application being tested, along with a scatter-plot of the data is what will ensure that the correct assumptions are made.

\subsection{Residuals}

Residuals plots are one of the most powerful tools used in the validation of a regression model. They allow the user to assess visually whether an appropriate model has been fit regardless of how many predictor variables have been used. The first and easiest way to check whether a regression model has been fitted correctly is to plot the residual plots vs x and try and identify any patterns. If no patterns exist, then the fit is adequate. If a pattern does exist, then an incorrect model has been used. To display why this is the case, suppose the correct regression model resembles a straight line. Then,

\begin{align*}

Y_i = E(Y_i|X_i = x_i) + \epsilon_i = \beta_0 + \beta_1 x_i + \epsilon_i

\end{align*}

where $\epsilon_i$ is the random error in $Y_i$. Then if the assumption that the least squares estimate of $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i$ are close to the unknown population parameters of $\beta_0 and \beta_1$, its clear that:

\begin{align*}

\hat{e}_i = y_i – \hat{y}_i = ( \beta_0 – \hat{\beta}_0) + ( \beta_1 – \hat{\beta}_1 )x_i + \epsilon_i \approx e_i \\

\end{align*}

Therefore, the residuals should resemble random errors, and anything that suggests otherwise means an incorrect model fit has been used.

Calculate residuals – `model$residuals`

Residual plot – `plot(predict(model), model$residuals)`

Histogram of residuals – `hist(model\$residuals)`

## Leave a Reply