6/4/2020
The simple linear regression model consists of the mean function \[E(Y│X=x)=\beta_0+\beta_1 x,\]
and variance function
\[var(Y│X=x)=\sigma^2,\] where:



Assume that we have n observations or cases for which the response and predictor variables are measured.
Each response will deviate from its associated mean, so our statistical model must include an additional source of variation.
The statistical model for each response is \[y_i=\beta_0+\beta_1 x_i+\epsilon_i,\quad i=1,2,\dots,n,\]
where \(\epsilon_i\) denotes the deviation of \(y_i\) from its mean.
Conditional on knowing the regressor values, the errors have:
Mathematically, this is the same as:
Mean:
\[E[Y_i|X_i] = E[\beta_0+\beta_1X_i + \epsilon_i]= \beta_0 + \beta_1X_i\] \[Var(Y_i|X_i) = Var(\beta_0+\beta_1X_i + \epsilon_i) = Var(\epsilon_i) = \sigma^2\] \[cov(Y_i, Y_j) = 0,\quad i \neq j\] The responses \(Y_i\) come from probability distributions whose means are \(\beta_0+\beta_1X_i\) and whose variances are \(\sigma^2\), the same for all levels of \(X\). Further, any two responses \(Y_i\) and \(Y_j\) are uncorrelated.

#OLS
Ordinary least squares is the method most commonly used for estimating the regression parameters.
The fitted value for observation i is the estimated mean response at the observed values of the regressor variables, and is given by \[\hat{y}_i= \hat{E}(Y│X=x_i )=\hat{\beta}_0+\hat{\beta}_1 x_i.\]
The \(\hat{}\) (hat) over \(\beta_0\) and \(\beta_1\) means that these are the estimated values of the parameters.
The fitted line is \[\hat{Y} =\hat{E} (Y│X)=\hat{\beta}_0+\hat{\beta}_1 X.\]
The residual for observation i is the difference between its response and fitted value, i.e.,
\[\hat{\epsilon}_i = y_i -\hat{y}_i\] The residual sum of squares (RSS), is the sum of the squared residuals, i.e., \[RSS = \sum_{i=1}^n \hat{\epsilon}_i^2 = \sum_{i=1}^n (y_i-\hat{y}_i)^2 = \sum_{i=1}^n\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i)\right)^2.\]
OLS estimation estimates \(\beta_0\) and \(\beta_1\) with the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the RSS.
\[\hat{\beta}_1 = \frac{\sum(x_i-\overline{x})(y_i-\overline{y})}{\sum(x_i-\overline{x})^2}\]
\[\hat{\beta}_0 = \overline{y} - \hat{\beta}_1\overline{x}\]

James D. Forbes (1809-1868) performed a series of experiments to study the relationship between atmospheric pressure and the boiling point of water. He provided 17 pairs of measurements of atmospheric pressure in inches of mercury and boiling point in degrees Fahrenheit. 
Regression pressure on boiling point temperature we obtain the fitted line \[\hat{Y} =\hat{E}(press|BP) = -81.06 + 0.52 BP\] The estimated intercept \(\hat{\beta}_0 = -81.06\), which is the esimated value of pressure when BP = 0.
The estimated slope is \(\hat{\beta} = 0.52\), meaning that for each degree increase in boiling point, we expect pressure to increase by 0.52 units.
Show that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are unbiased estimators of \(\beta_0\) and \(\beta_1\) respectively.


A very general way to model the response \(Y\) in terms of three regressors, \(X_1, X_2\), and \(X_3\) is
\(Y=f(X_1,X_2,X_3 )+\epsilon\),
where: - \(f\) is an unknown function - \(\epsilon\) is the error.
If the function \(f\) is too general, we have no hope of estimating it.
Typically, we assume a linear model, which (in this case) means
\[Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\epsilon.\]
The linear model assumption reduces the problem to the estimation of four parameters instead of an infinite-dimensional \(f\).
A linear model has the parameters enter the model linearly; the regressors do not need to be linear.
Linear model: \[Y=\beta_0+\beta_1 X_1+\beta_2 \ln (X_2)+\beta_3 X_1 X_2+\epsilon\]
Non-linear model: \[Y=\beta_0+\beta_1 X_1^{\beta_2 }+\epsilon\]
Some models are linearalizable, which means they become a linear model after a transformation, e.g., apply the ln function to \(Y=\beta_0 X_1^(\beta_1 ) \epsilon\).
Linear models are very flexible because the predictors can be transformed and combined in any way.

The regression model has p-1 predictor variables.
\[Y=\beta_0+\beta_1 X_1+\dots+\beta_{p-1} X_{p-1}+\epsilon.\]
If we have a response \(Y\) and three regressors, \(X_1, X_2\), and \(X_3\), we might present the data in tabular form as \[\begin{matrix} y_1 & x_{11} & x_{12} & x_{13}\\ y_2 & x_{21} & x_{22} & x_{23}\\ \dots & \dots & \dots & \dots \\ y_n & x_{n1} & x_{n2} & x_{n3} \end{matrix}\]
where \(n\) is the number of observations (or cases), in the dataset, \(y_1,\dots ,y_n\) are the n responses, and \(x_{ij}\) denotes the value of the \(j\)th regressor variable for observation \(i\).
Our system of regression equations can be written as
\[y_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i3}+\epsilon_i,\qquad i=1,…,n.\]
It is simpler to write the regression equation as
\[y=X\beta+\epsilon,\]
with \(y=(y_1,y_2,…,y_n )^T,\ \beta=(\beta_0,\beta_1,\beta_2,\beta_3 )^T,\ \epsilon=(\epsilon_1,\epsilon_2,…,\epsilon_n )^T\), and
\[\begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & x_{21} & x_{22} & x_{23}\\ \dots & \dots & \dots & \dots \\ 1 & x_{n1} & x_{n2} & x_{n3} \end{pmatrix}\]
In practice, we would have:
The column of ones is related to the intercept term.
We can always assume \(E(\epsilon|X)=0\) since, if this is not the case, we can always absorb it into the mean part of the model so that the errors have a zero expectation.
The regression model, \(y=X\beta+\epsilon\), partitions the response into:
We would like to choose \(\beta\) so that the systematic part explains as much of the response as possible, i.e., minimize the errors \(\epsilon\) in some way.
The regression parameters are most commonly estimated by finding the β that minimizes the sum of the squared errors.
The least squares estimator of \(\beta\), denoted \(\hat{\beta}\) minimizes
\[Q=\sum \epsilon_i^2=\epsilon^T \epsilon=(y-X\beta)^T (y-X\beta).\]
How do we find the value of \(\beta\) that minimizes \(Q\)?
Simplify \(Q\)
\[Q=y^T y-2\beta^T X^T y+\beta^T X^T X\beta\] Differentiate with respect to \(\beta\) \[\frac{\partial Q}{\partial \beta}=-2X^T y+2X^T X\beta.\]
To minimize Q, we set the derivative equal to zero and simplify.
The normal equations are \(X^T X\beta=X^T y\).
Provided that \(X^T X\) is invertible:
\[\hat{\beta}=(X^T X)^{-1} X^T y\]
The fitted values \(\hat{y} = (\hat{y}_1, \hat{y}_2, \dots , \hat{y}_n)^T\) may be written as
\[\hat{y}=X\hat{\beta}=X(X^T X)^{-1} X^T y=Hy,\]
where \(H=X(X^T X)^{-1} X^T\).
H is called the hat matrix or projection matrix.
The \(i\)th fitted value is given by \[\begin{aligned}\hat{y}_i = & \hat{E}\left(Y|X_1 = x_{i1}, \dots , X_{p-1} = x_{i, p-1} \right)\\ =& \hat{\beta}_0 + \hat{\beta}_1x_{i1}+\hat{\beta}_2x_{i2} +\dots + \hat{\beta}_{p-1}x_{i,p-1}.\end{aligned}\]
The \(i\)th fitted value is the estimated mean response for the ith observation.
The fitted model is given by \[\hat{Y}=\hat{E}\left(Y|X\right) = \hat{\beta}_0 + \hat{\beta}_1X_{1}+\hat{\beta}_2X_{2} +\dots + \hat{\beta}_{p-1}X_{p-1}.\]
The fitted model is the estimated mean response as a function of the regressors.
The residuals, \(\hat{\epsilon} = (\hat{\epsilon}_1, \hat{\epsilon}_2, \dots , \hat{\epsilon}_n)^T\), may be written as
\[\hat{\epsilon} = y-\hat{y} = y-Hy = (I-H)y\]
where \(I\) is the identity matrix of size \(n\times n\).
Note that the residual of observation \(i\) is still
\[\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1x_{i1}+\hat{\beta}_2x_{i2} +\dots + \hat{\beta}_{p-1}x_{i,p-1}).\]
Show that \(H\) is symmetric.
Show that \(H\) is symmetric.
Proof Note that, for any square and invertible matrix \(A\), \((A^T)^{-1}= (A^{-1})^T\).
\[\begin{aligned} H^T = & \left(X(X^TX)^{-1}X^T\right)^T \\ = & (X^T)^T \left((X^TX)^{-1}\right)^TX^T\\ = & X \left((X^TX)^T\right)^{-1}X^T\\ = & X \left(X^TX\right)^{-1}X^T\\ = & H. \end{aligned}\]
Show that \(H\) is idempotent.
Show that \(H\) is idempotent.
Proof Note that, a matrix \(A\) is idempotent if \(AA = A\).
\[\begin{aligned} HH = & \left(X(X^TX)^{-1}X^T\right) \left(X(X^TX)^{-1}X^T\right)\\ = & X(X^TX)^{-1}(X^T X)(X^TX)^{-1}X^T\\ = & X(X^TX)^{-1}X^T\\ = & H. \end{aligned}\]
\(I-H\) is also symmetric and idempotent.
symmetric + idempotent = orthogonal projection
\[\begin{aligned}\langle \hat{y}, \hat{\epsilon} \rangle = & \hat{y}^T\hat{\epsilon}\\ = & (Hy)^T(I-H)y\\ =& y^TH^T(I-H)y\\=& y^TH(I-H)y\\=&y^T(H-HH)y\\=&y^T(H-H)y\\=& y^T\mathbf{0}y\\=& 0\end{aligned}\]
\[\begin{aligned} Cov\left[\hat{y},\hat{\epsilon}|X\right] = & Cov\left[Hy, (I-H)y|X\right]\\ = & H\ Cov\left[y, y|X\right](I-H)^T\\ =& H\ \sigma^2 (I-H)^T\\ =& \sigma^2 H(I-H)^T\\ =& \sigma^2 H(I-H)\\ =& \sigma^2 (H-HH)\\ =& 0 \end{aligned}\]
Can you see, \[||y||^2 = ||\hat{y}||^2 + ||\hat{\epsilon}||^2\] \[Var(y) = Var(\hat{y}) + Var(\hat{\epsilon})\]
The residual or error sum of squares, RSS, may be written as
\[RSS = \sum \hat{\epsilon}_i^2 = \hat{\epsilon}^T\hat{\epsilon} = y^T(I-H)y\]
\[\begin{aligned} RSS = & \hat{\epsilon}^T\hat{\epsilon} \\ = & \left((I-H)y\right)^T\left((I-H)y\right)\\ =& y^T(I-H)^T(I-H)y\\ =& y^T(I-H)(I-H)y & (I-H) \text{ is symmetric}\\ =& y^T(I-H)y & (I-H) \text{ is idempotent} \end{aligned}\]
The ordinary least squares estimator \(\hat{\beta}\) is an unbiased estimator for \(\beta\).
The ordinary least squares estimator \(\hat{\beta}\) is an unbiased estimator for \(\beta\).
\[\begin{aligned} E[\hat{\beta}] = & E\left[(X^TX)^{-1}X^Ty \right]\\ = & (X^TX)^{-1}X^TE\left[y \right]\\ = & (X^TX)^{-1}X^TE\left[X\beta + \epsilon \right]\\ = & (X^TX)^{-1}X^T\left[X\beta + E(\epsilon) \right]\\ = & (X^TX)^{-1}X^TX\beta \\ = & \beta \end{aligned}\]
Provided that \(Var(\epsilon)=\sigma^2I\), show that
\[Var(\hat{\beta}) = \sigma^2(X^TX)^{-1}\]
Provided that \(Var(\epsilon)=\sigma^2I\), show that
\[Var(\hat{\beta}) = \sigma^2(X^TX)^{-1}\]
Proof \[\begin{aligned} Var(\hat{\beta}) = & Var \left((X^TX)^{-1}X^Ty\right)\\ =& (X^TX)^{-1}X^T Var(y)\left((X^TX)^{-1}X^T\right)^T\\ =& (X^TX)^{-1}X^T Var(X\beta + \epsilon)X(X^TX)^{-1}\\ =& (X^TX)^{-1}X^T Var(\epsilon)X(X^TX)^{-1}\\ =& (X^TX)^{-1}X^T \sigma^2IX(X^TX)^{-1}\\ =& \sigma^2(X^TX)^{-1}X^T X(X^TX)^{-1}\\ =& \sigma^2(X^TX)^{-1} \end{aligned}\]
The estimator \[\hat{\sigma}^2 = \frac{\hat{\epsilon}^T\hat{\epsilon}}{n-p}\] is an unbiased estimator of \(\sigma^2\).
The estimated standard error of \(\hat{\beta}_{i-1}\) is
\[\hat{se}\left(\hat{\beta}_{i-1}\right) = \hat{\sigma}\sqrt{\left(X^TX\right)^{-1}_{ii}}\] where \(\left(X^TX\right)^{-1}_{ii}\) is the element in the \(i\)th diagonal position of \((X^TX)^{-1}\).
The data contained in gala in the faraway package in R are related to the number of plant species on the various Galapagos Islands. There are 30 cases (Islands) and seven variables in the data set. The variables are:
data(gala, package = 'faraway') str(gala)
## 'data.frame': 30 obs. of 7 variables: ## $ Species : num 58 31 3 25 2 18 24 10 8 2 ... ## $ Endemics : num 23 21 3 9 1 11 0 7 4 2 ... ## $ Area : num 25.09 1.24 0.21 0.1 0.05 ... ## $ Elevation: num 346 109 114 46 77 119 93 168 71 112 ... ## $ Nearest : num 0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ... ## $ Scruz : num 0.6 26.3 58.7 47.4 1.9 ... ## $ Adjacent : num 1.84 572.33 0.78 0.18 903.82 ...
X = gala[,c("Area", "Elevation", "Nearest", "Scruz", "Adjacent")]
X$Const = rep(1, nrow(X))
X <- as.matrix(X)
Y <- gala$Species
solve(t(X)%*%X)%*%t(X)%*%Y
## [,1] ## Area -0.023938338 ## Elevation 0.319464761 ## Nearest 0.009143961 ## Scruz -0.240524230 ## Adjacent -0.074804832 ## Const 7.068220709
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala) summary(lmod)$coefficients
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.068220709 19.15419782 0.369016796 7.153508e-01 ## Area -0.023938338 0.02242235 -1.067610554 2.963180e-01 ## Elevation 0.319464761 0.05366280 5.953187968 3.823409e-06 ## Nearest 0.009143961 1.05413595 0.008674366 9.931506e-01 ## Scruz -0.240524230 0.21540225 -1.116628222 2.752082e-01 ## Adjacent -0.074804832 0.01770019 -4.226216850 2.970655e-04
Estimate column provides the estimated regression coefficients for each regressor.Std. Error column provides the estimated standard error of each estimated coefficient.t value column provides the test statistic for the null hypothesis that the regression coefficient is zero, i.e., \(\hat{\beta}_j/\hat{se}(\hat{\beta}_j),\quad j=0,1,\dots,p-1\).Pr(>|t|) column provides the two-tailed p-value for the test that each coefficient is zero (assuming the other regressors are in the model).Numerous quantities related to our linear model can be extracted from our fitted model using certain functions:
Other quantities can be extracted. To know what they are, look at the names in the model object and the summary object using the str or names functions.
Objects we can extract from model object
names(lmod)
## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "xlevels" "call" "terms" "model"
Objects we can extract from the summary object
names(summary(lmod))
## [1] "call" "terms" "residuals" "coefficients" ## [5] "aliased" "sigma" "df" "r.squared" ## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
What is the equation for the fitted model?
What is the fitted value for the first observation?
What is the residual for the first observation?
What is the equation for the fitted model?
\[\begin{aligned}Species = & 7.068 - 0.024\ Area + 0.320 \ Elevation + 0.009 \ Nearest \\ & - 0.24 Scruz - 0.075 Adjacent\end{aligned}\]
What is the fitted value for the first observation?
What is the residual for the first observation?
What is the equation for the fitted model?
\[\begin{aligned}Species = & 7.068 - 0.024\ Area + 0.320 \ Elevation + 0.009 \ Nearest \\ & - 0.24 Scruz - 0.075 Adjacent\end{aligned}\]
What is the fitted value for the first observation?
lmod$fitted.values[1]
## Baltra ## 116.7259
What is the residual for the first observation?
What is the equation for the fitted model?
\[\begin{aligned}Species = & 7.068 - 0.024\ Area + 0.320 \ Elevation + 0.009 \ Nearest \\ & - 0.24 Scruz - 0.075 Adjacent\end{aligned}\]
What is the fitted value for the first observation?
lmod$fitted.values[1]
## Baltra ## 116.7259
What is the residual for the first observation?
lmod$residuals[1]
## Baltra ## -58.72595
sqrt(deviance(lmod)/df.residual(lmod))
## [1] 60.97519
summary(lmod)$sigma
## [1] 60.97519
sqrt(sum(residuals(lmod)^2)/df.residual(lmod))
## [1] 60.97519
sqrt(sum(lmod$residuals^2)/lmod$df.residual)
## [1] 60.97519
# Estimates summary(lmod)$coefficients[,1]
## (Intercept) Area Elevation Nearest Scruz Adjacent ## 7.068220709 -0.023938338 0.319464761 0.009143961 -0.240524230 -0.074804832
# Std. Error summary(lmod)$coefficients[,2]
## (Intercept) Area Elevation Nearest Scruz Adjacent ## 19.15419782 0.02242235 0.05366280 1.05413595 0.21540225 0.01770019
# p-values summary(lmod)$coefficients[,4]
## (Intercept) Area Elevation Nearest Scruz Adjacent ## 7.153508e-01 2.963180e-01 3.823409e-06 9.931506e-01 2.752082e-01 2.970655e-04
The estimator for \(\hat{\beta}\) that we derived certainly seems like a reasonable choice, but there are others. Why use this one?
Specifically, the in the context of linear regression, the Gauss-Markov theorem is used to prove that among all unbiased linear estimators of \(\beta\) (i.e., estimators of the form \(Ay\), where \(A\) is a constant matrix of size p×n), \(\hat{\beta}\) has the minimum variance, and this estimate is unique.
The Gauss-Markov theorem assumes that:
Other estimators may be better if the assumptions of the Gauss-Markov theorem are not satisfied.
Other possibilities: - Generalized least squares should be used when the errors are correlated. - Robust estimators should be used for long-tailed distributions. - When the regressors are highly correlated, then alternative estimators such as ridge regression may be preferred.





Total Sum of Squares (TSS) \[\sum(y_i-\overline{y})^2\] Residual Sum of Squares (RSS)
\[\sum(y_i-\hat{y}_i)^2\]
Regression Sum of Squares (\(SS_{reg}\)) \[\sum(\hat{y}_i-\overline{y})^2\]
\(TSS = RSS + SS_{reg}\)
TSS measures the amount of variability in our responses in comparison to the mean.
RSS measures the amount of variability in our responses that is not explained by our model.
\(SS_{reg}\) measures the additional amount of variability in our responses explained by our model after accounting for a constant mean.

A common measure of goodness of fit is \(R^2\), the coefficient of determination (or percentage of variance explained), given by
\[R^2 = 1 - \frac{\sum (\hat{y}_i-y_i)^2}{\sum (y_i-\overline{y})^2} = 1 - \frac{RSS}{TSS} = \frac{SS_{reg}}{TSS}= cor^2(y,\hat{y})\]
The range of \(R^2\) is [0, 1], with values closer to 1 indicating better fits (in general).
A large \(R^2\) does not necessarily mean that useful prediction can be made or that the model is a good fit.
A small \(R^2\) does not mean that the regressors are unrelated to the response.
Conceptually, \(R^2\) is 1 minus the ratio of the size of the blue arrow to the red arrow.


summary(lm1)
## ## Call: ## lm(formula = y ~ x, data = dat[dat$model == 2, ]) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.0212456 -0.0042339 -0.0008513 0.0047407 0.0207738 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.001232 0.001581 -0.78 0.438 ## x 1.002688 0.002731 367.14 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.007963 on 98 degrees of freedom ## Multiple R-squared: 0.9993, Adjusted R-squared: 0.9993 ## F-statistic: 1.348e+05 on 1 and 98 DF, p-value: < 2.2e-16
summary(lm1)$r.squared
## [1] 0.9992735
This definition of R^2 only makes sense if our model has an intercept in it since the null model in the denominator has an intercept. - If a model doesn’t have an intercept, the R^2 given by the formula will be misleadingly high. - If a model doesn’t have an intercept, R^2 should be calculated using the formula \(R^2=cor^2 (y,\hat{y})\).
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala) summary(lmod)$r.squared
## [1] 0.7658469
lmod_no_intercept = lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent - 1, data = gala) summary(lmod_no_intercept)$r.squared
## [1] 0.8501933
cor(gala$Species, lmod_no_intercept$fitted.values)^2
## [1] 0.7652718
Do not rely on R^2 alone for model fit.
Anscombe’s quartet is a very famous data set where four data sets have the same least squares fit and coefficient of determination (essentially), but the data look completely different.
For each data set, the least squares line has \(\hat{\beta}_0≈3\) and \(\hat{\beta}_1≈0.5\). Similarly, \(R^2\approx 0.67\).
Though each model has an \(R^2\) of about 0.67, the fit to the data is very different!

\(R^2\) is most useful if there is approximately a planar relationship in the data.
If the plot of \(y\) versus \(\hat{y}\) is approximately a straight line, then \(R^2\) is a meaningful summary of model fit (otherwise it is not).
For which of the following models is R^2 a useful measure of model fit?

Every statistic has an associated degrees of freedom.
Generally, this is n minus the number of parameters that were estimated in computing the statistic.
TSS has \(n-1\) degrees of freedom since only the intercept \(\beta_0\) must be estimated.
RSS has \(n-p\) degrees of freedom since \(p\) regression coefficients \((\beta_0,\beta_1,\dots ,\beta_(p-1))\) must be estimated.
\(SS_{reg}\) has \(p-1\) degrees of freedom.
The least squares estimator is the solution to the normal equations \[X^T X\hat{\beta}=X^T y\], where \(X\) is an \(n\times p\) matrix of predictors.
If \(X^T X\) is singular and cannot be inverted, there is no unique solution to the normal equations and \(\hat{\beta}\) is unidentifiable.
Unidentifiability will occur when \(X\) is not full rank (the columns of \(X\) are linearly dependent).
Problems with unidentifiability are rare for observational data, but more common with experiments. For observational data, we may have:
A person’s weight is measured and entered in both pounds and kilos. One is just a multiple of the other.
We record the number of years of pre-university education, university education, and total education. There is a perfect linear relationship between these variables.
We have more variables than observations. When \(p=n\) we have a saturated model, we may be able to estimate some of the parameters. When \(p>n\), we have a supersaturated model, and cannot uniquely estimate parameters. This happens with genetic data sometimes.
Removing one (or more) of the linearly dependent columns should correct the problem.
This issue is more common in designed experiments. Consider a two-sample experiment where the treatment observations are \(y_1,\dots,y_n\) and the controls are \(y_{n+1},\dots ,y_{n+m}\).
We model the overall mean μ and group effects α_1 and α_2:
\[y_j=\mu+\alpha_i+\epsilon_j,\quad i=1,2, \quad j=1,\dots,m+n,\] or more specifically,
\[\begin{pmatrix}y_1\\\vdots\\y_n\\y_{n+1}\\\vdots\\y_{m+n}\end{pmatrix} = \begin{pmatrix} 1 & 1 & 0\\ \vdots & \vdots & \vdots \\1 & 1 & 0\\1 & 0 & 1\\\vdots & \vdots & \vdots \\ 1 & 0 & 1\end{pmatrix} \begin{pmatrix}\mu \\\alpha_1\\\alpha_2\end{pmatrix} + \begin{pmatrix}\epsilon_1\\\vdots\\\epsilon_{m+n}\end{pmatrix}\]
\(X\) only has rank two, so the model is not identifiable and the normal equations have infinitely many solutions.
We solve the problem by imposing some constraints, e.g., \(\alpha_1=0\) (set to zero/corner constraint) or \(\alpha_1+\alpha_2=0\) (sum to zero constraint).
R tries to fit the largest identifiable model by removing variables in reverse order of appearance in the model formula.
Create a new variable Adiff in the Galapagos dataset—the difference in area between the island and its nearest neighbor.
gala$Adiff <- gala$Area-gala$Adjacent
If we fit a model using this variable and the previous variables, we get a note about a singularity.
lmod <- lm(Species ~ Area+Elevation+Nearest+Scruz+Adjacent+Adiff,gala) sumary(lmod)
## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.068221 19.154198 0.3690 0.7153508 ## Area -0.023938 0.022422 -1.0676 0.2963180 ## Elevation 0.319465 0.053663 5.9532 3.823e-06 ## Nearest 0.009144 1.054136 0.0087 0.9931506 ## Scruz -0.240524 0.215402 -1.1166 0.2752082 ## Adjacent -0.074805 0.017700 -4.2262 0.0002971 ## ## n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
Models can also be close to unidentifiable.
set.seed(123) Adiffe <- gala$Adiff+0.001*(runif(30)-0.5) lmod <- lm(Species ~ Area+Elevation+Nearest+Scruz +Adjacent+Adiffe,gala) sumary(lmod)
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.2964e+00 1.9434e+01 0.1696 0.8668 ## Area -4.5123e+04 4.2583e+04 -1.0596 0.3003 ## Elevation 3.1302e-01 5.3870e-02 5.8107 6.398e-06 ## Nearest 3.8273e-01 1.1090e+00 0.3451 0.7331 ## Scruz -2.6199e-01 2.1581e-01 -1.2140 0.2371 ## Adjacent 4.5123e+04 4.2583e+04 1.0596 0.3003 ## Adiffe 4.5123e+04 4.2583e+04 1.0596 0.3003 ## ## n = 30, p = 7, Residual SE = 60.81975, R-Squared = 0.78
Orthogonal regressors make it easier to interpret the effect of each regressor on the response.
Regressors \(x_i\) and \(x_j\) are orthogonal if \((x_i-\overline{x}_i)^T (x_j-\overline{x}_j)=0\).
If a regressor is orthogonal to another regressor, it’s addition or deletion from the fitted model will not change the estimated regression coefficient for the other regressor (though the estimated standard error will change slightly).
Orthogonality typically only occurs when X is chosen by the experimenter.
Regressors are almost never orthogonal for observational data.
Consider the relationship between column temperature, gas/liquid ratio, and packing height in reducing the unpleasant odor of a chemical product that was sold in household use.
Each of the predictors has been rescaled to -1, 0, and 1.
Note that even if we remove a regressor, the estimated coefficients are the same, but the test statistics and residual standard errors change slightly.
data(odor, package = 'faraway') head(odor)
## odor temp gas pack ## 1 66 -1 -1 0 ## 2 39 1 -1 0 ## 3 43 -1 1 0 ## 4 49 1 1 0 ## 5 58 -1 0 -1 ## 6 17 1 0 -1
x <- model.matrix(~ temp + gas + pack, data = odor) cov(x)
## (Intercept) temp gas pack ## (Intercept) 0 0.0000000 0.0000000 0.0000000 ## temp 0 0.5714286 0.0000000 0.0000000 ## gas 0 0.0000000 0.5714286 0.0000000 ## pack 0 0.0000000 0.0000000 0.5714286
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 15.2000 9.2981 1.6347 0.1304 ## temp -12.1250 12.7320 -0.9523 0.3614 ## gas -17.0000 12.7320 -1.3352 0.2088 ## pack -21.3750 12.7320 -1.6788 0.1213 ## ## n = 15, p = 4, Residual SE = 36.01155, R-Squared = 0.33
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 15.200 9.262 1.6411 0.1267 ## gas -17.000 12.682 -1.3404 0.2049 ## pack -21.375 12.682 -1.6854 0.1177 ## ## n = 15, p = 3, Residual SE = 35.87162, R-Squared = 0.28
403 African Americans were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia. The diabetes data are in the faraway package.
We will consider the relationship between cholesterol and several of the predictors.
Regress total cholesterol on Stabilized Glucose, high density lipoprotein, glycosolated hemoglobin, age, gender, height, weight, waist size, and hip size.