6/4/2020
Estimation and inference for a regression model depend on several assumptions.
These assumptions must be checked using regression diagnostics.
There are three main categories of linear regression assumptions:
Model: The structural (mean) part of the model is correct, i.e., \(E(y)=X\beta\).
Error: \(\epsilon\sim N(0,\sigma^2 I)\), i.e., that the errors are normally distributed, independent, and identically distributed with mean 0 and variance \(\sigma^2\).
Unusual observations: All observations are equally reliable and have approximately equal role in determining the regression results and in influencing conclusions.
Diagnostics techniques may be:
Regression diagnostics often suggest improvements, causing you to fit another model, do more diagnostics, etc.
Model building is an iterative process!
At this time, we will only concern ourselves with checking model structure, which is the most important assumption.
Note: most of the functions we will use are in the car package.
If the linear model is correctly specified, then \(cor(\hat{\epsilon},\hat{y})=0\) and \(cor(\hat{\epsilon},x_j )=0\).
Patterns in the plots of \(\hat{\epsilon}\) versus \(\hat{y}\) or \(\hat{\epsilon}\) versus \(x_j\) can occur only if some of the model assumptions are violated.
These plots should be “null plots” with no systematic features.
A lack-of-fit test can be used to examine a plot of the residuals versus the regressors.
Tukey’s test for nonadditivity can be used to examine the plot of the residuals versus fitted values.
If either of these tests are significant, it suggests there is unaccounted curvature in the data that is not captured by the fitted model.
The residualPlots function can be used to easily produce these residual plots and perform the associated tests.
We examine the relationship between occupational “prestige” and various predictors among Canadians.
The data include the variables:
education - Average education of occupational incumbents, years, in 1971.income - Average income of incumbents, dollars, in 1971.women - Percentage of incumbents who are women.prestige - Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.census - Canadian Census occupational code.type - Type of occupation. A factor variable with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.prestige ~ education + income + typedata(Prestige, package = 'carData') lmod = lm(prestige ~ education + income + type, data = Prestige) car::residualPlots(lmod)

## Test stat Pr(>|Test stat|) ## education -0.6836 0.495942 ## income -2.8865 0.004854 ** ## type ## Tukey test -2.6104 0.009043 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
prestige ~ education + income + typedata(Prestige, package = 'carData') lmod = lm(prestige ~ education + income + type, data = Prestige) knitr::kable(car::residualPlots(lmod, plot = F, test = F))
## Test stat Pr(>|Test stat|) ## education -0.6836 0.495942 ## income -2.8865 0.004854 ** ## type ## Tukey test -2.6104 0.009043 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Test stat | Pr(>|Test stat|) | |
|---|---|---|
| education | -0.6836062 | 0.4959420 |
| income | -2.8864740 | 0.0048544 |
| type | NA | NA |
| Tukey test | -2.6104250 | 0.0090430 |
lmod = lm(prestige ~ education + income +I(income^2) + type , data = Prestige) summary(lmod)
## ## Call: ## lm(formula = prestige ~ education + income + I(income^2) + type, ## data = Prestige) ## ## Residuals: ## Min 1Q Median 3Q Max ## -13.6515 -4.4852 0.3803 4.6601 19.1576 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -3.148e+00 5.108e+00 -0.616 0.53930 ## education 3.175e+00 6.404e-01 4.957 3.25e-06 *** ## income 2.648e-03 6.049e-04 4.377 3.17e-05 *** ## I(income^2) -6.376e-08 2.209e-08 -2.886 0.00485 ** ## typeprof 7.249e+00 3.746e+00 1.935 0.05606 . ## typewc -1.118e+00 2.485e+00 -0.450 0.65389 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.831 on 92 degrees of freedom ## (4 observations deleted due to missingness) ## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8403 ## F-statistic: 103.1 on 5 and 92 DF, p-value: < 2.2e-16
lmod = lm(prestige ~ education + income + type , data = Prestige) Prestige_new <- cbind(na.omit(Prestige), data.frame(fitted = lmod$fitted.values)) lmod1 = lm(prestige ~ education + income + type + I(fitted^2) , data = Prestige_new) summary(lmod1)$coefficients
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -23.354787776 1.007721e+01 -2.317584 2.268987e-02 ## education 7.041975447 1.432323e+00 4.916472 3.830669e-06 ## income 0.002326133 5.467194e-04 4.254710 5.028966e-05 ## typeprof 12.597502393 4.514985e+00 2.790154 6.404466e-03 ## typewc -6.373808834 2.808685e+00 -2.269321 2.558554e-02 ## I(fitted^2) -0.009658421 3.699942e-03 -2.610425 1.055735e-02
pnorm(summary(lmod1)$coefficients[6,3])
## [1] 0.00452149
The education variable shows no systematic patterns.
The education variable shows no systematic patterns.
The income variable has a somewhat non-linear pattern, which is confirmed by the lack-of-fit test.
The education variable shows no systematic patterns.
The income variable has a somewhat non-linear pattern, which is confirmed by the lack-of-fit test.
The type variable shows no clear systematic pattern (prof has a median slightly above 0, but that can certainly happen due to sampling variation).
The education variable shows no systematic patterns.
The income variable has a somewhat non-linear pattern, which is confirmed by the lack-of-fit test.
The type variable shows no clear systematic pattern (prof has a median slightly above 0, but that can certainly happen due to sampling variation).
There is clear non-linearity in the plot of the residuals versus fitted values, as confirmed Tukey’s test for nonadditivity.
Residual plots can detect nonlinearity, but they cannot be used to determine whether the nonlinearity is monotonic (think a log relationship) or non-monotonic (think a quadratic relationship).
Consider two data sets, A and B. Fit the following models to the data sets:
A: \(y=\beta_0+\beta_1 \sqrt{x}+\epsilon\)
B: \(y=\beta_0+\beta_1 x+\beta_2 x^2+\epsilon\)

Surprisingly, the residual plots for the two fitted models (the lower panels) are identical, even though data set A needed a monotonic transformation (log) of x, while data set B required a non-monotonic (quadratic) transformation of x.
The marginal model plot compares the marginal relationship between the response and each regressor.
This plot consists of:
The car package uses the loess smoother.
For a non-problematic fitted model:
For genuine, real-life, noisy data, it is possible neither line fits the data very well, but they should match any obvious structural patterns.
The marginalModelPlots function generates these graphs.
Note: not seeing a problem does NOT indicate we have a good model, only that there are no apparent problems.
prestige ~ education + income + typecar::marginalModelPlots(lmod)
## Warning in mmps(...): Interactions and/or factors skipped

The marginal model plots do not provide clear evidence of a model problem for the occupational prestige data, though the plot of \(y\) versus \(\hat{y}\) is suspicious.
Added variable (av) plots (or partial regression plots) help to isolate the impact of regressor \(x_i\) on the response \(y\), after accounting for the effect of the other regressors in the model.
Added variable plots can identify a non-linear relationship between the response and a regressor.
If the data follow a clear non-linear pattern in comparison with the least-squares line, then there is a structural problem with our model.
A curve in the points and a dramatic change in the structure of the points would indicate a problem with the structural component of the model.
The added variable plot cannot suggest a transformation because the x-axis is not the original predictor.
The plot CAN indicate whether the transformation should be monotonic or non-monotonic.
The avPlots function can be used to generate added variable plots for a fitted model.
Added variable plots can be used to assess the strength of the relationship between the response and a regressor.
Added variable plots can be used to identify outliers and/or high leverage observations that seem to be influential in determining the estimated coefficient for \(x_i\).
prestige ~ education + income + typecar::avPlots(lmod)

The component plus residual (cr) plot (a.k.a, partial residual plot) is a competitor to the added variable plot.
The cr plot shows \(\hat{\beta}_i x_i+\hat{\epsilon}\) versus \(x_i\).
\(\hat{\beta}_i x_i\) is the “component” for \(x_i\).
This is motivated by the relationship \[y - \sum\limits_{j\neq i} x_j\hat{\beta}_j = \hat{y} +\hat{\epsilon} - \sum\limits_{j\neq i} x_j\hat{\beta}_j =x_i\hat{\beta}_i+\hat{\epsilon}.\]
The idea is to compare the impact of the \(i\)th regressor on the fitted values.
cr plots are useful for checking nonlinear relationships in the variable being considered for inclusion in the model.
The crPlots function can be used to generate component plus residual plots for a fitted model.
prestige ~ education + income + womenlmod = lm(prestige ~ education + income + women, data = Prestige) car::crPlots(lmod)

set.seed(10) y<-c(1:1000) x1<-c(1:1000)*runif(1000,min=0,max=2) x2<-(c(1:1000)*runif(1000,min=0,max=2))^2 x3<-log(c(1:1000)*runif(1000,min=0,max=2)) dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3) pairs(dat)

lmod1<-lm(y~.,data = dat) car::crPlots(lmod1)

lmod2 = lm(y ~ x1 + sqrt(x2)+exp(x3)) car::crPlots(lmod2)

If a relationship is nonlinear but monotone and simple, Mosteller and Tukey’s bulging rule can be used to guide the selection of linearizing transformations.
Compare the graphic below with the type of “bulge” seen in your data; move along the “ladder of transformations” for your response or predictors to determine a helpful transformation.

prestige ~ education + income + womenLet’s look at the cr Plot of the model again:
lmod = lm(prestige ~ education + income + women, data = Prestige) car::crPlots(lmod)

prestige ~ education + income + womenLet’s look at the cr Plot of the model again:
lmod = lm(prestige ~ education + income + women, data = Prestige) car::crPlots(lmod)

The bulge suggests a log or square root transformation of income.
lmod = lm(prestige ~ education + sqrt(income) + women, data = Prestige) car::crPlots(lmod)

log Transformationlmod = lm(prestige ~ education + log(income) + women, data = Prestige) car::crPlots(lmod)

Adding a quadratic for women might further improve the model fit.
Adding a quadratic for women might further improve the model fit.
Adding raw polynomials (x,x2,x3,…) to our model can be problematic.
Orthogonal polynomials have the benefit that adding the higher order term doesn’t impact the estimated coefficients for the other polynomials!
lmod = lm(prestige ~ education + log(income) + poly(women,2), data = Prestige) car::crPlots(lmod)

When a predictor variable is a number between 0 and 1 (or a percentage between 0 and 100), it is not uncommon to observe a logistic relationship between the predictor and the response (e.g, in the cr plot).
In that case, one might transform that predictor using the logit function in the car package to help improve the model fit.
## Warning in car::logit(x): proportions remapped to (0.025, 0.975)

If the relationships between the regressors are strongly nonlinear and not well described by polynomials, then component plus residuals plots may not be effective in identifying nonlinear partial relationships between the response and the regressors.
The CERES plot (Combining conditional Expectations and RESiduals plots) (can) use nonparametric smoothers rather than polynomial regression to adjust for nonlinearities, so it is supposed to be a bit more robust in detecting nonlinearities.
ceresPlots function can be used to generate CERES plots.To construct a CERES plot (for predictor \(x_1\)):
Let \(\hat{x}_{ij} = \hat{g}_{ij}(x_{i1})\) be the fitted value for the \(i\)th observation when regressing \(x_j\) on \(x_1\).
Estimate the coefficients of the model:
\[y = \alpha + \beta_2^{''}x_{i2}+\dots + \beta_{p-1}^{''}x_{i,p-1} + \gamma_{12}\hat{x}_{i2}+\dots +\gamma_{1,p-1}\hat{x}_{i,p-1}+\epsilon_i^{''}.\] Plot the adjusted residuals \[\hat{\epsilon}_i^{(1)} = \hat{\epsilon}^"_i+\hat{\gamma}_{12}\hat{x}_{12}+\dots + \hat{\gamma}_{1,p-1}\hat{x}_{1,p-1}\]
against \(x_1\).
lmod = lm(prestige ~ education + income + women, data = Prestige) car::ceresPlots(lmod)

The savings data has data related to 5 savings-related variables in 50 countries, averaged over the period 1960-1970. The data has the following variables:
sr - savings rate. Personal saving divided by disposable incomepop15 - percent population under age of 15pop75 - percent population over age of 75dpi - per-capita disposable income in dollarsddpi - percent growth rate of dpiAssess whether there are any structural problems for the model regressing savings rate on all other variables.
The transformation approaches presented here are simple, and only work for data with simple non-linearaties.
Other approaches (available in the car package) are the:
For complicated data, no simple transformation or basic linear regression may capture the relationship between the response and regressors.
Raw polynomials can become highly correlated very quickly, leading to numerical instability.
Orthogonal polynomials deal will this by centering each polynomial and (essentially) utilizing the Q part of the QR decomposition of the centered polynomials. (See https://stackoverflow.com/a/19484716/2993948).
x = rnorm(50) # generate polynomials of degree 5 xp = sapply(1:5, function(p) x^p) # highly correlated knitr::kable(round(cor(xp), 3))
| 1.000 | 0.084 | 0.847 | 0.233 | 0.653 |
| 0.084 | 1.000 | 0.286 | 0.909 | 0.458 |
| 0.847 | 0.286 | 1.000 | 0.525 | 0.937 |
| 0.233 | 0.909 | 0.525 | 1.000 | 0.725 |
| 0.653 | 0.458 | 0.937 | 0.725 | 1.000 |
Correlation between the \(Q\) matrix of the \(QR\) decomposition of the centered polynomials and the results produced by the poly function:
# center each polynomial xpc = scale(xp, scale = FALSE) # find the Q matrix of the QR # decomposition of xpc xpoly = qr.Q(qr(xpc)) # correlation between xpoly and poly func. knitr::kable(round(cor(xpoly, poly(x, 5)), 3))
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| -1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | -1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |