17  Model selection

Packages

library(MuMIn) # exhaustive model searching

Data transformations

Sometimes the relationship between two variables is not linear, which makes the linear model unsuitable. The plot below also shows that the variance in Ozone increases with increasing Temperature (heteroscedasticity), which violates the linear regression assumption of homogeneous variance.

data("airquality")
plot(airquality$Temp, airquality$Ozone, cex=.8,
     xlab='Temperature [°F]', ylab='Ozone [ppb]')

Transforming the response variable (here with a log-transformation) can linearize the relationship and homogenize the variance.

lmd <- lm(log(Ozone) ~ Temp, data=airquality)

plot(airquality$Temp, log(airquality$Ozone), cex=.8,
     xlab='Temperature [°F]', ylab='log-Ozone [ppb]')
abline(lmd, col='blue')
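When using the log-scale model for prediction, remember to back-transform with `exp()`. A short sketch (the temperature value 85 °F is an arbitrary illustration):

```r
# Predict Ozone at an (arbitrary) temperature of 85 °F from the log-scale model
data("airquality")
lmd <- lm(log(Ozone) ~ Temp, data = airquality)  # rows with NA are dropped
pred_log <- predict(lmd, newdata = data.frame(Temp = 85))
exp(pred_log)  # back-transform; estimates the median (not the mean) Ozone in ppb
```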

Example data

We will illustrate the variable selection method on data from 50 U.S. states. The variables are population estimate as of July 1, 1975; per capita income (1974); illiteracy (1970, percent of population); life expectancy in years (1969-71); murder and non-negligent manslaughter rate per 100,000 population (1976); percent high-school graduates (1970); mean number of days with minimum temperature below freezing (1931-1960) in capital or large city; and land area in square miles. The data were collected by the US Bureau of the Census. We will take life expectancy as the response and the remaining variables as predictors. Example and most text are taken from Faraway (2002).

data("state")
colnames(state.x77) # state.x77 is a matrix, so use colnames(), not names()
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"    
[6] "HS Grad"    "Frost"      "Area"      
# a fix is necessary to remove spaces in some of the variable names
statedata <- data.frame(state.x77, row.names=state.abb, check.names=TRUE)
head(statedata)
   Population Income Illiteracy Life.Exp Murder HS.Grad Frost   Area
AL       3615   3624        2.1    69.05   15.1    41.3    20  50708
AK        365   6315        1.5    69.31   11.3    66.7   152 566432
AZ       2212   4530        1.8    70.55    7.8    58.1    15 113417
AR       2110   3378        1.9    70.66   10.1    39.9    65  51945
CA      21198   5114        1.1    71.71   10.3    62.6    20 156361
CO       2541   4884        0.7    72.06    6.8    63.9   166 103766

Selection criteria

Model selection is the process of choosing a model from a set of candidate models. To do this, we need to define selection criteria. Selection criteria based on probabilistic measures balance goodness of fit against model simplicity (favoring parsimonious models): models with too many predictors may generalize poorly (overfitting) and may be harder to interpret. Selection criteria based purely on predictive performance (typically estimated by cross-validation) also guard against overfitting, but there model complexity matters only insofar as it affects prediction accuracy.
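As a sketch of a purely performance-based criterion, the leave-one-out cross-validation error of a linear model can be computed cheaply from a single fit via the PRESS statistic; the two model formulas below are illustrative choices, not the result of any selection procedure:

```r
# PRESS: leave-one-out CV sum of squared prediction errors for an lm fit
data("state")
statedata <- data.frame(state.x77, row.names = state.abb, check.names = TRUE)
press <- function(m) sum((residuals(m) / (1 - hatvalues(m)))^2)
m_small <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = statedata)
m_full  <- lm(Life.Exp ~ ., data = statedata)
c(small = press(m_small), full = press(m_full))  # smaller = better LOO fit
```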

AIC

If there are p potential predictors, then there are \(2^p\) possible models. We can fit all these models and choose the best one according to some criterion. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two commonly used criteria.
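A minimal base-R sketch of such an exhaustive search over all \(2^7 = 128\) predictor subsets of the state data (the `dredge()` function from MuMIn automates this):

```r
# Exhaustive search: fit all 2^p predictor subsets and rank them by AIC
data("state")
statedata <- data.frame(state.x77, row.names = state.abb, check.names = TRUE)
preds <- setdiff(names(statedata), "Life.Exp")        # 7 candidate predictors
keep  <- expand.grid(rep(list(c(FALSE, TRUE)), length(preds)))
aics  <- apply(keep, 1, function(k) {
  rhs <- if (any(k)) paste(preds[k], collapse = " + ") else "1"  # "1" = intercept only
  AIC(lm(reformulate(rhs, "Life.Exp"), data = statedata))
})
length(aics)                            # 2^7 = 128 candidate models
preds[unlist(keep[which.min(aics), ])]  # predictors in the AIC-best model
```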

\[AIC = -2\,\text{log-likelihood} + 2p \tag{17.1}\]

For linear regression models, the \(-2\) log-likelihood (up to an additive constant; known as the deviance) is \(n\log(RSS/n)\), where RSS is the residual sum of squares. We want to minimize the AIC. Larger models will fit better and so have a smaller RSS, but they use more parameters. Thus the best choice of model will balance fit with model size.
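We can verify this correspondence in R; `extractAIC()` uses exactly the \(n\log(RSS/n) + 2p\) form for linear models (the full model on the state data below is just an illustration):

```r
# Verify AIC = n*log(RSS/n) + 2p for a linear model fitted to the state data
data("state")
statedata <- data.frame(state.x77, row.names = state.abb, check.names = TRUE)
m   <- lm(Life.Exp ~ ., data = statedata)
n   <- nobs(m)
rss <- sum(residuals(m)^2)
p   <- length(coef(m))                   # parameters, including the intercept
aic_manual <- n * log(rss / n) + 2 * p
all.equal(aic_manual, extractAIC(m)[2])  # TRUE: matches R's extractAIC()
```

Note that `AIC()` reports a different number because it keeps the additive constant of the log-likelihood; only differences between models matter, and both versions rank models identically.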