















Linear Model Selection and Regularization - Exercise R code as a solution manual for ISLR, An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani)
title: "Chapter 6: Linear Model Selection and Regularization" author: "Solutions to Exercises" date: "January 23, 2016" output: html_document: keep_md: no
EXERCISE 1:

Part a) Best subset will have the smallest training RSS because the models are chosen to minimize training RSS, and best subset tries every model that forward and backward selection try, plus all the others.

Part b) The model with the best test RSS could come from any of the three approaches. Best subset can easily overfit when the number of predictors $p$ is large relative to the number of observations $n$. Forward and backward selection need not converge on the same model; they consider the same number of models, and it is hard to say in advance which selection process would do better.

Part c)
EXERCISE 2:

Part a) iii. is TRUE - lasso puts a budget constraint on least squares (less flexible)

Part b) iii. is TRUE - ridge also puts a budget constraint on least squares (less flexible)

Part c) ii. is TRUE - a non-linear model would be more flexible and have higher variance, less bias
EXERCISE 3: Part a)

EXERCISE 4:

Part a) iii. is TRUE - training error increases steadily as $\lambda$ increases

Part b) ii. is TRUE - test error will decrease initially and then increase

Part c) iv. is TRUE - variance always decreases as the model becomes more constrained

Part d) iii. is TRUE - bias always increases as the model becomes less flexible

Part e) v. is TRUE - the irreducible error is a constant value, not related to model selection
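As a rough illustration of these answers, here is a minimal sketch using hypothetical simulated data and `glmnet` ridge fits (none of this is specified by the exercise): training MSE rises steadily with $\lambda$, while test MSE traces the usual U-shape.

```{r}
library(glmnet)
set.seed(1)
# hypothetical data: 50 train and 50 test observations, 10 predictors
beta <- rnorm(10)
x.train <- matrix(rnorm(50*10), ncol=10)
y.train <- drop(x.train %*% beta + rnorm(50))
x.test <- matrix(rnorm(50*10), ncol=10)
y.test <- drop(x.test %*% beta + rnorm(50))

grid <- 10^seq(2, -3, length=50)  # lambda grid, large to small
fit <- glmnet(x.train, y.train, alpha=0, lambda=grid)

# MSE at every lambda in the grid
mse <- function(x, y) colMeans((predict(fit, x) - y)^2)
tr <- mse(x.train, y.train)
te <- mse(x.test, y.test)
plot(log(grid), tr, type="l", ylim=range(c(tr, te)),
     xlab="log(lambda)", ylab="MSE")
lines(log(grid), te, col="red")
legend("topleft", legend=c("train", "test"), col=c("black", "red"), lty=1)
```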
EXERCISE 5: Part a)
Ridge: minimize
$$(y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2)$$

Part b)

Step 1: Expanding the equation from Part a:
$$(y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2)$$

Setting the partial derivatives of this expression with respect to $\hat\beta_1$ and $\hat\beta_2$ to zero and using $x_{11} = x_{12}$ and $x_{21} = x_{22}$ gives two identical equations, so $\hat\beta_1 = \hat\beta_2$.
For the lasso, the same approach leads to the condition
$$\lambda\frac{|\hat\beta_1|}{\hat\beta_1} = \lambda\frac{|\hat\beta_2|}{\hat\beta_2}$$
So the lasso only requires that $\hat\beta_1$ and $\hat\beta_2$ are both positive or both negative (ignoring the possibility of 0), which means there are many possible solutions.
EXERCISE 6: Part a)
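As a quick numerical check, a minimal sketch (assuming $y=7$ and $\lambda=10$, as in the answer below) that plots the one-parameter ridge objective $(y-\beta)^2 + \lambda\beta^2$ and marks the closed-form minimizer $y/(1+\lambda)$:

```{r}
# one-parameter ridge objective: (y - beta)^2 + lambda * beta^2
y.val <- 7
lambda <- 10
beta.grid <- seq(-2, 2, length.out=200)
obj <- (y.val - beta.grid)^2 + lambda * beta.grid^2
plot(beta.grid, obj, type="l", xlab="beta", ylab="ridge objective")
# mark the closed-form minimizer y / (1 + lambda)
beta.min <- y.val / (1 + lambda)
points(beta.min, (y.val - beta.min)^2 + lambda * beta.min^2, col="red", pch=16)
```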
For $y=7$ and $\lambda=10$, $\hat\beta = \frac{y}{1+\lambda} = \frac{7}{1+10}$ minimizes the ridge regression equation.

Part b)
EXERCISE 8: Part d)
```{r}
library(leaps)  # regsubsets()

# forward selection
regfit.fwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="forward")
(fwd.summary <- summary(regfit.fwd))

# backward selection
regfit.bwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="backward")
(bwd.summary <- summary(regfit.bwd))

# Cp, BIC, and adjusted R^2 for both selection methods
par(mfrow=c(3,2))
min.cp <- which.min(fwd.summary$cp)
plot(fwd.summary$cp, xlab="Number of Poly(X)", ylab="Forward Selection Cp", type="l")
points(min.cp, fwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.cp <- which.min(bwd.summary$cp)
plot(bwd.summary$cp, xlab="Number of Poly(X)", ylab="Backward Selection Cp", type="l")
points(min.cp, bwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(fwd.summary$bic)
plot(fwd.summary$bic, xlab="Number of Poly(X)", ylab="Forward Selection BIC", type="l")
points(min.bic, fwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.bic <- which.min(bwd.summary$bic)
plot(bwd.summary$bic, xlab="Number of Poly(X)", ylab="Backward Selection BIC", type="l")
points(min.bic, bwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(fwd.summary$adjr2)
plot(fwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Forward Selection Adjusted R^2", type="l")
points(min.adjr2, fwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(bwd.summary$adjr2)
plot(bwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Backward Selection Adjusted R^2", type="l")
points(min.adjr2, bwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)

# coefficients of selected models
coef(regfit.fwd, which.min(fwd.summary$cp))
coef(regfit.bwd, which.min(bwd.summary$cp))
coef(regfit.fwd, which.min(fwd.summary$bic))
coef(regfit.bwd, which.min(bwd.summary$bic))
coef(regfit.fwd, which.max(fwd.summary$adjr2))
coef(regfit.bwd, which.max(bwd.summary$adjr2))
```
Best subset, forward selection, and backward selection all resulted in the same best models.

Part e)
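The lasso fit for this part is not shown above; a minimal sketch with `cv.glmnet`, assuming `X` and `Y` are the simulated data from parts (a) and (b), could look like this:

```{r}
library(glmnet)
# polynomial design matrix matching the subset-selection fits (drop the intercept column)
x.poly <- model.matrix(Y ~ poly(X, 10, raw=TRUE), data=data.frame(Y, X))[, -1]
# cross-validation to choose lambda, then the coefficients at the chosen lambda
cv.lasso <- cv.glmnet(x.poly, Y, alpha=1)
plot(cv.lasso)
predict(cv.lasso, s=cv.lasso$lambda.min, type="coefficients")
```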
Lasso selects the correct model, but the best subset diagnostics indicate using 1 to 4 predictors.
EXERCISE 9: Part a)
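The data split for this part is not shown; a minimal sketch, assuming the `College` data from the `ISLR` package and a roughly 50/50 train/test split (producing the `train` and `test` data frames used below):

```{r}
library(ISLR)
data(College)
set.seed(1)
# split the observations roughly in half
trainid <- sample(1:nrow(College), round(nrow(College)/2))
train <- College[trainid, ]
test <- College[-trainid, ]
```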
Part b)
```{r}
fit.lm <- lm(Apps~., data=train)
pred.lm <- predict(fit.lm, test)
(err.lm <- mean((test$Apps - pred.lm)^2))  # test error
```
Part c)
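The ridge regression fit is not shown; a minimal sketch with `cv.glmnet`, assuming the `train`/`test` split above (the `train.mat` and `test.mat` names are introduced only for this sketch):

```{r}
library(glmnet)
# model matrices for glmnet (drop the intercept column)
train.mat <- model.matrix(Apps ~ ., data=train)[, -1]
test.mat <- model.matrix(Apps ~ ., data=test)[, -1]
# ridge regression (alpha = 0) with lambda chosen by cross-validation
cv.ridge <- cv.glmnet(train.mat, train$Apps, alpha=0)
pred.ridge <- predict(cv.ridge, s=cv.ridge$lambda.min, newx=test.mat)
(err.ridge <- mean((test$Apps - pred.ridge)^2))  # test error
```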
Part d)
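Similarly for the lasso, a minimal sketch reusing the model matrices from the ridge sketch above:

```{r}
# lasso (alpha = 1) with lambda chosen by cross-validation
cv.lasso <- cv.glmnet(train.mat, train$Apps, alpha=1)
pred.lasso <- predict(cv.lasso, s=cv.lasso$lambda.min, newx=test.mat)
(err.lasso <- mean((test$Apps - pred.lasso)^2))  # test error
# coefficient estimates at the chosen lambda (many are shrunk to zero)
predict(cv.lasso, s=cv.lasso$lambda.min, type="coefficients")
```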
The test errors aren't much different. Ridge and lasso seem to perform slightly better, while PCR and PLS don't show any improvement over the full linear regression model.
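The PCR and PLS fits referred to here are not shown; a minimal sketch with the `pls` package, assuming the same `train`/`test` split (the choice of `ncomp = 10` is only a placeholder, in practice it would be read off the CV plots):

```{r}
library(pls)
# principal components regression with cross-validation
fit.pcr <- pcr(Apps ~ ., data=train, scale=TRUE, validation="CV")
validationplot(fit.pcr, val.type="MSEP")
pred.pcr <- predict(fit.pcr, test, ncomp=10)  # ncomp is a placeholder, read from the CV plot
mean((test$Apps - pred.pcr)^2)  # test error

# partial least squares with cross-validation
fit.pls <- plsr(Apps ~ ., data=train, scale=TRUE, validation="CV")
validationplot(fit.pls, val.type="MSEP")
pred.pls <- predict(fit.pls, test, ncomp=10)  # ncomp is a placeholder, read from the CV plot
mean((test$Apps - pred.pls)^2)  # test error
```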
```{r}
plot(test$Apps, pred.lm)
```
EXERCISE 10: Part a)
```{r}
set.seed(1)
eps <- rnorm(1000)
xmat <- matrix(rnorm(1000*20), ncol=20)
betas <- sample(-5:5, 20, replace=TRUE)
betas[c(3,6,7,10,13,17)] <- 0
betas
y <- xmat %*% betas + eps
```
Part b)
```{r}
set.seed(1)
trainid <- sample(1:1000, 100, replace=FALSE)
xmat.train <- xmat[trainid,]
xmat.test <- xmat[-trainid,]
y.train <- y[trainid,]
y.test <- y[-trainid,]
train <- data.frame(y=y.train, xmat.train)
test <- data.frame(y=y.test, xmat.test)
```
Part c)
```{r}
require(leaps)

# predict function from chapter 6 labs
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

regfit.full <- regsubsets(y~., data=train, nvmax=20)

# test MSE for each model size; err.full is used below to pick the best model
err.full <- rep(NA, 20)
for(i in 1:20) {
  pred.full <- predict(regfit.full, test, id=i)
  err.full[i] <- mean((test$y - pred.full)^2)
}
which.min(err.full)  # number of predictors in the model with the lowest test MSE

(coef.best <- coef(regfit.full, id=which.min(err.full)))
betas[betas != 0]
names(betas) <- paste0("X", 1:20)
merge(data.frame(beta=names(betas), betas),
      data.frame(beta=names(coef.best), coef.best),
      all.x=TRUE, sort=FALSE)
```
The best subset model selected all of the correct predictors.

Part g)
```{r}
err.coef <- rep(NA, 20)
for(i in 1:20) {
  coef.i <- coef(regfit.full, id=i)
  df.err <- merge(data.frame(beta=names(betas), betas),
                  data.frame(beta=names(coef.i), coef.i), all.x=TRUE)
  df.err[is.na(df.err[,3]), 3] <- 0
  err.coef[i] <- sqrt(sum((df.err[,2] - df.err[,3])^2))
}
plot(1:20, err.coef, type="b", main="Coefficient Error", xlab="Number of Predictors")
points(which.min(err.coef), err.coef[which.min(err.coef)], col="red", pch=16)
```
The coefficient error plot looks very similar to the test error plot.
EXERCISE 11: Part a)