---
title: "Chapter 6: Linear Model Selection and Regularization"
author: "Solutions to Exercises"
date: "January 23, 2016"
output:
  html_document:
    keep_md: no
---
***
## CONCEPTUAL
***
<a id="ex01"></a>
>EXERCISE 1:
__Part a)__
Best subset will have the smallest training RSS because each procedure picks the model that minimizes training RSS at each size, and best subset tries every model that forward and backward selection try (and many more).
__Part b)__
The model with the best test RSS could be any of the three. Best subset could easily overfit if the number of predictors $p$ is large relative to the number of observations $n$. Forward and backward selection might not converge on the same model, but they consider the same number of models, so it is hard to say which selection process would be better.
__Part c)__


* i. TRUE
* ii. TRUE
* iii. FALSE
* iv. FALSE
* v. FALSE
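As a quick numerical illustration of Parts a) and b) (a sketch, not part of the original solution; the simulated data and seed are arbitrary choices), the chunk below fits best subset and forward selection with `leaps::regsubsets` and compares training RSS at each model size.

```{r}
# Illustration only: best subset training RSS never exceeds forward selection's
# at any model size, because best subset searches all models of that size.
library(leaps)
set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + 0.5 * x[, 2] - x[, 3] + rnorm(n)
dat <- data.frame(y, x)
rss.best <- summary(regsubsets(y ~ ., data = dat, nvmax = p, method = "exhaustive"))$rss
rss.fwd  <- summary(regsubsets(y ~ ., data = dat, nvmax = p, method = "forward"))$rss
cbind(size = 1:p, rss.best, rss.fwd)  # rss.best <= rss.fwd in every row
```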

***
<a id="ex02"></a>
>EXERCISE 2:
__Part a)__
iii. is TRUE - lasso puts a budget constraint on least squares (less flexible)
__Part b)__
iii. is TRUE - ridge also puts a budget constraint on least squares (less flexible)
__Part c)__
ii. is TRUE - a non-linear model would be more flexible and have higher variance, less bias
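The budget-constraint intuition can be seen in the coefficient paths (a sketch, not part of the original solution; the simulated data are an arbitrary choice): as the penalty grows, the lasso sets coefficients exactly to zero while ridge only shrinks them.

```{r}
# Sketch: lasso (alpha=1) vs ridge (alpha=0) coefficient paths on simulated data
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)
par(mfrow = c(1, 2))
plot(glmnet(x, y, alpha = 1), xvar = "lambda")  # lasso: paths hit exactly zero
plot(glmnet(x, y, alpha = 0), xvar = "lambda")  # ridge: paths shrink but stay nonzero
```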


***
<a id="ex04"></a>
>EXERCISE 4:
__Part a)__
iii. is TRUE - training error increases steadily
__Part b)__
ii. is TRUE - test error will decrease initially and then increase
__Part c)__
iv. is TRUE - variance always decreases with more constraints
__Part d)__
iii. is TRUE - bias always increases with less model flexibility
__Part e)__
v. is TRUE - the irreducible error is a constant value, not related to model selection
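These trends can be checked numerically (a sketch, not part of the original solution; the simulated data, grid, and split are arbitrary): for ridge regression, training MSE rises steadily with $\lambda$ while test MSE typically traces the U shape.

```{r}
# Sketch: train/test MSE of ridge regression across a lambda grid
library(glmnet)
set.seed(1)
p <- 20
x <- matrix(rnorm(150 * p), 150, p)
y <- drop(x %*% rnorm(p, sd = 0.5)) + rnorm(150, sd = 3)
train <- 1:50
grid <- 10^seq(3, -3, length = 50)
fit <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
train.mse <- colMeans((predict(fit, x[train, ]) - y[train])^2)
test.mse  <- colMeans((predict(fit, x[-train, ]) - y[-train])^2)
matplot(log(grid), cbind(train.mse, test.mse), type = "l", lty = 1,
        col = c("blue", "red"), xlab = "log(lambda)", ylab = "MSE")  # blue = train, red = test
```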


***
<a id="ex05"></a>
>EXERCISE 5:
__Part a)__
Ridge: minimize $(y_1 - \hat\beta_1x_{11} - \hat\beta_2x_{12})^2 + (y_2 - \hat\beta_1x_{21} - \hat\beta_2x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2)$
__Part b)__
Step 1: Expanding the equation from Part a:

$$(y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2) \\
= (y_1^2 + \hat\beta_1^2 x_{11}^2 + \hat\beta_2^2 x_{12}^2 - 2 \hat\beta_1 x_{11} y_1 - 2 \hat\beta_2 x_{12} y_1 + 2 \hat\beta_1 \hat\beta_2 x_{11} x_{12}) \\
+ (y_2^2 + \hat\beta_1^2 x_{21}^2 + \hat\beta_2^2 x_{22}^2 - 2 \hat\beta_1 x_{21} y_2 - 2 \hat\beta_2 x_{22} y_2 + 2 \hat\beta_1 \hat\beta_2 x_{21} x_{22}) \\
+ \lambda \hat\beta_1^2 + \lambda \hat\beta_2^2$$

Step 2: Taking the partial derivative with respect to $\hat\beta_1$ and setting the equation to 0 to minimize:

$$\frac{\partial }{\partial \hat\beta_1}: (2\hat\beta_1x_{11}^2-2x_{11}y_1+2\hat\beta_2x_{11}x_{12}) + (2\hat\beta_1x_{21}^2-2x_{21}y_2+2\hat\beta_2x_{21}x_{22}) + 2\lambda\hat\beta_1 = 0$$

Step 3: Setting $x_{11}=x_{12}=x_1$ and $x_{21}=x_{22}=x_2$ and dividing both sides of the equation by 2:

$$(\hat\beta_1x_1^2-x_1y_1+\hat\beta_2x_1^2) + (\hat\beta_1x_2^2-x_2y_2+\hat\beta_2x_2^2) + \lambda\hat\beta_1 = 0$$

$$\hat\beta_1 (x_1^2+x_2^2) + \hat\beta_2 (x_1^2+x_2^2) + \lambda\hat\beta_1 = x_1y_1 + x_2y_2$$

Step 4: Add $2\hat\beta_1x_1x_2$ and $2\hat\beta_2x_1x_2$ to both sides of the equation:

$$\hat\beta_1 (x_1^2 + x_2^2 + 2x_1x_2) + \hat\beta_2 (x_1^2 + x_2^2 + 2x_1x_2) + \lambda\hat\beta_1 = x_1y_1 + x_2y_2 + 2\hat\beta_1x_1x_2 + 2\hat\beta_2x_1x_2$$

Since $x_1 + x_2 = 0$, the terms $(x_1^2 + x_2^2 + 2x_1x_2) = (x_1+x_2)^2$ drop out, and the corresponding equation for $\hat\beta_2$ is identical with $\lambda\hat\beta_2$ in place of $\lambda\hat\beta_1$, so $\hat\beta_1 = \hat\beta_2$.

__Part d)__

$$\lambda\frac{|\beta_1|}{\beta_1} = \lambda\frac{|\beta_2|}{\beta_2}$$

So it seems that the lasso just requires that $\beta_1$ and $\beta_2$ are both positive or both negative (ignoring the possibility of 0).
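A quick numerical check of these conclusions (a sketch, not part of the original solution; the small simulated data set is an arbitrary choice): with two identical predictors, `glmnet` ridge splits the coefficient essentially equally, while the lasso split is not unique.

```{r}
# Sketch: identical columns -> ridge gives (approximately) equal coefficients,
# lasso only pins down their sum, so one column may carry all the weight.
library(glmnet)
set.seed(1)
x1 <- rnorm(50)
x  <- cbind(x1, x1)              # x_{i1} = x_{i2}
y  <- 2 * x1 + rnorm(50)
coef(glmnet(x, y, alpha = 0, lambda = 0.1))  # ridge
coef(glmnet(x, y, alpha = 1, lambda = 0.1))  # lasso
```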


***
<a id="ex06"></a>
>EXERCISE 6:
__Part a)__

```{r}
betas <- seq(-10,10,0.1)
eq.ridge <- function(beta, y=7, lambda=10) (y-beta)^2 + lambda*beta^2
plot(betas, eq.ridge(betas), xlab="beta", main="Ridge Regression Optimization", pch=1)
points(7/(1+10), eq.ridge(7/(1+10)), pch=16, col="red", cex=2)
```

For $y=7$ and $\lambda=10$, $\hat\beta=\frac{7}{1+10}$ minimizes the ridge regression equation
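As a quick sanity check (not part of the original solution), `optimize()` agrees with the closed-form minimizer $y/(1+\lambda)$:

```{r}
# numerical minimizer of the one-parameter ridge criterion
eq.ridge <- function(beta, y=7, lambda=10) (y-beta)^2 + lambda*beta^2
optimize(eq.ridge, interval = c(-10, 10))$minimum  # approximately 7/11 = 0.636
```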

__Part b)__
```{r}
betas <- seq(-10,10,0.1)
eq.lasso <- function(beta, y=7, lambda=10) (y-beta)^2 + lambda*abs(beta)
plot(betas, eq.lasso(betas), xlab="beta", main="Lasso Regression Optimization", pch=1)
points(7-10/2, eq.lasso(7-10/2), pch=16, col="red", cex=2)
```

For $y=7$ and $\lambda=10$, $\hat\beta=7-\frac{10}{2}$ minimizes the lasso regression equation

***
<a id="ex07"></a>
>EXERCISE 7:
__Part a)__
[*... will come back to this. maybe.*]
__Part b)__
[*... will come back to this. maybe.*]
__Part c)__
[*... will come back to this. maybe.*]
__Part d)__
[*... will come back to this. maybe.*]
__Part e)__
[*... will come back to this. maybe.*]

***
<a id="ex08"></a>
>EXERCISE 8:
__Part c)__
```{r}
min.cp <- which.min(reg.summary$cp)
plot(reg.summary$cp, xlab="Number of Poly(X)", ylab="Best Subset Cp", type="l")
points(min.cp, reg.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(reg.summary$bic)
plot(reg.summary$bic, xlab="Number of Poly(X)", ylab="Best Subset BIC", type="l")
points(min.bic, reg.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(reg.summary$adjr2)
plot(reg.summary$adjr2, xlab="Number of Poly(X)", ylab="Best Subset Adjusted R^2", type="l")
points(min.adjr2, reg.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
coef(regfit.full, min.cp)
coef(regfit.full, min.bic)
coef(regfit.full, min.adjr2)
```

__Part d)__

```{r}
# forward selection
regfit.fwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="forward")
(fwd.summary <- summary(regfit.fwd))
# backward selection
regfit.bwd <- regsubsets(Y~poly(X,10,raw=T), data=data.frame(Y,X), nvmax=10, method="backward")
(bwd.summary <- summary(regfit.bwd))
par(mfrow=c(3,2))
min.cp <- which.min(fwd.summary$cp)
plot(fwd.summary$cp, xlab="Number of Poly(X)", ylab="Forward Selection Cp", type="l")
points(min.cp, fwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.cp <- which.min(bwd.summary$cp)
plot(bwd.summary$cp, xlab="Number of Poly(X)", ylab="Backward Selection Cp", type="l")
points(min.cp, bwd.summary$cp[min.cp], col="red", pch=4, lwd=5)
min.bic <- which.min(fwd.summary$bic)
plot(fwd.summary$bic, xlab="Number of Poly(X)", ylab="Forward Selection BIC", type="l")
points(min.bic, fwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.bic <- which.min(bwd.summary$bic)
plot(bwd.summary$bic, xlab="Number of Poly(X)", ylab="Backward Selection BIC", type="l")
points(min.bic, bwd.summary$bic[min.bic], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(fwd.summary$adjr2)
plot(fwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Forward Selection Adjusted R^2", type="l")
points(min.adjr2, fwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
min.adjr2 <- which.max(bwd.summary$adjr2)
plot(bwd.summary$adjr2, xlab="Number of Poly(X)", ylab="Backward Selection Adjusted R^2", type="l")
points(min.adjr2, bwd.summary$adjr2[min.adjr2], col="red", pch=4, lwd=5)
# coefficients of selected models
coef(regfit.fwd, which.min(fwd.summary$cp))
coef(regfit.bwd, which.min(bwd.summary$cp))
coef(regfit.fwd, which.min(fwd.summary$bic))
coef(regfit.bwd, which.min(bwd.summary$bic))
coef(regfit.fwd, which.max(fwd.summary$adjr2))
coef(regfit.bwd, which.max(bwd.summary$adjr2))
```

Best subset, forward selection and backward selection all resulted in the same best models

__Part e)__

```{r}
require(glmnet)
xmat <- model.matrix(Y~poly(X,10,raw=T))[,-1]
# lasso regression
xmat <- model.matrix(Y2~poly(X,10,raw=T))[,-1]
lasso.mod <- cv.glmnet(xmat, Y2, alpha=1)
(lambda <- lasso.mod$lambda.min)
par(mfrow=c(1,1))
plot(lasso.mod)
predict(lasso.mod, s=lambda, type="coefficients")
```

Lasso selects the correct model but best subset diagnostics indicate using 1 to 4 predictors


***
<a id="ex09"></a>
>EXERCISE 9:
__Part a)__

```{r}
require(ISLR)
data(College)
set.seed(1)
trainid <- sample(1:nrow(College), nrow(College)/2)
train <- College[trainid,]
test <- College[-trainid,]
str(College)
```

__Part b)__

```{r}
fit.lm <- lm(Apps~., data=train)
pred.lm <- predict(fit.lm, test)
(err.lm <- mean((test$Apps - pred.lm)^2))  # test error
```

__Part c)__

```{r}
require(glmnet)
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
fit.ridge <- cv.glmnet(xmat.train, train$Apps, alpha=0)
(lambda <- fit.ridge$lambda.min)  # optimal lambda
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$Apps - pred.ridge)^2))  # test error
```

__Part d)__

```{r}
require(glmnet)
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
fit.lasso <- cv.glmnet(xmat.train, train$Apps, alpha=1)
(lambda <- fit.lasso$lambda.min)  # optimal lambda
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$Apps - pred.lasso)^2))  # test error
```

__Part g)__

```{r}
err.all <- c(err.lm, err.ridge, err.lasso, err.pcr, err.pls)
names(err.all) <- c("lm", "ridge", "lasso", "pcr", "pls")
barplot(err.all)
```
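Parts e) and f) (PCR and PLS) are not included in this preview; the `err.pcr` and `err.pls` objects used in the Part g) chunk would come from fits along these lines, run before that chunk (a sketch based on the standard `pls` package workflow with `pcr()`, `plsr()`, and `validation="CV"`, not the author's code; choosing `ncomp` by adjusted-CV RMSEP is an assumption):

```{r}
require(pls)
# PCR with the number of components chosen by cross-validation (assumed approach)
fit.pcr <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
ncomp.pcr <- which.min(RMSEP(fit.pcr)$val["adjCV", 1, -1])  # drop the 0-component entry
pred.pcr <- predict(fit.pcr, test, ncomp = ncomp.pcr)
(err.pcr <- mean((test$Apps - pred.pcr)^2))  # test error
# PLS, same idea
fit.pls <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
ncomp.pls <- which.min(RMSEP(fit.pls)$val["adjCV", 1, -1])
pred.pls <- predict(fit.pls, test, ncomp = ncomp.pls)
(err.pls <- mean((test$Apps - pred.pls)^2))  # test error
```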

The test errors aren't much different. Ridge and lasso seem to perform slightly better, while PCR/PLS don't show any improvement over the full linear regression model.

```{r}
plot(test$Apps, pred.lm)
```
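To make "aren't much different" concrete, test $R^2$ for each model can be computed from the errors above (a sketch, not part of the original solution):

```{r}
# test R^2 relative to predicting the mean number of applications
test.ss <- mean((test$Apps - mean(test$Apps))^2)
1 - err.lm / test.ss     # linear model
1 - err.ridge / test.ss  # ridge
1 - err.lasso / test.ss  # lasso
```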

***
<a id="ex10"></a>
>EXERCISE 10:
__Part a)__

```{r}
set.seed(1)
eps <- rnorm(1000)
xmat <- matrix(rnorm(1000*20), ncol=20)
betas <- sample(-5:5, 20, replace=TRUE)
betas[c(3,6,7,10,13,17)] <- 0
betas
y <- xmat %*% betas + eps
```

__Part b)__

```{r}
set.seed(1)
trainid <- sample(1:1000, 100, replace=FALSE)
xmat.train <- xmat[trainid,]
xmat.test <- xmat[-trainid,]
y.train <- y[trainid,]
y.test <- y[-trainid,]
train <- data.frame(y=y.train, xmat.train)
test <- data.frame(y=y.test, xmat.test)
```

__Part c)__

```{r}
require(leaps)
# predict function from chapter 6 labs
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[,xvars] %*% coefi
}
regfit.full <- regsubsets(y~., data=train, nvmax=20)
(coef.best <- coef(regfit.full, id=which.min(err.full)))
betas[betas != 0]
names(betas) <- paste0("X", 1:20)
merge(data.frame(beta=names(betas), betas), data.frame(beta=names(coef.best), coef.best), all.x=TRUE, sort=FALSE)
```

The best subset model selected all the correct predictors

__Part g)__

```{r}
err.coef <- rep(NA, 20)
for(i in 1:20) {
  coef.i <- coef(regfit.full, id=i)
  df.err <- merge(data.frame(beta=names(betas), betas), data.frame(beta=names(coef.i), coef.i), all.x=TRUE)
  df.err[is.na(df.err[,3]),3] <- 0
  err.coef[i] <- sqrt(sum((df.err[,2] - df.err[,3])^2))
}
plot(1:20, err.coef, type="b", main="Coefficient Error", xlab="Number of Predictors")
points(which.min(err.coef), err.coef[which.min(err.coef)], col="red", pch=16)
```

The coefficient error plot looks very similar to the test error plot
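The test error plot referred to above, and the `err.full` vector used in the Part c) chunk, are not shown in this preview; they could be produced with the `predict.regsubsets()` helper along these lines (a sketch, not the author's code; it would need to run before the `coef.best` line in Part c)):

```{r}
# validation-set MSE for each best subset model size
err.full <- rep(NA, 20)
for (i in 1:20) {
  pred.i <- predict(regfit.full, test, id = i)  # dispatches to predict.regsubsets()
  err.full[i] <- mean((test$y - pred.i)^2)
}
plot(1:20, err.full, type = "b", xlab = "Number of Predictors", ylab = "Test MSE")
which.min(err.full)
```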


***
<a id="ex11"></a>
>EXERCISE 11:

__Part a)__

```{r}
require(leaps)   # forward and backward selection
require(glmnet)  # ridge and lasso
require(MASS)    # Boston data set
data(Boston)

# split data into training and test sets
set.seed(1)
trainid <- sample(1:nrow(Boston), nrow(Boston)/2)
train <- Boston[trainid,]
test <- Boston[-trainid,]
xmat.train <- model.matrix(crim~., data=train)[,-1]
xmat.test <- model.matrix(crim~., data=test)[,-1]
str(Boston)

# ridge regression model
fit.ridge <- cv.glmnet(xmat.train, train$crim, alpha=0)
(lambda <- fit.ridge$lambda.min)  # optimal lambda
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$crim - pred.ridge)^2))  # test error
predict(fit.ridge, s=lambda, type="coefficients")

# lasso regression model
fit.lasso <- cv.glmnet(xmat.train, train$crim, alpha=1)
(lambda <- fit.lasso$lambda.min)  # optimal lambda
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$crim - pred.lasso)^2))  # test error
predict(fit.lasso, s=lambda, type="coefficients")

# predict function from chapter 6 labs
```
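The preview cuts off in the middle of this chunk, right after the comment about the chapter 6 `predict` function. A minimal sketch of how the best subset comparison might continue, reusing the `predict.regsubsets()` helper from Exercise 10, is shown below (an assumption, not the author's code):

```{r}
# Sketch only: best subset selection for crim, evaluated on the held-out test set
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  mat[, names(coefi)] %*% coefi
}
regfit.best <- regsubsets(crim ~ ., data = train, nvmax = ncol(Boston) - 1)
err.best <- rep(NA, ncol(Boston) - 1)
for (i in seq_along(err.best)) {
  err.best[i] <- mean((test$crim - predict(regfit.best, test, id = i))^2)
}
which.min(err.best)  # model size with the lowest test MSE
min(err.best)        # compare with err.ridge and err.lasso above
```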