1. We will perform k-nearest-neighbors in this problem, in a setting with 2 classes,
25 observations per class, and p = 2 features. We will call one class the “red”
class and the other class the “blue” class. The observations in the red class
are drawn i.i.d. from a Np(µr, I) distribution, and the observations in the blue
class are drawn i.i.d. from a Np(µb, I) distribution, where µr = (0, 0)^T is the
mean in the red class, and where µb = (1.5, 1.5)^T is the mean in the blue class.
(a) Generate a training set, consisting of 25 observations from the red class
and 25 observations from the blue class. (You will want to use the R
function rnorm.) Plot the training set. Make sure that the axes are
properly labeled, and that the observations are colored according to their
class label.
(b) Now generate a test set consisting of 25 observations from the red class
and 25 observations from the blue class. On a single plot, display both the
training and test set, using one symbol to indicate training observations
(e.g. circles) and another symbol to indicate the test observations (e.g.
squares). Make sure that the axes are properly labeled, that the symbols
for training and test observations are explained in a legend, and that the
observations are colored according to their class label.
(c) Using the knn function in the library class, fit a k-nearest neighbors
model on the training set, for a range of values of k from 1 to 20. Make a
plot that displays the value of 1/k on the x-axis, and classification error
(both training error and test error) on the y-axis. Make sure all axes and
curves are properly labeled. Explain your results.
(d) For the value of k that resulted in the smallest test error in part (c)
above, make a plot displaying the test observations as well as their true
and predicted class labels. Make sure that all axes and points are clearly
labeled.
(e) In this example, what is the Bayes error rate? Justify your answer.
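For reference, a possible starter sketch for parts (a)–(c), assuming the class package is installed; the seed, object names, and plotting choices here are illustrative only, not part of the assignment:
library(class)  # provides knn()
set.seed(1)
n <- 25
# (a) training set: 25 red observations from N((0,0), I), 25 blue from N((1.5,1.5), I)
train.x <- rbind(matrix(rnorm(2 * n), ncol = 2),
                 matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
train.y <- factor(rep(c("red", "blue"), each = n))
plot(train.x, col = as.character(train.y), xlab = "X1", ylab = "X2")
# (b) test set, generated in exactly the same way
test.x <- rbind(matrix(rnorm(2 * n), ncol = 2),
                matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
test.y <- factor(rep(c("red", "blue"), each = n))
# (c) training and test error for k = 1, ..., 20
ks <- 1:20
err <- sapply(ks, function(k) {
  c(train = mean(knn(train.x, train.x, train.y, k = k) != train.y),
    test  = mean(knn(train.x, test.x,  train.y, k = k) != test.y))
})
matplot(1 / ks, t(err), type = "b", pch = 1,
        xlab = "1/k", ylab = "classification error")
legend("topleft", legend = c("training", "test"), col = 1:2, lty = 1:2)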
2. We will once again perform k-nearest-neighbors in a setting with p = 2 features.
But this time, we’ll generate the data differently: let X1 ∼ Unif[0, 1] and
X2 ∼ Unif[0, 1], i.e. the observations for each feature are i.i.d. from a uniform
distribution. An observation belongs to class “red” if (X1 − 0.5)^2 + (X2 − 0.5)^2 > 0.15
and X1 > 0.5; to class “green” if (X1 − 0.5)^2 + (X2 − 0.5)^2 > 0.15 and
X1 ≤ 0.5; and to class “blue” otherwise.
(a) Generate a training set of n = 200 observations. (You will want to use
the R function runif.) Plot the training set. Make sure that the axes are
properly labeled, and that the observations are colored according to their
class label.
(b) Now generate a test set consisting of another 200 observations. On a single
plot, display both the training and test set, using one symbol to indicate
training observations (e.g. circles) and another symbol to indicate the
test observations (e.g. squares). Make sure that the axes are properly
labeled, that the symbols for training and test observations are explained
in a legend, and that the observations are colored according to their class
label.
(c) Using the knn function in the library class, fit a k-nearest neighbors
model on the training set, for a range of values of k from 1 to 50. Make a
plot that displays the value of 1/k on the x-axis, and classification error
(both training error and test error) on the y-axis. Make sure all axes and
curves are properly labeled. Explain your results.
(d) For the value of k that resulted in the smallest test error in part (c)
above, make a plot displaying the test observations as well as their true
and predicted class labels. Make sure that all axes and points are clearly
labeled.
(e) In this example, what is the Bayes error rate? Justify your answer, and
explain how it relates to your findings in (c) and (d).
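The class rule in this problem is easy to vectorize; a possible sketch for part (a), with arbitrary seed and object names (the kNN analysis in parts (b)–(e) can reuse the pattern from Problem 1):
set.seed(2)
n <- 200
x1 <- runif(n)
x2 <- runif(n)
r2 <- (x1 - 0.5)^2 + (x2 - 0.5)^2          # squared distance from (0.5, 0.5)
cl <- factor(ifelse(r2 > 0.15 & x1 > 0.5, "red",
             ifelse(r2 > 0.15 & x1 <= 0.5, "green", "blue")))
plot(x1, x2, col = as.character(cl), xlab = "X1", ylab = "X2")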
3. For each scenario, determine whether it is a regression or a classification problem, determine whether the goal is inference or prediction, and state the values
of n (sample size) and p (number of predictors).
(a) I want to predict each student’s final exam score based on his or her
homework scores. There are 50 students enrolled in the course, and each
student has completed 8 homeworks.
(b) I want to understand the factors that contribute to whether or not a
student passes this course. The factors that I consider are (i) whether or
not the student has previous programming experience; (ii) whether or not
the student has previously studied linear algebra; (iii) whether or not the
student has taken a previous stats/probability course; (iv) whether or not
the student attends office hours; (v) the student’s overall GPA; (vi) the
student’s year (e.g. freshman, sophomore, junior, senior, or grad student).
I have data for all 50 students enrolled in the course.
4. In each setting, would you generally expect a flexible or an inflexible statistical
machine learning method to perform better? Justify your answer.
(a) Sample size n is very small, and number of predictors p is very large.
(b) Sample size n is very large, and number of predictors p is very small.
(c) Relationship between predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. σ^2 = Var(ε), is extremely high.
5. This question has to do with the bias-variance decomposition.
(a) Make a sketch of typical (squared) bias, variance, training error, test error,
and Bayes (or irreducible) error curves, on a single plot, as we go from
less flexible statistical learning methods to more flexible approaches. The
x-axis should represent the amount of flexibility in the model, and the
y-axis should represent the values of each curve. There should be five
curves. Make sure to label each one.
(b) Explain why each of the five curves has the shape displayed in (a).
6. This exercise involves the Boston housing data set, which is part of the MASS
library in R.
(a) How many rows are in this data set? How many columns? What do the
rows and columns represent?
(b) Make some pairwise scatterplots of the predictors (columns) in this data
set. Describe your findings.
(c) Are any of the predictors associated with per capita crime rate? If so,
explain the relationship.
(d) Do any of the suburbs of Boston appear to have particularly high crime
rates? Tax rates? Pupil-teacher ratios? Comment on the range of each
predictor.
(e) How many of the suburbs in this data set bound the Charles river?
(f) What are the mean and standard deviation of the pupil-teacher ratio
among the towns in this data set?
(g) Which suburb of Boston has highest median value of owner-occupied
homes? What are the values of the other predictors for that suburb, and
how do those values compare to the overall ranges for those predictors?
Comment on your findings.
(h) In this data set, how many of the suburbs average more than six rooms per
dwelling? More than eight rooms per dwelling? Comment on the suburbs
that average more than eight rooms per dwelling.
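Most parts of this problem reduce to one-liners; a possible sketch, assuming the MASS package is installed (column names follow the Boston documentation):
library(MASS)          # contains the Boston data frame
dim(Boston)            # (a) rows (suburbs) and columns (variables); see ?Boston
pairs(Boston[, c("crim", "rm", "tax", "ptratio", "medv")])   # (b) a few pairwise plots
sum(Boston$chas == 1)  # (e) suburbs bounding the Charles river
mean(Boston$ptratio)   # (f) mean pupil-teacher ratio
sd(Boston$ptratio)     # (f) its standard deviation
which.max(Boston$medv) # (g) suburb with the highest median home value
sum(Boston$rm > 6)     # (h) suburbs averaging more than six rooms per dwelling
sum(Boston$rm > 8)     # (h) ... and more than eight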
1. Suppose we have a quantitative response Y, and a single feature X ∈ R. Let
RSS1 denote the residual sum of squares that results from fitting the model
Y = β0 + β1X + ε
using least squares. Let RSS12 denote the residual sum of squares that results
from fitting the model
Y = β0 + β1X + β2X^2 + ε
using least squares.
(a) Prove that RSS12 ≤ RSS1.
(b) Prove that the R^2 of the model containing just the feature X is no greater
than the R^2 of the model containing both X and X^2.
2. Describe the null hypotheses to which the p-values in Table 3.4 of the textbook correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and
newspaper, rather than in terms of the coefficients of the linear model.
3. Consider a linear model with just one feature,
Y = β0 + β1X + ε.
Suppose we have n observations from this model, (x1, y1), . . . , (xn, yn). The
least squares estimator is given in (3.4) of the textbook. Furthermore, we saw
in class that if we construct an n × 2 matrix X̃ whose first column is a vector of
1’s and whose second column is a vector with elements x1, . . . , xn, and if we let
y denote the vector with elements y1, . . . , yn, then the least squares estimator
takes the form
(β̂0, β̂1)^T = (X̃^T X̃)^(−1) X̃^T y.   (1)
Prove that (1) agrees with equation (3.4) of the textbook, i.e. β̂0 and β̂1 in (1)
equal β̂0 and β̂1 in (3.4).
4. This question involves the use of multiple linear regression on the Auto data
set, which is available as part of the ISLR library.
(a) Use the lm() function to perform a multiple linear regression with mpg as
the response and all other variables except name as the predictors. Use
the summary() function to print the results. Comment on the output. For
instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship
to the response?
iii. Provide an interpretation for the coefficient associated with the variable year.
Make sure that you treat the qualitative variable origin appropriately.
(b) Try out some models to predict mpg using functions of the variable horsepower.
Comment on the best model you obtain. Make a plot with horsepower
on the x-axis and mpg on the y-axis that displays both the observations
and the fitted function (i.e. f̂(horsepower)).
(c) Now fit a model to predict mpg using horsepower, origin, and an interaction between horsepower and origin. Make sure to treat the qualitative
variable origin appropriately. Comment on your results. Provide a careful interpretation of each regression coefficient.
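A possible skeleton for this problem; the quadratic in part (b) is only one example of a “function of horsepower”, and the object names are arbitrary:
library(ISLR)
Auto$origin <- factor(Auto$origin)              # treat origin as qualitative
# (a) all predictors except name
summary(lm(mpg ~ . - name, data = Auto))
# (b) e.g. a quadratic in horsepower, plotted over the observations
fit.poly <- lm(mpg ~ poly(horsepower, 2), data = Auto)
plot(Auto$horsepower, Auto$mpg, xlab = "horsepower", ylab = "mpg")
hp <- seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 200)
lines(hp, predict(fit.poly, data.frame(horsepower = hp)))
# (c) interaction between horsepower and origin
summary(lm(mpg ~ horsepower * origin, data = Auto))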
5. Consider fitting a model to predict credit card balance using income and
student, where student is a qualitative variable that takes on one of three
values: student ∈ {graduate, undergraduate, not student}.
(a) Encode the student variable using two dummy variables, one of which
equals 1 if student=graduate (and 0 otherwise), and one of which equals
1 if student=undergraduate (and 0 otherwise). Write out an expression
for a linear model to predict balance using income and student, using
this coding of the dummy variables. Interpret the coefficients in this linear
model.
(b) Now encode the student variable using two dummy variables, one of which
equals 1 if student=not student (and 0 otherwise), and one of which
equals 1 if student=graduate (and 0 otherwise). Write out an expression
for a linear model to predict balance using income and student, using
this coding of the dummy variables. Interpret the coefficients in this linear
model.
(c) Using the coding in (a), write out an expression for a linear model to predict balance using income, student, and an interaction between income
and student. Interpret the coefficients in this model.
(d) Using the coding in (b), write out an expression for a linear model to predict balance using income, student, and an interaction between income
and student. Interpret the coefficients in this model.
(e) Using simulated data for balance, income, and student, show that the
fitted values (predictions) from the models in (a)–(d) do not depend on
the coding of the dummy variables (i.e. the models in (a) and (b) yield
the same fitted values, as do the models in (c) and (d)).
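For part (e), a small simulation sketch; the data-generating numbers below are arbitrary, and the point is only that the two dummy codings span the same model:
set.seed(3)
n <- 90
income  <- runif(n, 20, 100)
student <- factor(sample(c("graduate", "undergraduate", "not student"), n, replace = TRUE))
balance <- 200 + 5 * income + 100 * (student == "graduate") + rnorm(n, sd = 50)
# coding (a): dummies for graduate and undergraduate
d.grad  <- as.numeric(student == "graduate")
d.under <- as.numeric(student == "undergraduate")
fit.a <- lm(balance ~ income + d.grad + d.under)
# coding (b): dummies for "not student" and graduate
d.not <- as.numeric(student == "not student")
fit.b <- lm(balance ~ income + d.not + d.grad)
max(abs(fitted(fit.a) - fitted(fit.b)))   # essentially 0: the fitted values agree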
6. Extra Credit. Consider a linear model with just one feature,
Y = β0 + β1X + ε,
with E(ε) = 0 and Var(ε) = σ^2. Suppose we have n observations from this
model, (x1, y1), . . . , (xn, yn). We assume that x1, . . . , xn are fixed, so the only
randomness in the model comes from ε1, . . . , εn. Use (3.4) in the textbook
(or, if you prefer, the matrix algebra formulation in (1) of this homework
assignment) in order to derive the expressions for Var(β̂0) and Var(β̂1) given
in (3.8) of the textbook.
1. A random variable X has an Exponential(λ) distribution if its probability density function is of the form
f(x) = λe^(−λx) for x > 0, and f(x) = 0 for x ≤ 0,
where λ > 0 is a parameter. Furthermore, the mean of an Exponential(λ)
random variable is 1/λ.
Now, consider a classification problem with K = 2 classes and a single feature
X ∈ R. If an observation is in class 1 (i.e. Y = 1) then X ∼ Exponential(λ1).
And if an observation is in class 2 (i.e. Y = 2) then X ∼ Exponential(λ2). Let
π1 denote the probability that an observation is in class 1, and let π2 = 1 − π1.
(a) Derive an expression for Pr(Y = 1 | X = x). Your answer should be in
terms of x, λ1, λ2, π1, π2.
(b) Write a simple expression for the Bayes classifier decision boundary, i.e.,
an expression for the set of x such that Pr(Y = 1 | X = x) = Pr(Y = 2 |
X = x).
(c) For part (c) only, suppose λ1 = 2, λ2 = 7, π1 = 0.5. Make a plot of
feature space. Clearly label:
i. the region of feature space corresponding to the Bayes classifier decision boundary,
ii. the region of feature space for which the Bayes classifier will assign
an observation to class 1,
iii. the region of feature space for which the Bayes classifier will assign
an observation to class 2.
(d) Now suppose that we observe n independent training observations,
(x1, y1), . . . ,(xn, yn).
Provide simple estimators for λ1, λ2, π1, π2, in terms of the training
observations.
(e) Given a test observation X = x0, provide an estimate of
P(Y = 1 | X = x0).
Your answer should be written only in terms of the n training observations
(x1, y1), . . . ,(xn, yn), and the test observation x0, and not in terms of any
unknown parameters.
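Once you have derived the posterior in (a), part (c) is a short plotting exercise; a possible sketch, using Bayes' theorem directly and dexp() for the exponential density (the grid of x values is arbitrary):
lambda1 <- 2; lambda2 <- 7; pi1 <- 0.5; pi2 <- 0.5
x <- seq(0.001, 3, length.out = 500)
# posterior Pr(Y = 1 | X = x) by Bayes' theorem, with dexp() as the class density
post1 <- pi1 * dexp(x, lambda1) /
         (pi1 * dexp(x, lambda1) + pi2 * dexp(x, lambda2))
plot(x, post1, type = "l", xlab = "x", ylab = "Pr(Y = 1 | X = x)")
abline(h = 0.5, lty = 2)   # decision boundary: where the posterior crosses 1/2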
2. We collect some data for students in a statistics class, with predictors X1 =
number of lectures attended, X2 = average number of hours studied per week,
and response Y = receive an A. We fit a logistic regression model, and get
coefficient estimates β̂0, β̂1, β̂2.
(a) Write out an expression for the probability that a student gets an A, as a
function of the number of lectures she attended, and the average number
of hours she studied per week. Your answer should be written in terms of
X1, X2, β̂0, β̂1, β̂2.
(b) Write out an expression for the minimum number of hours a student should
study per week in order to have at least an 80% chance of getting an A.
Your answer should be written in terms of X1, X2, β̂0, β̂1, β̂2.
(c) Based on a student’s value of X1 and X2, her predicted probability of
getting an A in this course is 60%. If she increases her studying by one
hour per week, then what will be her predicted probability of getting an
A in this course?
3. When the number of features p is large, there tends to be a deterioration in
the performance of K-nearest neighbors (KNN) and other approaches that
perform prediction using only observations that are near the test observation
for which a prediction must be made. This phenomenon is known as the curse
of dimensionality. We will now investigate this curse.
(a) Suppose that we have a set of observations, each with measurements on
p = 1 feature, X. We assume that X is uniformly distributed on [0, 1].
Associated with each observation is a response value. Suppose that we
wish to predict a test observation’s response using only observations that
are within 10% of the range of X closest to that test observation. For
instance, in order to predict the response for a test observation with X =
0.6, we will use observations in the range [0.55, 0.65]. On average, what
fraction of the available observations will we use to make the prediction?
(b) Now suppose that we have a set of observations, each with measurements
on p = 2 features, X1 and X2. We assume that (X1, X2) are uniformly distributed on [0, 1] × [0, 1]. We wish to predict a test observation’s response
using only observations that are within 10% of the range of X1 and within
10% of the range of X2 closest to that test observation. For instance, in
order to predict the response for a test observation with X1 = 0.6 and
X2 = 0.35, we will use observations in the range [0.55, 0.65] for X1 and
in the range [0.3, 0.4] for X2. On average, what fraction of the available
observations will we use to make the prediction?
(c) Now suppose that we have a set of observations on p = 100 features. Again
the observations are uniformly distributed on each feature, and again each
feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10% of each feature’s range
that is closest to that test observation. What fraction of the available
observations will we use to make the prediction?
(d) Using your answers to parts (a)-(c), argue that a drawback of KNN when
p is large is that there are very few training observations “near” any given
test observation.
(e) Now suppose that we wish to make a prediction for a test observation by
creating a p-dimensional hypercube centered around the test observation
that contains, on average, 10% of the training observations. For p = 1, 2,
and 100, what is the length of each side of the hypercube? Comment on
your answer.
Note: A hypercube is a generalization of a cube to an arbitrary number
of dimensions. When p = 1, a hypercube is simply a line segment; when
p = 2, it is a square.
4. Pick a data set of your choice. It can be chosen from the ISLR package (but
not one of the data sets explored in the Chapter 4 lab, please!), or it can
be another data set that you choose. Choose a binary qualitative variable in
your data set to be the response, Y . (By binary qualitative variable, I mean
a qualitative variable with K = 2 classes.) If your data set doesn’t have any
binary qualitative variables, then you can create one (e.g. by dichotomizing
a continuous variable: create a new variable that equals 1 or 0 depending on
whether the continuous variable takes on values above or below its median). I
suggest selecting a data set with n ≫ p.
(a) Describe the data. What are the values of n and p? What are you trying
to predict, i.e. what is the meaning of Y ? What is the meaning of the
features?
(b) Split the data into a training set and a test set. Perform LDA on the
training set in order to predict Y using the features. What is the training
error of the model obtained? What is the test error?
(c) Perform QDA on the training set in order to predict Y using the features.
What is the training error of the model obtained? What is the test error?
(d) Perform logistic regression on the training set in order to predict Y using
the features. What is the training error of the model obtained? What is
the test error?
(e) Perform KNN on the training set in order to predict Y using the features.
What is the training error of the model obtained? What is the test error?
(f) Comment on your results.
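A possible skeleton for parts (b)–(e); here dat, y, and train are placeholder names for your data frame, binary factor response, and training indicator, and k = 5 is just an arbitrary starting value:
library(MASS)    # lda(), qda()
library(class)   # knn()
# dat: data frame with binary factor response y and numeric features
# train: logical vector marking the training rows
err <- function(pred, truth) mean(pred != truth)
fit.lda <- lda(y ~ ., data = dat, subset = train)
err(predict(fit.lda, dat[!train, ])$class, dat$y[!train])    # LDA test error
fit.qda <- qda(y ~ ., data = dat, subset = train)
err(predict(fit.qda, dat[!train, ])$class, dat$y[!train])    # QDA test error
fit.glm <- glm(y ~ ., data = dat, subset = train, family = binomial)
p.hat <- predict(fit.glm, dat[!train, ], type = "response")
err(levels(dat$y)[1 + (p.hat > 0.5)], dat$y[!train])         # logistic test error
X <- scale(dat[, names(dat) != "y"])                         # standardize for KNN
err(knn(X[train, ], X[!train, ], dat$y[train], k = 5), dat$y[!train])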
1. Consider the validation set approach, with a 50/50 split into training and
validation sets:
(a) Suppose you perform the validation set approach twice, each time with a
different random seed. What’s the probability that an observation, chosen
at random, is in both of those training sets?
(b) If you perform the validation set approach repeatedly, will you get the
same result each time? Explain your answer.
2. Consider K-fold cross-validation:
(a) Consider the observations in the 1st fold’s training set, and the observations in the 2nd fold’s training set. What’s the probability that an
observation, chosen at random, is in both of those training sets?
(b) If you perform K-fold CV repeatedly, will you get the same result each
time? Explain your answer.
3. Now consider leave-one-out cross-validation:
(a) Consider the observations in the 1st fold’s training set, and the observations in the 2nd fold’s training set. What’s the probability that an
observation, chosen at random, is in both of those training sets?
(b) If you perform leave-one-out cross-validation repeatedly, will you get the
same result each time? Explain your answer.
4. Consider a very simple model,
Y = β + ε,
where Y is a scalar response variable, β ∈ R is an unknown parameter, and ε
is a noise term with E(ε) = 0, Var(ε) = σ^2. Our goal is to estimate β. Assume
that we have n observations with uncorrelated errors.
(a) Suppose that we perform least squares regression using all n observations.
Prove that the least squares estimator, β̂, equals (1/n) Σ_{i=1}^n Yi.
(b) Suppose that we perform least squares using all n observations. Prove
that the least squares estimator, β̂, has variance σ^2/n.
(c) Consider the least squares estimator of β fit using just n/2 observations.
What is the variance of this estimator?
(d) Consider the least squares estimator of β fit using n(K − 1)/K observations, for some K > 2. What is the variance of this estimator?
(e) Consider the least squares estimator of β fit using n − 1 observations.
What is the variance of this estimator?
(f) Derive an expression for E(β̂), where β̂ is the least squares estimator fit
using all n observations.
(g) Using your results from the earlier sections of this question, argue that the
validation set approach tends to over-estimate the expected test error.
(h) Using your results from the earlier sections of this question, argue that
leave-one-out cross-validation does not substantially over-estimate the expected test error, provided that n is large.
(i) Using your results from the earlier sections of this question, argue that
K-fold CV provides an over-estimate of the expected test error that is
somewhere between the big over-estimate resulting from the validation
set approach and the very mild over-estimate resulting from leave-one-out
CV.
5. As in the previous problem, assume
Y = β + ε,
where Y is a scalar response variable, β ∈ R is an unknown parameter, and ε
is a noise term with E(ε) = 0, Var(ε) = σ^2. Our goal is to estimate β. Assume
that we have n observations with uncorrelated errors.
(a) Suppose that we perform K-fold cross-validation. What is the correlation
between β̂1, the least squares estimator of β that we obtain from the 1st
fold, and β̂2, the least squares estimator of β that we obtain from the 2nd
fold?
(b) Suppose that we perform the validation set approach twice, each time
using a different random seed. Assume further that exactly 0.25n observations overlap between the two training sets. What is the correlation
between β̂1, the least squares estimator of β that we obtain the first time
that we perform the validation set approach, and β̂2, the least squares estimator of β that we obtain the second time that we perform the validation
set approach?
(c) Now suppose that we perform leave-one-out cross-validation. What is the
correlation between β̂1, the least squares estimator of β that we obtain
from the 1st fold, and β̂2, the least squares estimator of β that we obtain
from the 2nd fold?
Remark 1: Problem 5 indicates that the β̂’s that you estimate using LOOCV
are very correlated with each other.
Remark 2: You might remember from an earlier stats class that if X1, . . . , Xn
are uncorrelated with variance σ^2 and mean µ, then the variance of (1/n) Σ_{i=1}^n Xi
equals σ^2/n. But if Cov(Xi, Xk) = σ^2, then the variance of (1/n) Σ_{i=1}^n Xi is quite
a bit higher.
Remark 3: Together, problems 4 and 5 might give you some intuition for the
following: LOOCV results in an approximately unbiased estimator of expected
test error (if n is large), but this estimator has high variance. In contrast, K-fold
CV results in an estimator of expected test error that has higher bias, but
lower variance.
1. In this exercise, you will generate simulated data, and will use this data to
perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100,
and a noise vector ε of length n = 100.
(b) Generate a response vector Y of length n = 100 according to the model
Y = 3 − 2X + X^2 + ε.
(c) Use the regsubsets() function to perform best subset selection, considering X, X^2, . . . , X^7 as candidate predictors. Make a plot like Figure 6.2
in the textbook. What is the overall best model according to Cp, BIC,
and adjusted R^2? Report the coefficients of the best model obtained.
Comment on your results.
(d) Repeat (c) using forward stepwise selection instead of best subset selection.
(e) Repeat (c) using backward stepwise selection instead of best subset selection.
Hint: You may need to use the data.frame() function to create a single data
set containing both X and Y .
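A possible sketch of the workflow, assuming the leaps package; raw = TRUE keeps the powers of X interpretable as X, X^2, . . . , X^7:
library(leaps)   # regsubsets()
set.seed(4)
x   <- rnorm(100)
eps <- rnorm(100)
y   <- 3 - 2 * x + x^2 + eps
dat <- data.frame(y = y, x = x)
fit <- regsubsets(y ~ poly(x, 7, raw = TRUE), data = dat, nvmax = 7)
fs  <- summary(fit)
which.min(fs$cp); which.min(fs$bic); which.max(fs$adjr2)   # best size by each criterion
coef(fit, which.min(fs$bic))      # e.g. coefficients of the BIC choice
plot(fs$bic, type = "b", xlab = "number of predictors", ylab = "BIC")
# (d)/(e): the same call with method = "forward" or method = "backward"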
2. In class, we discussed the fact that if you choose a model using stepwise selection
on a data set, and then fit the selected model using least squares on the same
data set, then the resulting p-values output by R are highly misleading. We’ll
now see this through simulation.
(a) Use the rnorm() function to generate vectors X1, X2, . . . , X100 and ε, each
of length n = 1000. (Hint: use the matrix() function to create a 1000 ×
100 data matrix.)
(b) Generate data according to
Y = β0 + β1X1 + . . . + β100X100 + ε,
where β1 = . . . = β100 = 0.
(c) Fit a least squares regression model to predict Y using X1, . . . , Xp. Make a
histogram of the p-values associated with the null hypotheses H0j : βj = 0,
for j = 1, . . . , 100.
Hint: You can easily access these p-values using the command
(summary(lm(y~X)))$coef[,4].
(d) Recall that under H0j : βj = 0, we expect the p-values to have a Unif[0, 1]
distribution. In light of this fact, comment on your results in (c). Do any
of the features appear to be significantly associated with the response?
(e) Perform forward stepwise selection in order to identify M2, the best two-variable model. (For this problem, there is no need to calculate the best
model Mk for k ≠ 2.) Then fit a least squares regression model to the
data, using just the features in M2. Comment on the p-values obtained
for the coefficients.
(f) Now generate another 1000 observations by repeating the procedure in (a)
and (b). Using the new observations, fit a least squares linear model to
predict Y using just the features in M2 calculated in (e). (Do not perform
forward stepwise selection again using the new observations! Instead, take
the M2 obtained earlier in this problem.) Comment on the p-values for
the coefficients. How do they compare to the p-values in (e)?
(g) Are the features in M2 significantly associated with the response? Justify
your answer.
THE BOTTOM LINE: If you showed a friend the p-values obtained in (e),
without explaining that you obtained M2 by performing forward stepwise selection on this same data, then he or she might incorrectly conclude that the
features in M2 are highly associated with the response.
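A possible sketch of this simulation, taking β0 = 0 for simplicity (the problem leaves it unspecified):
set.seed(5)
n <- 1000; p <- 100
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                  # with beta_0 = beta_1 = ... = beta_100 = 0, Y is pure noise
# (c) p-values from the full least squares fit; should look roughly Unif[0, 1]
pvals <- summary(lm(y ~ X))$coef[-1, 4]
hist(pvals)
# (e) forward stepwise to find the best two-variable model M2
library(leaps)
fwd <- regsubsets(x = X, y = y, method = "forward", nvmax = 2)
M2  <- which(summary(fwd)$which[2, -1])    # indices of the two selected features
summary(lm(y ~ X[, M2]))$coef[, 4]         # these p-values look (misleadingly) small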
3. Let’s consider doing least squares and ridge regression under a very simple
setting, in which p = 1, and Σ_{i=1}^n yi = Σ_{i=1}^n xi = 0. We consider regression
without an intercept. (It’s usually a bad idea to do regression without an
intercept, but if our feature and response each have mean zero, then it is okay
to do this!)
(a) The least squares solution is the value of β ∈ R that minimizes
Σ_{i=1}^n (yi − βxi)^2.
Write out an analytical (closed-form) expression for this least squares
solution. Your answer should be a function of x1, . . . , xn and y1, . . . , yn.
Hint: Calculus!!
(b) For a given value of λ, the ridge regression solution minimizes
Σ_{i=1}^n (yi − βxi)^2 + λβ^2.
Write out an analytical (closed-form) expression for the ridge regression
solution, in terms of x1, . . . , xn and y1, . . . , yn and λ.
(c) Suppose that the true data-generating model is
Y = 3X + ε,
where ε has mean zero, and X is fixed (non-random). What is the expectation of the least squares estimator from (a)? Is it biased or unbiased?
(d) Suppose again that the true data-generating model is Y = 3X + ε, where
ε has mean zero, and X is fixed (non-random). What is the expectation of
the ridge regression estimator from (b)? Is it biased or unbiased? Explain
how the bias changes as a function of λ.
(e) Suppose that the true data-generating model is Y = 3X + ε, where ε
has mean zero and variance σ^2, and X is fixed (non-random), and also
Cov(εi, εi′) = 0 for all i ≠ i′. What is the variance of the least squares
estimator from (a)?
(f) Suppose that the true data-generating model is Y = 3X + ε, where ε
has mean zero and variance σ^2, and X is fixed (non-random), and also
Cov(εi, εi′) = 0 for all i ≠ i′. What is the variance of the ridge estimator
from (b)? How does the variance change as a function of λ?
(g) In light of your answers to parts (d) and (f), argue that λ in ridge regression allows us to control model complexity by trading off bias for variance.
Hint: For this problem, you might want to brush up on some basic properties
of means and variances! For instance, if Cov(Z, W) = 0, then Var(Z + W) =
Var(Z) + Var(W). And if a is a constant, then Var(aW) = a^2 Var(W), and
Var(a + W) = Var(W).
4. Suppose that you collect data to predict Y (height in inches) using X (weight
in pounds). You fit a least squares model to the data, and you get
Ŷ = 3.1 + 0.57X.
(a) Suppose you decide that you want to measure weight in ounces instead
of pounds. Write out the least squares model for predicting Y using
X̃ (weight in ounces). (You should calculate the coefficient estimates
explicitly.) Hint: there are 16 ounces in a pound!
(b) Consider fitting a least squares model to predict Y using X and X̃. Let β
denote the coefficient for X in the least squares model, and let β̃ denote
the coefficient for X̃. Argue that any equation of the form
Ŷ = 3.1 + βX + β̃X̃,
where β + 16β̃ = 0.57, is a valid least squares model.
(c) Suppose that you use ridge regression to predict Y using X, using some
value of λ, and obtain the fitted model
Ŷ = 3.1 + 0.4X.
Now consider fitting a ridge regression model to predict Y using X̃, again
using that same value of λ. Will the coefficient of X̃ be equal to 0.4/16,
greater than 0.4/16, or less than 0.4/16? Explain your answer.
(d) For the same value of λ considered in (c), suppose you perform ridge regression to predict Y using X, and separately you perform ridge regression
to predict Y using X̃. Which fitted model will have smaller residual sum
of squares (on the training set)? Explain your answer.
(e) Finally, suppose you use ridge regression to predict Y using X and X̃,
using some value of λ (not necessarily the same value of λ used in (d)),
and obtain the fitted model
Ŷ = 3.17 + 0.03X + 0.03X̃.
Is the following claim true or false? Explain your answer.
Claim: Any equation of the form
Ŷ = 3.17 + βX + β̃X̃,
where β + 16β̃ = 0.03 + 16 × 0.03 = 0.51, is a valid ridge regression solution
for that value of λ.
(f) Argue that your answers to the previous sub-problems support the following claim:
Claim: least squares is scale-invariant, but ridge regression is not.
5. Suppose we wish to fit a linear regression model using least squares. Let
M_k^BSS, M_k^FWD, M_k^BWD denote the best k-feature models in the best subset,
forward stepwise, and backward stepwise selection procedures. (For notational
details, see Algorithms 6.1, 6.2, and 6.3 of the textbook.)
Recall that the training set residual sum of squares (or RSS for short) is defined
as Σ_{i=1}^n (yi − ŷi)^2.
For each claim, fill in the blank with one of the following: “less than”, “less
than or equal to”, “greater than”, “greater than or equal to”, “equal to”. Say
“not enough information to tell” if it is not possible to complete the sentence
as given. Explain each of your answers.
(a) Claim: The RSS of M_1^FWD is ___ the RSS of M_1^BWD.
(b) Claim: The RSS of M_0^FWD is ___ the RSS of M_0^BWD.
(c) Claim: The RSS of M_1^FWD is ___ the RSS of M_1^BSS.
(d) Claim: The RSS of M_2^FWD is ___ the RSS of M_1^BSS.
(e) Claim: The RSS of M_1^BWD is ___ the RSS of M_1^BSS.
(f) Claim: The RSS of M_p^BWD is ___ the RSS of M_p^BSS.
(g) Claim: The RSS of M_(p−1)^BWD is ___ the RSS of M_(p−1)^BSS.
(h) Claim: The RSS of M_4^BWD is ___ the RSS of M_4^BSS.
(i) Claim: The RSS of M_4^BWD is ___ the RSS of M_4^FWD.
(j) Claim: The RSS of M_4^BWD is ___ the RSS of M_3^BWD.
6. This problem is extra credit!!!! Let y denote an n-vector of response values,
and let X denote an n × p design matrix. We can write the ridge regression
problem as
minimize over β ∈ R^p: ‖y − Xβ‖^2 + λ‖β‖^2,
where we are omitting the intercept for convenience. Derive an analytical
(closed-form) expression for the ridge regression estimator. Your answer should
be a function of X, y, and λ.
1. For this problem, you will analyze a data set of your choice, not taken from
the ISLR package. I suggest choosing a data set that has p ≈ n or even p > n,
since you will apply methods from Chapter 6 on this data.
(a) Describe the data in words. Where did you get it from, and what is the
data about? You will perform supervised learning on this data, so you
must identify a response, Y , and features, X1, . . . , Xp. What are the values
of n and p? Describe the response and the features (e.g. what are they
measuring; are they quantitative or qualitative?). Plot some summary
statistics of the data.
(b) Split the data into a training set and a test set. What are the values of n
and p on the training set?
(c) Fit a linear model using least squares on the training set, and report the
test error obtained.
(d) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
(e) Fit a lasso model on the training set, with λ chosen by cross-validation.
Report the test error obtained, along with the number of non-zero coefficient estimates.
(f) Fit a principal components regression model on the training set, with M
chosen by cross-validation. Report the test error obtained, along with the
value of M selected by cross-validation.
(g) Fit a partial least squares model on the training set, with M chosen by
cross-validation. Report the test error obtained, along with the value of
M selected by cross-validation.
(h) Comment on the results obtained. How accurate is the best model you
obtained, in terms of test error? Is there much difference among the test
errors resulting from these approaches? Which model do you prefer?
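A possible skeleton for parts (d)–(g), assuming the glmnet and pls packages; here x, y, and train are placeholder names for your feature matrix, response, and training row indices:
library(glmnet)   # ridge and lasso
library(pls)      # pcr() and plsr()
# x: numeric feature matrix; y: response; train: vector of training row indices
x.tr <- x[train, ]; y.tr <- y[train]
x.te <- x[-train, ]; y.te <- y[-train]
cv.ridge <- cv.glmnet(x.tr, y.tr, alpha = 0)              # (d) ridge, lambda by CV
mean((predict(cv.ridge, x.te, s = "lambda.min") - y.te)^2)
cv.lasso <- cv.glmnet(x.tr, y.tr, alpha = 1)              # (e) lasso
mean((predict(cv.lasso, x.te, s = "lambda.min") - y.te)^2)
sum(coef(cv.lasso, s = "lambda.min") != 0)                # non-zero coefficients
fit.pcr <- pcr(y ~ x, subset = train, scale = TRUE, validation = "CV")   # (f)
validationplot(fit.pcr, val.type = "MSEP")                # read off the CV-chosen M
# (g): plsr() with the same arguments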
2. Define the basis functions b1(X) = I(−1 < X ≤ 1) − (2X − 1)I(1 < X ≤ 3),
b2(X) = (X + 1)I(3 < X ≤ 5) − I(5 < X ≤ 6). We fit the linear regression
model
Y = β0 + β1b1(X) + β2b2(X) + ε,
and obtain coefficient estimates β̂0 = 2, β̂1 = −1, β̂2 = 2. Sketch the estimated
curve between X = −3 and X = 8. Note the intercepts, slopes, and other
relevant information.
1. For this problem, you will analyze a data set of your choice, not taken from the
ISLR package. Choose a data set that has n ≫ p, since you will apply methods
from Chapter 7 to this data. You will also need to have p > 1. Throughout this
problem, make sure to label your axes appropriately, and to include legends
when needed.
(a) Describe the data in words. Where did you get it from, and what is
the data about? You will perform supervised learning on this data, so
you must identify a response, Y , and features, X1, . . . , Xp. What are the
values of n and p? Describe the response and the features (e.g. what are
they measuring; are they quantitative or qualitative?).
(b) Fit a generalized additive model, Y = f1(X1) + . . . + fp(Xp) + ε. Use
cross-validation to choose the level of complexity. For j = 1, . . . , p, make
a scatterplot of Xj against Y, and plot f̂j(Xj). Comment on your results
and on the choices you made in fitting this model.
(c) Now fit a linear model, Y = β0 + β1X1 + . . . + βpXp + ε. For j = 1, . . . , p,
display the linear fit (Xj β̂j) on top of a scatterplot of Xj against Y.
(d) Estimate the test error of the generalized additive model and the test error
of the linear model. Comment on your results. Which approach gives a
better fit to the data?
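A possible skeleton for parts (b) and (c), assuming the gam package from the Chapter 7 lab; dat, y, x1, x2, x3 are placeholder names, and df = 4 is a placeholder smoothness to be chosen by cross-validation:
library(gam)   # gam() with smoothing splines, as in the Chapter 7 lab
# dat: data frame with quantitative response y and predictors x1, x2, x3
fit.gam <- gam(y ~ s(x1, df = 4) + s(x2, df = 4) + s(x3, df = 4), data = dat)
par(mfrow = c(1, 3))
plot(fit.gam, se = TRUE)                     # (b) one fitted function per feature
fit.lm <- lm(y ~ x1 + x2 + x3, data = dat)   # (c) the linear competitor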
2. In this problem, we’ll play around with regression splines.
(a) Generate data as follows:
set.seed(7)
x <- 1:1000
y <- sin((1:1000)/100)*4+rnorm(1000)
Consider the model
Y = f(X) + ε.
What is the form of f(X) for this simulation setting? What is the value
of Var(ε)? What is the value of E(Y − f(X))^2?
(b) Fit regression splines for various numbers of knots to this simulated data,
in order to get spline fits ranging from very wiggly to very smooth. Make
a plot of your results, showing the raw data, the true function f(X), and
the spline fits. Be sure to include a legend containing relevant information,
and to label the axes appropriately.
(c) Based on visual inspection, how many knots seem to give the “best” fit?
Explain your answer.
(d) Now perform cross-validation in order to select the optimal number of
knots. What is the “best” number of knots? Make a plot displaying the
raw data, the true function f(X), and the spline fit f̂(X) that uses the
number of knots selected by cross-validation. Be sure to include a legend
and to label the axes appropriately. Comment on your results.
(e) Provide an estimate of the test error, E(Y − f̂(X))^2, associated with the
spline f̂(·) from (d). How does this relate to your answer in (a)?
(f) Now fit a linear model of the form
Y = β0 + β1X + ε
to the data instead. Plot the raw data and the fitted model and the true
function f(·). Provide an estimate of the test error associated with the
fitted model. Comment on your results.
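A possible sketch for part (b), fitting cubic regression splines with the splines package; the knot counts shown are arbitrary choices spanning smooth to wiggly:
library(splines)   # bs() for regression splines
x <- 1:1000
f <- sin(x / 100) * 4            # the true f(X) from part (a)
y <- f + rnorm(1000)
plot(x, y, col = "grey", xlab = "X", ylab = "Y")
lines(x, f, lwd = 2)
ks <- c(2, 10, 50)               # interior knot counts, from smooth to wiggly
for (i in seq_along(ks)) {
  fit <- lm(y ~ bs(x, df = ks[i] + 3))   # for a cubic spline, df = knots + 3
  lines(x, fitted(fit), col = i + 1, lty = 2)
}
legend("topright", legend = c("true f", paste(ks, "knots")),
       col = 1:4, lty = c(1, 2, 2, 2), bty = "n")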
1. In this problem, you will fit some models to a data set of your choice.
(a) Find a very large data set of your choice (large n, possibly large p). Select
one quantitative variable to be your response, Y ∈ R. Describe the data.
(b) Grow a very big regression tree to the data. Plot the tree, and report its
residual sum of squares (RSS) on the (training) data.
(c) Now use cost-complexity pruning to prune the tree to have 6 leaves. Plot
the pruned tree, and report its RSS on the (training) data. How does this
compare to the RSS obtained in (b)? Explain your answer.
(d) Perform cross-validation to estimate the test error, as the tree is pruned
using cost-complexity pruning. Plot the estimated test error, as a function
of tree size. The tree size should be on the x-axis and the estimated test
error should be on the y-axis.
(e) Plot the “best” tree (with size chosen by cross-validation in (d)), fit to all
of the data. Report its RSS on the (training) data.
(f) Perform bagging, and estimate its test error.
(g) Fit a random forest, and estimate its test error.
(h) Which method (regression tree, bagging, random forest) results in the
smallest estimated test error? Comment on your results.
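A possible skeleton for parts (b)–(g), assuming the tree and randomForest packages; dat and y are placeholder names, and mindev = 0.001 is just one way to force a large tree:
library(tree)           # tree(), cv.tree(), prune.tree()
library(randomForest)   # bagging and random forests
# dat: data frame with quantitative response y
big <- tree(y ~ ., data = dat,
            control = tree.control(nrow(dat), mindev = 0.001))   # (b) a large tree
plot(big); text(big, pretty = 0)
sum(residuals(big)^2)                  # training RSS
pruned <- prune.tree(big, best = 6)    # (c) cost-complexity pruning to 6 leaves
cv.out <- cv.tree(big)                 # (d) CV error as the tree is pruned
plot(cv.out$size, cv.out$dev, type = "b",
     xlab = "tree size", ylab = "CV deviance")
p <- ncol(dat) - 1
bag <- randomForest(y ~ ., data = dat, mtry = p)   # (f) bagging: mtry = p
rf  <- randomForest(y ~ ., data = dat)             # (g) random forest, default mtry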
2. In this problem, we will consider fitting a regression tree to some data with
p = 2.
(a) Find a data set with n large, p = 2 features, and Y ∈ R. It’s OK to just
use the data from Question 1 with just two of the features.
(b) Grow a regression tree with 8 terminal nodes. Plot the tree.
(c) Now make a plot of feature space, showing the partition corresponding to
the tree in (b). The axes should be X1 and X2. Your plot should contain
vertical and horizontal line segments indicating the regions corresponding
to the leaves in the tree from (b). Superimpose a scatterplot of the n
observations onto this plot. This should look something like Figure 8.2 in
the textbook. Label each region with the prediction for that region.
Note: If you want, you can plot the horizontal and vertical line segments in (c)
by hand (instead of figuring out how to plot them in R).
3. This problem has to do with bagging.
(a) Consider a single regression tree with just two terminal nodes (leaves).
Suppose that the single internal node splits on X1 < c. If X1 < c then a
prediction of 13.9 is made; if X1 ≥ c then a prediction of 3.4 is made. Write
out an expression for f(·) in the regression model Y = f(X1, . . . , Xp) + ε
corresponding to this tree.
(b) Now suppose you bag some regression trees, each of which contains just
two terminal nodes (leaves). Show that this results in an additive model,
i.e. a model of the form
Y = Σ_{j=1}^p fj(Xj) + ε.
(c) Now suppose you perform bagging with larger regression trees, each of
which has at least three terminal nodes (leaves). Does this result in an
additive model? Explain your answer.
4. If you’ve paid attention in class, then you know that in statistics, there is no
free lunch: depending on the form of the function f(·) in the regression model
Y = f(X1, . . . , Xp) + ε,
a given statistical machine learning algorithm might work very well, or not well
at all. You will now demonstrate this in a simulation with p = 2 and n = 1000.
(a) Generate X1, X2, and as
x1 <- sample(seq(0,10,len=1000))
x2 <- sample(seq(0,10,len=1000))
eps <- rnorm(1000)
If you generate Y according to the model Y = f(X1, X2) + ε, then what
will be the value of the irreducible error?
(b) Give an example of a function f(·) for which a least squares regression
model fit to (x1, y1), . . . ,(xn, yn) can be expected to outperform a regression tree fit to (x1, y1), . . . ,(xn, yn), in terms of expected test error. Explain why you expect the least squares regression model to work better
for this choice of f(·).
(c) Now calculate Y = f(X1, X2) + ε in R using the x1, x2, eps generated in
(a), and the function f(·) specified in (b). Estimate the test error for a
least squares regression model, and the test error for a regression tree (for
a number of values of tree size), and display the results in a plot. The
plot should show tree size on the horizontal axis and estimated test error
on the vertical axis; the estimated test error for the linear model should
be plotted as a horizontal line (since it isn’t a function of tree size). Your
result should agree with your intuition from (b).
(d) Now repeat (b), but this time find a function for which the regression tree
can be expected to outperform the least squares model.
(e) Now repeat (c), this time using the function from (d).
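A possible sketch for parts (b) and (c); the linear f chosen here is only one possible answer to (b), and the seed and range of tree sizes are arbitrary:
library(tree)
set.seed(6)
x1 <- sample(seq(0, 10, len = 1000))
x2 <- sample(seq(0, 10, len = 1000))
eps <- rnorm(1000)
y <- 2 + 3 * x1 - x2 + eps             # a linear f: one possible answer to (b)
dat <- data.frame(y, x1, x2)
train <- sample(1000, 500)
fit.lm <- lm(y ~ x1 + x2, data = dat, subset = train)
mse.lm <- mean((predict(fit.lm, dat[-train, ]) - dat$y[-train])^2)
big <- tree(y ~ x1 + x2, data = dat, subset = train,
            control = tree.control(500, mindev = 0.001))
sizes <- 2:15
mse.tree <- sapply(sizes, function(sz) {
  pr <- prune.tree(big, best = sz)
  mean((predict(pr, dat[-train, ]) - dat$y[-train])^2)
})
plot(sizes, mse.tree, type = "b", xlab = "tree size", ylab = "estimated test error")
abline(h = mse.lm, lty = 2)            # linear-model error, constant in tree size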
