Predictive Analytics
Week 7: Linear Methods for Regression I
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
QBUS2820 content structure
1. Statistical and Machine Learning foundations and applications.
2. Advanced regression methods.
3. Classification methods.
4. Time series forecasting.
2/52
Week 7: Linear Methods for Regression I
1. Introduction
2. Variable selection
3. Regularisation methods
4. Discussion
Reading: Chapters 6.1 and 6.2 of ISL.
Exercise questions: Chapter 6.8 of ISL, Q1, Q2, Q3, and Q4.
3/52
Introduction
Linear Methods for Regression
In this lecture we focus again on the linear regression model for
prediction. We move beyond OLS to consider other estimation
methods.
The motivation for studying these methods is that using many
predictors in a linear regression model typically leads to overfitting.
We will therefore accept some bias in order to reduce variance.
4/52
Linear regression (review)
Consider the additive error model
Y = f(x) + \varepsilon.
The linear regression model is a special case based on a regression
function of the form
f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p.
5/52
OLS (review)
In the OLS method, we select the coefficient values that minimise
the residual sum of squares
\hat{\beta}^{\text{ols}} = \arg\min_{\beta} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
We obtain the formula
\hat{\beta}^{\text{ols}} = (X^T X)^{-1} X^T y.
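As a quick illustration of the closed-form expression above, the following Python sketch computes the OLS coefficients directly with numpy (the synthetic data and all names are illustrative assumptions, not part of the original slides):

import numpy as np

# Synthetic data: N observations, p predictors (illustrative only)
rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, 0.5, -2.0, 0.0])      # intercept first
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=N)

# Add a column of ones so the intercept is estimated with the slopes
X1 = np.column_stack([np.ones(N), X])

# OLS closed form: beta_hat = (X'X)^{-1} X'y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
beta_ols = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta_ols)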
6/52
MLR model (review)
1. Linearity: if X = x, then
Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon
for some population parameters β0, β1, . . . , βp and a random
error ε.
2. The conditional mean of ε given X is zero, E(ε | X) = 0.
3. Constant error variance: Var(ε | X) = σ².
4. Independence: the observations are independent.
5. The distribution of X1, . . . , Xp is arbitrary.
6. There is no perfect multicollinearity (no column of X is a
linear combination of other columns).
7/52
OLS properties (review)
Under Assumptions 1 (the regression function is correctly
specified) and 2 (there are no omitted variables that are correlated
with the predictors), the OLS estimator is unbiased
E(\hat{\beta}^{\text{ols}}) = \beta.
8/52
Why are we not satisfied with OLS?
Prediction accuracy. Low bias (if the linearity assumption is
approximately correct), but potentially high variance. We can
improve performance by setting some coefficients to zero or
shrinking them.
Interpretability. A regression estimated with too many predictors
and high variance is hard or impossible to interpret. In order to
understand the big picture, we are willing to sacrifice some of the
small details.
9/52
Linear model selection and regularisation
Variable selection. Identify a subset of k < p predictors to use.
Estimate the model by using OLS on the reduced set of variables.
Regularisation (shrinkage). Fit a model involving all the p
predictors, but shrink the coefficients towards zero relative to OLS.
Depending on the type of shrinkage, some estimated coefficients
may be zero, in which case the method also performs variable
selection.
Dimension reduction. Construct a set of m < p predictors which
are linear combinations of the original predictors. Fit the model
by OLS on these new predictors.
10/52
Variable selection
Best subset selection (key concept)
The best subset selection method estimates all possible models
and selects the best one according to a model selection criterion
(AIC, BIC, or cross validation).
Given p predictors, there are 2^p possible models to choose from.
11/52
Best subset selection
For example, if p = 3 we would estimate 2^3 = 8 models:
k = 0:  Y = \beta_0 + \varepsilon
k = 1:  Y = \beta_0 + \beta_1 x_1 + \varepsilon
        Y = \beta_0 + \beta_2 x_2 + \varepsilon
        Y = \beta_0 + \beta_3 x_3 + \varepsilon
k = 2:  Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
        Y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \varepsilon
        Y = \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon
k = 3:  Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon
12/52
Best subset selection
Algorithm: Best subset selection
1: Estimate the null model M_0, which contains only the constant.
2: for k = 1, 2, . . . , p do
3:   Fit all (p choose k) possible models with exactly k predictors.
4:   Pick the model with the lowest RSS and call it M_k.
5: end for
6: Select the best model among M_0, M_1, . . . , M_p according to
   cross validation, AIC, or BIC.
13/52
Computational considerations
The best subset method suffers from a problem of combinatorial
explosion, since it requires the estimation of 2^p different models.
The computational requirement is therefore very high, except in
low dimensions.
For example, for p = 30 we would need to fit a little over 1 billion
models! Best subset selection has a very high computational cost
and is infeasible in practice for p larger than around 40.
14/52
Stepwise selection
Stepwise selection methods are a family of search algorithms that
find promising subsets by sequentially adding or removing
regressors, dramatically reducing the computational cost compared
to estimating all possible specifications.
Conceptually, they are an approximation to best subset selection,
not different methods.
15/52
Forward selection
Algorithm: Forward selection
1: Estimate the null model M_0, which contains only the constant.
2: for k = 1, 2, . . . , p do
3:   Fit all the p - k + 1 models that add one predictor to M_{k-1}.
4:   Choose the best of the p - k + 1 models in terms of RSS and
     call it M_k.
5: end for
6: Select the best model among M_0, M_1, . . . , M_p according to
   cross validation, AIC, or BIC.
16/52
Backward selection
Algorithm: Backward selection
1: Estimate the full model M_p by OLS.
2: for k = p - 1, . . . , 1, 0 do
3:   Fit all the k + 1 models that delete one predictor from M_{k+1}.
4:   Choose the best of the k + 1 models in terms of RSS and call
     it M_k.
5: end for
6: Select the best model among M_0, M_1, . . . , M_p according to
   cross-validation, AIC, or BIC.
17/52
Stepwise selection
Compared to best subset selection, the forward and backward
stepwise algorithms reduce the number of estimations from 2^p
to 1 + p(p + 1)/2. For example, for p = 30 the number of
fitted models is 466.
The disadvantage is that the final model selected by stepwise
selection is not guaranteed to optimise any selection criterion
among the 2^p possible models.
18/52
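A minimal Python sketch of the forward selection algorithm above, with a Gaussian-likelihood AIC used for the final choice among M_0, . . . , M_p (the synthetic data, the helper names, and the exact AIC formula are illustrative assumptions, not the course code):

import numpy as np

def rss(X, y):
    # Residual sum of squares from an OLS fit of y on X (X already includes a constant)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

def forward_selection(X, y):
    # Forward stepwise selection: build M0, M1, ..., Mp, then pick one by AIC
    n, p = X.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    path = [([], rss(ones, y))]                      # M0: constant only
    for k in range(1, p + 1):
        # Among the p - k + 1 remaining predictors, add the one with lowest RSS
        best_j, best_rss = min(
            ((j, rss(np.column_stack([ones, X[:, selected + [j]]]), y))
             for j in remaining),
            key=lambda t: t[1],
        )
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_rss))
    # Gaussian-likelihood AIC (up to constants): n*log(RSS/n) + 2*(k + 1)
    aic = [n * np.log(r / n) + 2 * (len(s) + 1) for s, r in path]
    return path[int(np.argmin(aic))][0]

# Illustrative use on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)
print(forward_selection(X, y))                       # typically selects columns 0 and 3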
Variable selection
Advantages
Accuracy relative to OLS. It tends to lead to better predictions
compared to estimating a model with all predictors.
Interpretability. The final model is a linear regression model
based on a reduced set of predictors.
Disadvantages
Computational cost.
By making binary decisions to include or exclude particular
variables, variable selection may exhibit higher variance than
regularisation and dimension reduction approaches.
19/52
Illustration: Equity Premium Prediction (OLS)
Quarterly data from Goyal and Welch (2008).
Response: quarterly S&P 500 returns minus treasury bill rate
Predictors (lagged by one quarter):
1. dp    Dividend to price ratio
2. dy    Dividend yield
3. ep    Earnings per share
4. bm    Book-to-market ratio
5. ntis  Net equity expansion
6. tbl   Treasury bill rate
7. ltr   Long term rate of return on US bonds
8. tms   Term spread
9. dfy   Default yield spread
10. dfr  Default return spread
11. infl Inflation
12. ik   Investment to capital ratio
20/52
Illustration: Equity Premium Prediction
OLS Regression Results
==============================================================================
Dep. Variable: ret              R-squared:          0.108
Model: OLS                      Adj. R-squared:     0.051
Method: Least Squares           F-statistic:        1.901
Date:                           Prob (F-statistic): 0.0421
Time:                           Log-Likelihood:     -629.21
No. Observations: 184           AIC:                1282.
Df Residuals: 172               BIC:                1321.
Df Model: 11
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 26.1369 14.287 1.829 0.069 -2.064 54.337
dp 0.3280 8.247 0.040 0.968 -15.951 16.607
dy 3.3442 7.941 0.421 0.674 -12.330 19.019
ep 0.3133 2.345 0.134 0.894 -4.315 4.942
bm -3.2443 6.719 -0.483 0.630 -16.507 10.018
ntis -46.9566 38.911 -1.207 0.229 -123.762 29.848
tbl -2.8651 20.922 -0.137 0.891 -44.162 38.432
ltr 10.2432 14.468 0.708 0.480 -18.314 38.800
tms 13.1083 11.129 1.178 0.240 -8.859 35.076
dfy -156.8202 213.943 -0.733 0.465 -579.111 265.471
dfr 71.0710 29.099 2.442 0.016 13.634 128.508
infl -36.9489 82.870 -0.446 0.656 -200.521 126.623
ik -208.4868 242.844 -0.859 0.392 -687.824 270.851
==============================================================================
21/52
Illustration: Equity Premium Prediction
We select the following models in the equity premium dataset
based on the AIC:
Best subset selection: (dy, bm, tms, dfr)
Forward selection: (ik, tms, dfr)
Backward selection: (dy, tms, dfr)
22/52
Illustration: Equity Premium Prediction
Table 1: Equity Premium Prediction Results
              Train R2   Test R2
OLS            0.108      0.014
Best Subset    0.095      0.038
Forward        0.083      0.042
Backward       0.084      0.060
23/52
Illustration: Equity Premium Prediction (OLS)
24/52
Wrong ways to do variable selection
Adjusted R2. The adjusted R2 has no justification as a model
selection criterion. It does not sufficiently penalise additional
predictors.
Removing statistically insignificant predictors. A statistically
significant coefficient means we can reliably say that it is not
exactly zero. This has almost nothing to do with prediction (see
the regression output slide). Furthermore, there are multiple
testing issues.
25/52
Regularisation methods
Regularisation methods (key concept)
Regularisation or shrinkage methods for linear regression follow
the general framework of empirical risk minimisation:
\hat{\beta} = \arg\min_{\beta} \; \left[ \sum_{i=1}^{N} L\big(y_i, f(x_i; \beta)\big) \right] + \lambda\, C(\beta).
Here, the loss function is the squared loss and the complexity
function will be the norm of the vector of regression coefficients β.
The choice of norm leads to different regularisation properties.
26/52
Ridge regression (key concept)
The ridge regression method solves the penalised estimation
problem
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\},
for a tuning parameter λ ≥ 0.
The penalty term λ‖β‖₂² has the effect of shrinking the coefficients
relative to OLS. We refer to this procedure as ℓ2 regularisation.
27/52
Ridge regression
The ridge estimator has an equivalent formulation as a constrained
minimisation problem
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
subject to
\sum_{j=1}^{p} \beta_j^2 \le t,
for some t > 0.
28/52
Practical details
1. The hyperparameters λ or t control the amount of shrinkage.
There is a one-to-one connection between them.
2. We do not penalise the intercept. In practice, we center the
response and the predictors before computing the solution and
estimate the intercept as β̂₀ = ȳ.
3. The method is not invariant to the scale of the inputs. We
standardise the predictors before solving the minimisation
problem.
29/52
Ridge regression
We can write the minimisation problem in matrix form as
\min_{\beta} \; (y - X\beta)^T (y - X\beta) + \lambda\, \beta^T \beta.
Relying on the same techniques that we used to derive the OLS
estimator, we can show the ridge estimator has the formula
\hat{\beta}^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y.
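A minimal numpy sketch of this closed-form ridge solution, following the centring and standardisation conventions from the practical details slide (the synthetic data and the value of λ are illustrative assumptions):

import numpy as np

def ridge_closed_form(X, y, lam):
    # Ridge coefficients via (X'X + lam*I)^{-1} X'y on centred, standardised data
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise predictors
    yc = y - y.mean()                           # centre the response
    p = Xs.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    intercept = y.mean()                        # intercept estimated as y-bar
    return intercept, beta

# Illustrative use
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -0.5, 0.0, 2.0]) + rng.normal(size=100)
print(ridge_closed_form(X, y, lam=10.0))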
30/52
Orthonormal vectors
We say that two vectors u and v are orthonormal when
‖u‖ = √(uᵀu) = 1,   ‖v‖ = √(vᵀv) = 1,   and   uᵀv = 0.
We say that the design matrix X is orthonormal when all its
columns are orthonormal.
31/52
Ridge regression: shrinkage (key concept)
If the design matrix X were orthonormal, the ridge estimate would
just be a scaled version of the OLS estimate:
\hat{\beta}^{\text{ridge}} = (I + \lambda I)^{-1} X^T y = \frac{1}{1 + \lambda}\, \hat{\beta}^{\text{ols}}
In a more general situation, we can say that the ridge regression
method will shrink together the coefficients of correlated predictors.
32/52
Ridge regression
We define the ridge shrinkage factor as
s(\lambda) = \frac{\|\hat{\beta}^{\text{ridge}}\|_2}{\|\hat{\beta}^{\text{ols}}\|_2},
for a given λ or t.
The next slide illustrates the effect of varying the shrinkage factor
on the estimated parameters.
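The profiles on the next slide can be reproduced in outline with a sketch along these lines, which traces the ridge coefficients and the shrinkage factor s(λ) over a grid of penalties (the data and the grid are illustrative assumptions, not the course code):

import numpy as np

def ridge_path(X, y, lambdas):
    # Ridge coefficients and shrinkage factors s(lambda) over a grid of penalties
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    beta_ols = np.linalg.solve(Xs.T @ Xs, Xs.T @ yc)
    coefs, shrinkage = [], []
    for lam in lambdas:
        b = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
        coefs.append(b)
        shrinkage.append(np.linalg.norm(b) / np.linalg.norm(beta_ols))  # s(lambda)
    return np.array(coefs), np.array(shrinkage)

# Illustrative use: plot coefs against s to mimic a coefficient profile plot
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0, 1.5]) + rng.normal(size=120)
coefs, s = ridge_path(X, y, lambdas=np.logspace(-2, 4, 50))
print(coefs.shape, s[:5])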
33/52
Ridge coefficient profiles (equity premium data)
34/52
Selecting λ
The ridge regression method leads to a range of models for
different values of λ. We select λ by cross validation or generalised
cross validation (GCV).
GCV is computationally convenient for this model.
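In practice a library routine can perform this search; a hedged sketch using scikit-learn's RidgeCV, which by default uses an efficient leave-one-out scheme closely related to GCV (the data and the penalty grid are illustrative assumptions; scikit-learn calls the penalty alpha rather than λ):

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

# Illustrative data standing in for the equity premium predictors
rng = np.random.default_rng(4)
X = rng.normal(size=(184, 12))
y = X @ rng.normal(size=12) + rng.normal(size=184)

Xs = StandardScaler().fit_transform(X)      # standardise predictors first
alphas = np.logspace(-3, 4, 100)            # grid for the penalty
model = RidgeCV(alphas=alphas).fit(Xs, y)   # efficient leave-one-out search by default
print(model.alpha_)                         # selected penalty
print(model.coef_)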
35/52
Selecting λ (equity premium data)
36/52
The Lasso
The Lasso (least absolute shrinkage and selection operator)
method solves the penalised estimation problem
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\},
for a tuning parameter λ ≥ 0.
The Lasso therefore performs ℓ1 regularisation.
37/52
The Lasso
The equivalent formulation of the lasso as a constrained
minimisation problem is
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
subject to
\sum_{j=1}^{p} |\beta_j| \le t,
for some t > 0.
38/52
The Lasso: shrinkage and variable selection (key concept)
Shrinkage. As with ridge regression, the lasso shrinks the
coefficients towards zero. However, the nature of this shrinkage is
different, as we will see below.
Variable selection. In addition to shrinkage, the lasso also
performs variable selection. With λ sufficiently large, some
estimated coefficients will be exactly zero, leading to sparse
models. This is a key difference from ridge.
39/52
The Lasso: variable selection property
Estimation picture for the lasso (left) and ridge regression (right):
40/52
Practical details
1. We select the tuning parameter λ by cross validation.
2. As with ridge, we center and standardise the predictors before
computing the solution.
3. There is no closed form solution for the lasso coefficients.
Computing the lasso solution is a quadratic programming
problem.
4. There are efficient algorithms for computing an entire path of
solutions for a range of λ values.
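A hedged sketch of these points using scikit-learn's Lasso and lasso_path, which use coordinate descent and scale the squared loss by 1/(2N), so the alpha grid is only proportional to the λ above (the data and the grid are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Lasso, lasso_path
from sklearn.preprocessing import StandardScaler

# Illustrative sparse problem
rng = np.random.default_rng(5)
X = rng.normal(size=(150, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(size=150)
Xs = StandardScaler().fit_transform(X)      # standardise predictors
yc = y - y.mean()                           # centre the response

# A single fit: some coefficients are exactly zero (variable selection)
fit = Lasso(alpha=0.5).fit(Xs, yc)
print(fit.coef_)

# An entire path of solutions over a grid of penalties
alphas, coefs, _ = lasso_path(Xs, yc, alphas=np.logspace(-3, 1, 50))
print(alphas.shape, coefs.shape)            # (50,), (10, 50)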
41/52
The Lasso
We define the shrinkage factor for a given value of λ (or t) as
s(\lambda) = \frac{\sum_{j=1}^{p} |\hat{\beta}^{\text{lasso}}_j|}{\sum_{j=1}^{p} |\hat{\beta}^{\text{ols}}_j|}.
The next slide illustrates the effect of varying the shrinkage factor
on the estimated parameters.
42/52
Lasso coefficient profiles (equity premium data)
43/52
Model selection for the equity premium data
44/52
Discussion
Subset selection, ridge, and lasso: comparison in the orthonormal
case (optional)
Estimator              Formula
Best subset (size k)   β̂_j · I(|β̂_j| ≥ |β̂_(k)|)
Ridge                  β̂_j / (1 + λ)
Lasso                  sign(β̂_j) (|β̂_j| - λ)₊
Estimators of β_j in the case of orthonormal columns of X, written in
terms of the OLS estimates β̂_j; β̂_(k) denotes the kth largest OLS
coefficient in absolute value and (a)₊ = max(a, 0).
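A small numpy sketch applying the three transformations in the table to a toy vector of OLS estimates (all numbers are illustrative, not equity premium output):

import numpy as np

beta_ols = np.array([3.0, -1.2, 0.4, -0.1])    # hypothetical OLS estimates
lam, k = 0.5, 2                                # penalty and subset size (illustrative)

# Best subset of size k: keep the k largest coefficients in absolute value
threshold = np.sort(np.abs(beta_ols))[-k]
best_subset = beta_ols * (np.abs(beta_ols) >= threshold)

# Ridge: proportional shrinkage
ridge = beta_ols / (1.0 + lam)

# Lasso: soft thresholding
lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

print(best_subset, ridge, lasso, sep="\n")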
45/52
Ridge and Lasso: comparison in the orthonormal case (optional)
[Figure: coefficient estimates for ridge (left panel) and the lasso (right
panel) plotted against the least squares estimates in the orthonormal case.]
46/52
Which method to use?
Recall the no free lunch theorem: neither ridge regression nor
the lasso universally outperforms the other. The choice of
method should be data driven.
In general terms, we can expect the lasso to perform better
when a small subset of predictors has important coefficients,
while the remaining predictors have small or zero
coefficients (sparse problems).
Ridge regression will tend to perform better when the
predictors all have similar importance.
The lasso may have better interpretability since it can lead to
a sparse solution.
47/52
Elastic Net
The elastic net is a compromise between ridge regression and the
lasso:
\hat{\beta}^{\text{EN}} = \arg\min_{\beta} \; \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big) \right\},
for λ ≥ 0 and 0 < α < 1.
The elastic net performs variable selection like the lasso, and
shrinks together the coefficients of correlated predictors like ridge
regression.
48/52
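A hedged scikit-learn sketch of an elastic net fit; note that scikit-learn parameterises the penalty with alpha and l1_ratio rather than the (λ, α) pair above, so the two match only up to reparameterisation (the data and tuning grids are illustrative assumptions):

import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Illustrative data with a few non-zero coefficients
rng = np.random.default_rng(6)
X = rng.normal(size=(184, 12))
y = X @ np.array([1.5, 1.4, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, -0.5]) + rng.normal(size=184)
Xs = StandardScaler().fit_transform(X)

# ElasticNetCV tunes both the overall penalty (alpha) and the l1/l2 mix (l1_ratio)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], alphas=np.logspace(-3, 1, 50), cv=10)
model.fit(Xs, y - y.mean())
print(model.alpha_, model.l1_ratio_)   # selected tuning parameters
print(model.coef_)                     # some entries may be exactly zero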
Illustration: equity premium data
Estimated coefficients (tuning parameters selected by leave-one-out CV)
        OLS      Ridge    Lasso    EN
dp      0.566    0.159    0.000    0.111
dy      0.602    0.197    0.000    0.153
ep      0.942    0.116    0.000    0.048
bm     -1.055    0.033    0.000    0.000
ntis   -0.276   -0.067   -0.000   -0.000
tbl    -0.489   -0.248   -0.000   -0.178
ltr     0.597    0.186    0.000    0.124
tms     0.762    0.286    0.161    0.239
dfy     0.145    0.031    0.000    0.000
dfr     1.570    0.377    0.131    0.294
infl   -0.202   -0.214   -0.000   -0.150
ik     -0.408   -0.318   -0.422   -0.282
49/52
Illustration: equity premium data
Prediction results
              Train R2   Test R2
OLS            0.108      0.014
Ridge          0.054      0.033
Lasso          0.033      0.011
Elastic Net    0.050      0.029
50/52
Comparison with variable selection
Regularisation methods have two important advantages over
variable selection.
1. They are continuous procedures, generally leading to lower
variance.
2. The computational cost is not much larger than OLS.
51/52
Review questions
What is best subset selection?
What are stepwise methods?
What are the advantages and disadvantages of variable selection?
What are the penalty terms in the ridge and Lasso methods?
What are the key differences in type of shrinkage between the
ridge and Lasso methods?
In what situations would we expect the ridge or lasso methods
to perform better?
52/52