Lecture 8: Linear regression (2) CS 189 (CDSS offering)
2022/02/04
Today's lecture
Back to linear regression!
Why? We have some new tools to play around with: MVGs and MAP estimation
Remember: we already motivated several linear regression formulations, such as least squares, from the perspective of MLE
Today, we will introduce two more formulations from the perspective of MAP estimation: ridge regression and LASSO regression
Time permitting, we will introduce the concept of (polynomial) featurization
Recap: MAP estimation
What is the MAP estimate if p(θ) = N(θ; 0, σ²I)?
arg max_θ Σᵢ log p(yᵢ | xᵢ, θ) + log p(θ) = arg min_θ Σᵢ [−log p(yᵢ | xᵢ, θ)] + (1/(2σ²)) ‖θ‖₂² (dropping terms that do not depend on θ)
Note that we do not learn the prior p(θ) from the data; we set it directly
This is an example of a regularizer: typically, something we add to the loss function that doesn't depend on the data
regularizers are crucial for combating overfitting
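As a quick check (not on the original slides), here is a minimal numpy sketch showing that the negative log density of a zero-mean Gaussian prior N(θ; 0, σ²I) differs from the ℓ2 penalty ‖θ‖₂²/(2σ²) only by a constant that does not depend on θ; the dimension and σ² below are arbitrary.

```python
import numpy as np

# Illustrative check: -log N(theta; 0, sigma^2 I) equals ||theta||^2 / (2 sigma^2)
# plus a constant that does not depend on theta, so both give the same arg min.
rng = np.random.default_rng(0)
d, sigma2 = 5, 2.0
theta = rng.normal(size=d)

neg_log_prior = 0.5 * theta @ theta / sigma2 + 0.5 * d * np.log(2 * np.pi * sigma2)
l2_penalty = 0.5 * theta @ theta / sigma2

print(neg_log_prior - l2_penalty)  # constant: 0.5 * d * log(2*pi*sigma2)
```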
Ridge regression
We can apply this idea (p(w) = N(w; 0, σ²I), or equivalently, ℓ2-regularization) to least squares linear regression
What will be the resulting learning problem?
Remember the least squares linear regression problem: arg min_w ‖Xw − y‖₂²
Here, we have θ = w
So adding ℓ2-regularization results in: arg min_w ‖Xw − y‖₂² + λ‖w‖₂²
This is the ridge regression problem setup
Solving ridge regression
Objective: arg min_w ‖Xw − y‖₂² + λ‖w‖₂²
Let's rewrite this slightly: arg min_w wᵀXᵀXw − 2yᵀXw + λwᵀw = arg min_w wᵀ(XᵀX + λI)w − 2yᵀXw
Now let's take the gradient and set it equal to zero: 2(XᵀX + λI)w_MAP − 2Xᵀy = 0
So w_MAP = (XᵀX + λI)⁻¹Xᵀy
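A minimal numpy sketch of the closed-form solution above (illustrative, with synthetic data; the function name ridge_fit is our own):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w_MAP = (X^T X + lam*I)^{-1} X^T y.

    Solving the linear system is preferred over forming the inverse explicitly.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.1))  # close to w_true
```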
The ridge solution vs the least squares solution
The ridge solution differs in adding a λI term inside the inverted expression
What does this do?
Intuitively, it makes the resulting solution smaller in magnitude, which makes sense given the zero-mean prior on w
Numerically, it can fix underdetermined (or ill-conditioned) problems!
Recall that the least squares solution we found needs XᵀX to be invertible, and this is not the case when the columns of X are not linearly independent
Adding λI for any λ > 0 guarantees invertibility, and even relatively small λ can generally make the problem better conditioned (easier for a computer to solve)
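To see the invertibility point concretely, here is a small illustrative example (not from the slides) where the columns of X are exactly collinear, so XᵀX is singular but XᵀX + λI is not:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, 2 * x1])      # second column is a multiple of the first
y = x1 + 0.1 * rng.normal(size=50)

print(np.linalg.matrix_rank(X.T @ X))  # 1: X^T X is singular, least squares is underdetermined

lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(w_ridge)                         # well defined even though X^T X is not invertible
```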
Selecting λ
Think for a minute: can we choose λ the same way we learn w? That is, can we do something like arg min_{w, λ≥0} ‖Xw − y‖₂² + λ‖w‖₂²?
We can't, because this would always set λ to 0!
λ is an example of a hyperparameter: a parameter that is not learned, but instead we have to set it ourselves
Learning hyperparameters with the same objective often leads to overfitting
We will talk more about how to select hyperparameters in a few weeks
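One common recipe, sketched below with an arbitrary held-out split and an arbitrary λ grid, is to fit w on a training split for each candidate λ and keep the λ with the lowest validation error; treat this as an illustration, not the course's prescribed procedure.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=200)

# Hold out part of the data: fit on the training split, score each lambda on the validation split.
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
val_errors = []
for lam in candidates:
    w = ridge_fit(X_tr, y_tr, lam)
    val_errors.append(np.mean((X_val @ w - y_val) ** 2))

best_lam = candidates[int(np.argmin(val_errors))]
print(best_lam, val_errors)
```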
LASSO regression
What if we instead choose p(θᵢ) = Laplace(θᵢ; 0, b) for all i?
p(θᵢ) = (1/(2b)) exp(−|θᵢ|/b), so log p(θ) = −(1/b) Σᵢ |θᵢ| + const
arg max_θ Σⱼ log p(yⱼ | xⱼ, θ) + log p(θ)
= arg min_w ‖Xw − y‖₂² + λ Σᵢ |wᵢ| = arg min_w ‖Xw − y‖₂² + λ‖w‖₁ (with θ = w as before)
We can replace the combined constant (involving the noise variance and the prior scale b) with a single λ
How did I get this? Work through the same steps as in the ridge derivation
The LASSO regression solution
LASSO corresponds to ℓ1-regularization, which tends to induce sparse solutions
LASSO does not have an analytical (set the gradient to zero) solution
Most commonly, LASSO is solved with proximal gradient methods, which are covered in optimization courses
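As a rough illustration of the idea (our own sketch, not course-provided code), here is a minimal proximal gradient (ISTA) loop for the objective ‖Xw − y‖₂² + λ‖w‖₁; the step size uses the Lipschitz constant of the smooth term, and the iteration count and λ below are arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Proximal gradient descent (ISTA) on ||Xw - y||^2 + lam * ||w||_1."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)               # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[[0, 3]] = [2.0, -1.5]                       # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=5.0), 2))      # most coordinates end up exactly zero
```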
How powerful are linear models anyway? Or, the importance of featurization
Linear models, by themselves, might not seem that useful
However, they can be, if we use the right set of features
That is, instead of using x directly, we work with a featurization φ(x) that may be better suited to the data
Everything else stays the same: just replace x with φ(x) everywhere
We will talk extensively in this class about different featurizations, both hand-designed (when we talk about kernels) and learned (neural networks)
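A small illustrative sketch (our own example, not from the slides): a degree-5 polynomial featurization of 1-D inputs, plugged into the same ridge solution as before with Φ in place of X.

```python
import numpy as np

def poly_featurize(x, degree):
    """Map each scalar x to phi(x) = [1, x, x^2, ..., x^degree]."""
    return np.column_stack([x ** k for k in range(degree + 1)])

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = np.sin(3 * x) + 0.1 * rng.normal(size=100)   # clearly not linear in x

Phi = poly_featurize(x, degree=5)
lam = 1e-3
# Same ridge solution as before, just with Phi in place of X.
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print(np.round(w, 2))
```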
A classic example of featurization
(figure: decision boundary illustration from the Advanced Machine Learning Specialization, Coursera; this example is for linear classification, not regression)