1 Introduction
In this homework you will investigate regression with $\ell_1$ regularization, both implementation techniques and theoretical properties. On the methods side, you'll work on coordinate descent (the "shooting algorithm"), homotopy methods, and [optionally] projected SGD. On the theory side you'll derive the largest $\ell_1$ regularization parameter you'll ever need to try, and optionally you'll derive the explicit solution to the coordinate minimizers used in coordinate descent, investigate what happens with ridge and lasso regression when you have two copies of the same feature, and work out the details of the classic picture that explains why $\ell_1$ regularization leads to sparsity.
1.1 Data Set and Programming Problem Overview
For the experiments, we are generating some artificial data using code in the file setup_problem.py. We are considering the regression setting with the 1-dimensional input space $\mathbb{R}$. An image of the training data, along with the target function (i.e. the Bayes prediction function for the square loss), is shown in Figure 1 below.
You can examine how the target function and the data were generated by looking at setup_problem.py.
The figure can be reproduced by running the LOAD_PROBLEM branch of the main function.
As you can see, the target function is a highly nonlinear function of the input. To handle this sort of problem with linear hypothesis spaces, we will need to create a set of features that perform nonlinear transforms of the input. A detailed description of the technique we will use can be found in the Jupyter notebook basis-fns.ipynb, included in the zip file.
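As a toy illustration of this idea (and not the provided featurize function, whose construction is described in basis-fns.ipynb), one could map a 1-D input to binary indicator features at fixed thresholds; the threshold grid on $[0, 1]$ below is made up for the sake of the example:

```python
import numpy as np

def toy_featurize(x, num_thresholds=100):
    """Map 1-D inputs to binary step features 1[x >= t_k] at fixed thresholds (toy example)."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return (np.asarray(x).reshape(-1, 1) >= thresholds).astype(float)

X_toy = toy_featurize(np.array([0.05, 0.5, 0.95]))
print(X_toy.shape)  # (3, 100): each row is a binary feature vector for one input
```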
In this assignment, we are providing you with a function that takes care of the featurization. This is the featurize function, returned by the generate_problem function in setup_problem.py. The generate_problem function also gives the true target function, which has been constructed to be a sparse linear combination of our features. The coefficients of this linear combination are also provided by generate_problem, so you can compare the coefficients of the linear functions you find to the target function coefficients. The generate_problem function also gives you the train and validation sets that you should use.
Figure 1: Training data and target function we will be considering in this assignment.
To get familiar with using the data, and perhaps to learn some techniques, it's recommended that you work through the main() function of the included file ridge_regression.py. You'll go through the following steps (on your own; no need to submit):
- Load the problem from disk into memory with load_problem.
- Use the featurize function to map from a one-dimensional input space to a d-dimensional feature space.
- Visualize the design matrix of the featurized data. (All entries are binary, so we will not do any data normalization or standardization in this problem, though you may experiment with that on your own.)
- Take a look at the class RidgeRegression. Here we've implemented our own RidgeRegression using the general-purpose optimizer provided by scipy.optimize. This is primarily to introduce you to the sklearn framework, if you are not already familiar with it. It can help with hyperparameter tuning, as we will see shortly. (A minimal sketch of this estimator pattern appears after this list.)
- Take a look at compare_our_ridge_with_sklearn. In this function, we want to get some evidence that our implementation is correct, so we compare to sklearn's ridge regression. Comparing the outputs of two implementations is not always trivial: often the objective functions are slightly different, so you may need to think a bit about how to compare the results. In this case, sklearn uses the total square loss rather than the average square loss, so we needed to account for that. Here we get an almost exact match with sklearn, because ridge regression is a rather easy objective function to optimize. You may not get as exact a match for other objective functions, even if both methods are correct.
- Next take a look at do_grid_search, in which we demonstrate how to take advantage of the fact that we've wrapped our ridge regression in an sklearn Estimator to do hyperparameter tuning. It's a little tricky to get GridSearchCV to use the train/test split that you want, but an approach is demonstrated in this function (and a sketch using PredefinedSplit appears after this list). In the line assigning the param_grid variable, you can see my attempts at doing hyperparameter search on a different problem. Below you will be modifying this (or using some other method, if you prefer) to find the optimal L2 regularization parameter for the data provided.
- Next is some code to plot the results of the hyperparameter search.
- Next we want to visualize some prediction functions. We plot the target function, along with several prediction functions corresponding to different regularization parameters, as functions of the original input space $\mathbb{R}$, along with the training data. Next we visualize the coefficients of each feature with bar charts. Take note of the scale of the y-axis, as it may vary substantially by default.
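For reference, here is a minimal sketch of the estimator pattern mentioned above: an sklearn-compatible ridge regression whose fit method minimizes the average-square-loss ridge objective with scipy.optimize.minimize. It is only an illustration of the pattern, not the provided implementation; the class name MyRidge and parameter name l2reg are invented for this sketch.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, RegressorMixin

class MyRidge(BaseEstimator, RegressorMixin):
    """Minimal ridge regression fit via a general-purpose optimizer (illustrative)."""

    def __init__(self, l2reg=1.0):
        self.l2reg = l2reg

    def fit(self, X, y):
        n, d = X.shape

        def objective(w):
            # Average square loss plus L2 penalty, matching the objective in Section 2.
            residual = X @ w - y
            return residual @ residual / n + self.l2reg * (w @ w)

        self.w_ = minimize(objective, np.zeros(d)).x
        return self

    def predict(self, X):
        return X @ self.w_
```

And here is one way (a sketch, not necessarily the approach taken in do_grid_search) to force GridSearchCV to evaluate on a fixed validation set, using sklearn's PredefinedSplit. The variables X_train, y_train, X_val, y_val are assumed to hold the featurized train and validation data, sklearn's Ridge stands in for the estimator, and the grid of alpha values is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in for our own ridge estimator
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Stack train and validation sets; rows marked -1 are never used for validation,
# rows marked 0 form the single validation fold.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
test_fold = np.concatenate([-np.ones(len(y_train)), np.zeros(len(y_val))])

param_grid = {"alpha": np.logspace(-5, 1, 13)}  # illustrative grid of L2 penalties
grid = GridSearchCV(
    Ridge(),
    param_grid,
    cv=PredefinedSplit(test_fold),
    scoring="neg_mean_squared_error",
    refit=False,
)
grid.fit(X_all, y_all)
print(grid.best_params_, -grid.best_score_)  # best lambda and its validation MSE
```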
2 Ridge Regression
In the problems below, you do not need to implement ridge regression. You may use any of the code provided in the assignment, or you may use other packages. However, your results must correspond to the ridge regression objective function that we use, namely
$$J(w) = \frac{1}{n}\sum_{i=1}^{n} \left(w^T x_i - y_i\right)^2 + \lambda \|w\|_2^2 .$$
- Run ridge regression on the provided training dataset. Choose the $\lambda$ that minimizes the empirical risk (i.e. the average square loss) on the validation set. Include a table of the parameter values you tried and the validation performance for each. Also include a plot of the results.
- Now we want to visualize the prediction functions. On the same axes, plot the following: the training data, the target function, an unregularized least squares fit (still using the featurized data), and the prediction function chosen in the previous problem. Next, along the lines of the bar charts produced by the code in compare_parameter_vectors, visualize the coefficients for each of the prediction functions plotted, including the target function. Describe the patterns, including the scale of the coefficients, as well as which coefficients have the most weight.
- For the chosen $\lambda$, examine the model coefficients. For ridge regression, we don't expect any parameters to be exactly 0. However, let's investigate whether we can predict the sparsity pattern of the true parameters (i.e. which parameters are 0 and which are nonzero) by thresholding the parameter estimates we get from ridge regression. We'll predict that $w_i = 0$ if $|w_i| < \epsilon$ and $w_i \neq 0$ otherwise. Give the confusion matrix for $\epsilon = 10^{-6}, 10^{-3}, 10^{-1}$, and any other thresholds you would like to try. (A small code sketch of this thresholding follows this list.)
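For the thresholding above, something along the following lines may help. This is a sketch: w_target and w_ridge are placeholders for the target coefficients returned by generate_problem and your fitted ridge coefficients, respectively.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sparsity_confusion(w_true, w_est, eps):
    """Confusion matrix comparing the true zero/nonzero pattern to the thresholded estimate.

    Rows are the actual pattern (zero, nonzero); columns are the predicted pattern.
    """
    actual_nonzero = (w_true != 0)
    predicted_nonzero = (np.abs(w_est) >= eps)
    return confusion_matrix(actual_nonzero, predicted_nonzero)

for eps in [1e-6, 1e-3, 1e-1]:
    print("epsilon =", eps)
    print(sparsity_confusion(w_target, w_ridge, eps))  # placeholders for your coefficient vectors
```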
3 Coordinate Descent for Lasso (a.k.a. The Shooting algorithm)
The Lasso optimization problem can be formulated as
$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^{m} \left(h_w(x_i) - y_i\right)^2 + \lambda \|w\|_1 ,$$
where $h_w(x) = w^T x$ and $\lambda \ge 0$. Note that, to align with Murphy's formulation below, and for historical reasons, we are using the total square loss, rather than the average square loss, in the objective function.
Since the $\ell_1$-regularization term in the objective function is non-differentiable, it's not immediately clear how gradient descent or SGD could be used to solve this optimization problem directly. (In fact, as we'll see in the next homework on SVMs, we can use subgradient methods when the objective function is not differentiable, in addition to the two methods discussed in this homework assignment.)
Another approach to solving optimization problems is coordinate descent, in which at each step we optimize over one component of the unknown parameter vector, fixing all other components. The descent path so obtained is a sequence of steps, each of which is parallel to a coordinate axis in Rd, hence the name. It turns out that for the Lasso optimization problem, we can find a closed form solution for optimization over a single component fixing all other components. This gives us the following algorithm, known as the shooting algorithm:
(See Algorithm 13.1 of Murphy, Kevin P., Machine Learning: A Probabilistic Perspective, MIT Press, 2012.) The soft thresholding function is defined as
$$\operatorname{soft}(a, \delta) = \operatorname{sign}(a)\,\big(|a| - \delta\big)_+ ,$$
for any $a, \delta \in \mathbb{R}$.
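In code, the soft-thresholding function and the resulting coordinate update are short. Below is a sketch, using the $a_j$ and $c_j$ defined in Section 3.2 and handling the all-zero-column case mentioned in the note that follows; treat it as a starting point to check against your own implementation, not a reference solution.

```python
import numpy as np

def soft_threshold(a, delta):
    """soft(a, delta) = sign(a) * (|a| - delta)_+"""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def shooting_pass(X, y, w, lam):
    """One cyclic pass of coordinate descent for the Lasso objective (sketch). Updates w in place."""
    for j in range(X.shape[1]):
        X_j = X[:, j]
        a_j = 2 * X_j @ X_j
        c_j = 2 * X_j @ (y - X @ w + w[j] * X_j)
        if a_j == 0.0:
            w[j] = 0.0  # column of zeros: see the note below
        else:
            w[j] = soft_threshold(c_j / a_j, lam / a_j)
    return w
```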
NOTE: Algorithm 13.1 does not account for the case $a_j = c_j = 0$, which occurs when the jth column of $X$ is identically 0. One can either eliminate the column (as it cannot possibly help the solution), or set $w_j = 0$ in that case, since it is, as you can easily verify, the coordinate minimizer. Note also that Murphy suggests initializing the optimization with the ridge regression solution. Although theoretically this is not necessary (with exact computations and enough time, coordinate descent will converge for lasso from any starting point), in practice it's helpful to start as close to the solution as we're able.
There are a few tricks that can make selecting the hyperparameter $\lambda$ easier and faster. First, as we'll see in a later problem, you can show that for any $\lambda \ge 2\|X^T(y - \bar{y})\|_\infty$, the estimated weight vector $\hat{w}$ is entirely zero, where $\bar{y}$ is the mean of the values in the vector $y$, and $\|\cdot\|_\infty$ is the infinity norm (or supremum norm), which is the maximum over the absolute values of the components of a vector. Thus we need to search for an optimal $\lambda$ in $[0, \lambda_{\max}]$, where $\lambda_{\max} = 2\|X^T(y - \bar{y})\|_\infty$. (Note: This expression for $\lambda_{\max}$ assumes we have an unregularized bias term in our model. That is, our decision functions are of the form $h_{w,b}(x) = w^T x + b$. In our experiments, we do not have an unregularized bias term, so we should use $\lambda_{\max} = 2\|X^T y\|_\infty$.)
The second trick is to use the fact that when $\lambda$ and $\lambda'$ are close, the corresponding solutions $\hat{w}(\lambda)$ and $\hat{w}(\lambda')$ are also close. Start with $\lambda = \lambda_{\max}$, for which we know $\hat{w}(\lambda_{\max}) = 0$. You can run the optimization anyway, and initialize the optimization at $w = 0$. Next, $\lambda$ is reduced (e.g. by a constant factor close to 1), and the optimization problem is solved using the previous optimal point as the starting point. This is called warm starting the optimization. The technique of computing a set of solutions for a chain of nearby $\lambda$'s is called a continuation or homotopy method. The resulting set of parameter values $\hat{w}(\lambda)$, as $\lambda$ ranges over $[0, \lambda_{\max}]$, is known as a regularization path.
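Below is a sketch of this warm-started loop. The solver shooting_lasso(X, y, lam, w_init) is a hypothetical name standing in for the function you will write in Section 3.1; the shrink factor and number of steps are illustrative, and $\lambda_{\max} = 2\|X^Ty\|_\infty$ matches the no-bias setting described above.

```python
import numpy as np

def lasso_path(X, y, shooting_lasso, num_lambdas=30, decay=0.8):
    """Warm-started regularization path from lambda_max downward (illustrative sketch).

    shooting_lasso(X, y, lam, w_init) -> w  is assumed to be your coordinate descent solver.
    """
    # No unregularized bias term in our experiments, so lambda_max = 2 * ||X^T y||_inf.
    lam = 2 * np.max(np.abs(X.T @ y))
    w = np.zeros(X.shape[1])  # w(lambda_max) = 0
    path = []
    for _ in range(num_lambdas):
        w = shooting_lasso(X, y, lam, w_init=w)  # warm start from the previous solution
        path.append((lam, w.copy()))
        lam *= decay  # shrink lambda by a constant factor
    return path
```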
3.1 Experiments with the Shooting Algorithm
- The algorithm as described above is not ready for a large dataset (at least if it is implemented in Python) because of the implied loop in the summation signs in the expressions for $a_j$ and $c_j$. Give an expression for computing $a_j$ and $c_j$ using matrix and vector operations, without explicit loops. This is called vectorization, and can lead to dramatic speedups when implemented in languages such as Python, Matlab, and R. Write your expressions using $X$, $w$, $y = (y_1, \ldots, y_n)^T$ (the column vector of responses), $X_j$ (the jth column of $X$, represented as a column matrix), and $w_j$ (the jth coordinate of $w$, a scalar).
- Write a function that computes the Lasso solution for a given $\lambda$ using the shooting algorithm described above. For convergence criteria, continue coordinate descent until a pass through the coordinates reduces the objective function by less than $10^{-8}$, or until you have taken 1000 passes through the coordinates. Compare the performance of cyclic coordinate descent to randomized coordinate descent, where in each round we pass through the coordinates in a different random order (for your choices of $\lambda$). Compare also the solutions attained (following the convergence criteria above) for starting at 0 versus starting at the ridge regression solution suggested by Murphy (again, for your choices of $\lambda$). If you like, you may adjust the convergence criteria to try to attain better results (or the same results faster).
- Run your best Lasso configuration on the training dataset provided, and select the $\lambda$ that minimizes the square error on the validation set. Include a table of the parameter values you tried and the validation performance for each. Also include a plot of these results. Include also a plot of the prediction functions, just as in the ridge regression section, but this time add the best performing Lasso prediction function and remove the unregularized least squares fit. Similarly, add the lasso coefficients to the bar charts of coefficients generated in the ridge regression setting. Comment on the results, with particular attention to parameter sparsity and how the ridge and lasso solutions compare. What's the best model you found, and what's its validation performance?
- Implement the homotopy method described above. Compute the Lasso solution for (at least) the regularization parameters in the set. Plot the results (average validation loss vs. $\lambda$).
- [Optional] Note that the data in Figure 1 is almost entirely nonnegative. Since we don't have an unregularized bias term, we have to pay for this offset using our penalized parameters. Note also that $\lambda_{\max}$ would decrease significantly if the y values were 0-centered (using the training data, of course), or if we included an unregularized bias term. Experiment with one or both of these approaches, for both lasso and ridge regression, and report your findings.
3.2 [Optional] Deriving the Coordinate Minimizer for Lasso
This problem is to derive the expressions for the coordinate minimizers used in the shooting algorithm. This is often done using subgradients (slide 15), but here we will take a bare-hands approach (which is essentially equivalent).
In each step of the shooting algorithm, we would like to find the $w_j$ minimizing
$$f(w_j) = \sum_{i=1}^{n} \Big( w_j x_{ij} + \sum_{k \neq j} w_k x_{ik} - y_i \Big)^2 + \lambda |w_j| + \lambda \sum_{k \neq j} |w_k| ,$$
where we've written $x_{ij}$ for the jth entry of the vector $x_i$. This function is convex in $w_j$. The only thing keeping $f$ from being differentiable is the term with $|w_j|$. So $f$ is differentiable everywhere except $w_j = 0$. We'll break this problem into 3 cases: $w_j > 0$, $w_j < 0$, and $w_j = 0$. In the first two cases, we can simply differentiate $f$ w.r.t. $w_j$ to get optimality conditions. For the last case, we'll use the fact that since $f : \mathbb{R} \to \mathbb{R}$ is convex, 0 is a minimizer of $f$ iff
$$\lim_{\varepsilon \downarrow 0} \frac{f(\varepsilon) - f(0)}{\varepsilon} \ge 0 \quad\text{and}\quad \lim_{\varepsilon \downarrow 0} \frac{f(-\varepsilon) - f(0)}{\varepsilon} \ge 0 .$$
This is a special case of the optimality conditions described in slide 6 here, where now the direction $v$ is simply taken to be the scalars $1$ and $-1$, respectively.
- First let's get a trivial case out of the way. If $x_{ij} = 0$ for $i = 1, \ldots, n$, what is the coordinate minimizer $w_j$? In the remaining questions below, you may assume that $\sum_{i=1}^{n} x_{ij}^2 \neq 0$.
- Give an expression for the derivative $f'(w_j)$ for $w_j \neq 0$. It will be convenient to write your expression in terms of the following definitions:
$$\operatorname{sign}(w_j) := \begin{cases} 1 & w_j > 0 \\ 0 & w_j = 0 \\ -1 & w_j < 0 \end{cases}, \qquad a_j := 2\sum_{i=1}^{n} x_{ij}^2 , \qquad c_j := 2\sum_{i=1}^{n} x_{ij}\Big( y_i - \sum_{k \neq j} w_k x_{ik} \Big).$$
- If $w_j > 0$ and minimizes $f$, show that $w_j = \frac{1}{a_j}(c_j - \lambda)$. Similarly, if $w_j < 0$ and minimizes $f$, show that $w_j = \frac{1}{a_j}(c_j + \lambda)$. Give conditions on $c_j$ that imply that a minimizer $w_j$ is positive, and conditions for which a minimizer $w_j$ is negative.
- Derive expressions for the two one-sided derivatives of $f$ at 0, and show that $c_j \in [-\lambda, \lambda]$ implies that $w_j = 0$ is a minimizer.
- Putting together the preceding results, we conclude that the coordinate minimizer is
$$w_j = \begin{cases} \frac{1}{a_j}(c_j + \lambda) & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ \frac{1}{a_j}(c_j - \lambda) & c_j > \lambda . \end{cases}$$
Show that this is equivalent to the soft-thresholding expression $w_j = \operatorname{soft}\!\big(\tfrac{c_j}{a_j}, \tfrac{\lambda}{a_j}\big)$ used in Section 3.
4 Lasso Properties
4.1 Deriving $\lambda_{\max}$
In this problem we will derive an expression for $\lambda_{\max}$. For the first three parts, use the Lasso objective function excluding the bias term, i.e., $J(w) = \|Xw - y\|_2^2 + \lambda\|w\|_1$. We will show that for any $\lambda \ge 2\|X^T y\|_\infty$, the estimated weight vector $\hat{w}$ is entirely zero, where $\|\cdot\|_\infty$ is the infinity norm (or supremum norm), which is the maximum absolute value of any component of the vector.
- The one-sided directional derivative of $f(x)$ at $x$ in the direction $v$ is defined as
$$f'(x; v) := \lim_{h \downarrow 0} \frac{f(x + hv) - f(x)}{h} .$$
Compute $J'(0; v)$. That is, compute the one-sided directional derivative of $J(w)$ at $w = 0$ in the direction $v$. [Hint: the result should be in terms of $X$, $y$, $\lambda$, and $v$.]
- Since the Lasso objective is convex, $w$ is a minimizer of $J(w)$ if and only if the directional derivative $J'(w; v) \ge 0$ for all $v \neq 0$. Show that for any $v \neq 0$, we have $J'(0; v) \ge 0$ if and only if $\lambda \ge C$, for some $C$ that depends on $X$, $y$, and $v$. You should have an explicit expression for $C$.
- In the previous problem, we get a different lower bound on $\lambda$ for each choice of $v$. Show that the maximum of these lower bounds on $\lambda$ is $\lambda_{\max} = 2\|X^T y\|_\infty$. Conclude that $\hat{w} = 0$ is a minimizer of $J(w)$ if and only if $\lambda \ge 2\|X^T y\|_\infty$.
- [Optional] Let $J(w, b) = \|Xw + b\mathbf{1} - y\|_2^2 + \lambda\|w\|_1$, where $\mathbf{1} \in \mathbb{R}^n$ is a column vector of 1's. Let $\bar{y}$ be the mean of the values in the vector $y$. Show that $(\hat{w}, \hat{b}) = (0, \bar{y})$ is a minimizer of $J(w, b)$ if and only if $\lambda \ge \lambda_{\max} = 2\|X^T(y - \bar{y}\mathbf{1})\|_\infty$.
4.2 Feature Correlation
In this problem, we will examine and compare the behavior of the Lasso and ridge regression in the case of an exactly repeated feature. That is, consider the design matrix $X \in \mathbb{R}^{m \times d}$, where $X_i = X_j$ for some $i$ and $j$, where $X_i$ is the ith column of $X$. We will see that ridge regression divides the weight equally among identical features, while Lasso divides the weight arbitrarily. In an optional part to this problem, we will consider what changes when $X_i$ and $X_j$ are highly correlated (e.g. exactly the same except for some small random noise) rather than exactly the same.
- Without loss of generality, assume the first two columns of $X$ are our repeated features. Partition $X$ and $\theta$ as follows:
$$X = \begin{pmatrix} x_1 & x_2 & X_r \end{pmatrix}, \qquad \theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_r \end{pmatrix}.$$
We can write the Lasso objective function as
$$J(\theta) = \|X\theta - y\|_2^2 + \lambda\|\theta\|_1 = \|x_1\theta_1 + x_2\theta_2 + X_r\theta_r - y\|_2^2 + \lambda|\theta_1| + \lambda|\theta_2| + \lambda\|\theta_r\|_1 .$$
With repeated features, there will be multiple minimizers of $J(\theta)$. Suppose that $\hat{\theta} = (a, b, \hat{\theta}_r^T)^T$ is a minimizer of $J(\theta)$. Give conditions on $c$ and $d$ such that $(c, d, \hat{\theta}_r^T)^T$ is also a minimizer of $J(\theta)$. [Hint: First show that $a$ and $b$ must have the same sign, or at least one of them is zero.
Then, using this result, rewrite the optimization problem to derive a relation between a and b.]
- Using the same notation as the previous problem, suppose $\hat{\theta} = (a, b, \hat{\theta}_r^T)^T$ minimizes the ridge regression objective function. What is the relationship between $a$ and $b$, and why?
- [Optional] What do you think would happen with Lasso and ridge when $X_i$ and $X_j$ are highly correlated, but not exactly the same? You may investigate this experimentally or theoretically.
5 [Optional] The Ellipsoids in the $\ell_1$/$\ell_2$ regularization picture
Recall the famous picture purporting to explain why $\ell_1$ regularization leads to sparsity, while $\ell_2$ regularization does not. Here's the instance from Hastie et al.'s The Elements of Statistical Learning:
(While Hastie et al. use $\beta$ for the parameters, we'll continue to use $w$.)
In this problem we'll show that the level sets of the empirical risk are indeed ellipsoids centered at the empirical risk minimizer $\hat{w}$.
Consider linear prediction functions of the form $x \mapsto w^T x$. Then the empirical risk for $f(x) = w^T x$ under the square loss is
$$\hat{R}_n(w) = \frac{1}{n}\sum_{i=1}^{n} \left(w^T x_i - y_i\right)^2 = \frac{1}{n}\left(Xw - y\right)^T\left(Xw - y\right),$$
where $X$ is the design matrix whose ith row is $x_i^T$.
- [Optional] Let $\hat{w} = (X^T X)^{-1} X^T y$ (assume $X^T X$ is invertible). Show that $\hat{w}$ has empirical risk given by
$$\hat{R}_n(\hat{w}) = \frac{1}{n}\left( y^T y - y^T X \hat{w} \right).$$
- [Optional] Show that for any $w$ we have
$$\hat{R}_n(w) = \frac{1}{n}\left( w - \hat{w} \right)^T X^T X \left( w - \hat{w} \right) + \hat{R}_n(\hat{w}).$$
Note that the RHS (i.e. right hand side) has one term that's quadratic in $w$ and one term that's independent of $w$. In particular, the RHS does not have any term that's linear in $w$. On the LHS (i.e. left hand side), we have $\hat{R}_n(w) = \frac{1}{n}(Xw - y)^T(Xw - y)$. After expanding this out, you'll have terms that are quadratic, linear, and constant in $w$. Completing the square is the tool for rearranging an expression to get rid of the linear terms. The following "completing the square" identity is easy to verify just by multiplying out the expressions on the RHS (for symmetric invertible $M$):
$$x^T M x - 2 b^T x = \left(x - M^{-1} b\right)^T M \left(x - M^{-1} b\right) - b^T M^{-1} b .$$
- [Optional] Using the expression derived for $\hat{R}_n(w)$ in the previous part, give a very short proof that $\hat{w} = (X^T X)^{-1} X^T y$ is the empirical risk minimizer. That is,
$$\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \hat{R}_n(w).$$
Hint: Note that $X^T X$ is positive semidefinite and, by definition, a symmetric matrix $M$ is positive semidefinite iff $x^T M x \ge 0$ for all $x \in \mathbb{R}^d$.
- [Optional] Give an expression for the set of $w$ for which the empirical risk exceeds the minimum empirical risk $\hat{R}_n(\hat{w})$ by an amount $c > 0$. If $X$ is full rank, then $X^T X$ is positive definite, and this set is an ellipsoid; what is its center?
6 [Optional] Projected SGD via Variable Splitting
In this question, we consider another general technique that can be used on the Lasso problem. We first use the variable splitting method to transform the Lasso problem to a differentiable problem with linear inequality constraints, and then we can apply a variant of SGD.
Representing the unknown vector $\theta$ as a difference of two non-negative vectors $\theta^+$ and $\theta^-$, the $\ell_1$-norm of $\theta$ is given by $\sum_{i=1}^{d} \theta_i^+ + \sum_{i=1}^{d} \theta_i^-$. Thus, the optimization problem can be written as
$$(\hat{\theta}^+, \hat{\theta}^-) = \operatorname*{argmin}_{\theta^+, \theta^- \in \mathbb{R}^d} \sum_{i=1}^{m} \left( h_{\theta^+, \theta^-}(x_i) - y_i \right)^2 + \lambda\sum_{i=1}^{d} \theta_i^+ + \lambda\sum_{i=1}^{d} \theta_i^- \quad\text{such that } \theta^+ \ge 0 \text{ and } \theta^- \ge 0,$$
where $h_{\theta^+, \theta^-}(x) = (\theta^+ - \theta^-)^T x$. The original parameter $\theta$ can then be estimated as $\hat{\theta} = (\hat{\theta}^+ - \hat{\theta}^-)$.
This is a convex optimization problem with a differentiable objective and linear inequality constraints. We can approach this problem using projected stochastic gradient descent, as discussed in lecture. Here, after taking our stochastic gradient step, we project the result back into the feasible set by setting any negative components of $\theta^+$ and $\theta^-$ to zero.
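Below is a minimal sketch of this procedure, assuming the total-square-loss objective above and spreading the regularization term evenly across the per-example terms; the step size and epoch count are placeholders you would tune.

```python
import numpy as np

def projected_sgd_lasso(X, y, lam, step_size=1e-3, num_epochs=100, seed=None):
    """Projected SGD for the split Lasso formulation (illustrative sketch).

    Treats sum_i (theta^T x_i - y_i)^2 + lam * (1^T theta_plus + 1^T theta_minus)
    as a sum over examples, with the regularization split evenly across the m terms.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta_plus = np.zeros(d)
    theta_minus = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(m):
            x_i, y_i = X[i], y[i]
            residual = (theta_plus - theta_minus) @ x_i - y_i
            # Stochastic gradient of a single term of the summed objective.
            grad_plus = 2 * residual * x_i + lam / m
            grad_minus = -2 * residual * x_i + lam / m
            theta_plus -= step_size * grad_plus
            theta_minus -= step_size * grad_minus
            # Projection onto the feasible set: clip negative components to zero.
            np.maximum(theta_plus, 0.0, out=theta_plus)
            np.maximum(theta_minus, 0.0, out=theta_minus)
    return theta_plus - theta_minus
```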
- [Optional] Implement projected SGD to solve the above optimization problem for the same $\lambda$'s as used with the shooting algorithm. Since the two optimization algorithms should find essentially the same solutions, you can check the algorithms against each other. Report the differences in validation loss between the two optimization methods for each $\lambda$. (You can make a table or plot the differences.)
- [Optional] Choose the $\lambda$ that gives the best performance on the validation set. Describe the solution $\hat{w}$ in terms of its sparsity. How does the sparsity compare to the solution from the shooting algorithm?