CS 189 Introduction to Machine Learning Fall 2017
HW3
This homework is due Friday, September 15 at 12pm noon.

1 Getting Started
You may typeset your homework in LaTeX or submit neatly handwritten and scanned solutions. Please make sure to start each question on a new page, as grading (with Gradescope) is much easier that way! Deliverables:
1. Submit a PDF of your writeup to the assignment on Gradescope, HW[n] Write-Up.
2. Submit all code needed to reproduce your results, HW[n] Code.
3. Submit your test set evaluation results, HW[n] Test Set.
After you've submitted your homework, be sure to watch out for the self-grade form.
(a) Before you start your homework, write down your team. Who else did you work with on this homework? List names and email addresses. In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
(b) Please copy the following statement and sign next to it:
I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write-up.
Throughout the homework, note that:

- $Z \in \mathbb{R}$ implies that $Z$ is a scalar;
- $Z \in \mathbb{R}^d$ is equivalent to saying that $Z$ is a vector with dimensions $d \times 1$;
- $Z \in \mathbb{R}^{n \times d}$ is equivalent to saying that $Z$ is a matrix with dimensions $n \times d$.

When we say that $Z, Z' \in \mathbb{R}$, we mean that $Z \in \mathbb{R}$ and $Z' \in \mathbb{R}$.

2 Probabilistic Model of Linear Regression
Both ordinary least squares and ridge regression have interpretations from a probabilistic standpoint. In particular, assuming a generative model for our data and a particular noise distribution, we will derive least squares and ridge regression as the maximum likelihood and maximum a posteriori parameter estimates, respectively. This problem will walk you through a few steps to do that. (Along with some side digressions to make sure you get a better intuition for ML and MAP estimation.)
- (a) Assume that $X$ and $Y$ are both one-dimensional random variables, i.e. $X, Y \in \mathbb{R}$. Assume an affine model between $X$ and $Y$: $Y = Xw + b + Z$, where $w, b \in \mathbb{R}$, and $Z \sim \mathcal{N}(0, 1)$ is a standard normal (Gaussian) random variable. Assume $w, b$ are not random. What is the conditional distribution of $Y$ given $X$?
- (b) Given $n$ points of training data $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ generated in an iid fashion by the probabilistic setting in the previous part, derive the maximum likelihood estimator for $w, b$ from this training data.
- (c) Now, we are going to change the probabilistic model slightly in two ways. There is no more $b$, and $Z$ has a different distribution. So assume the linear model between $X$ and $Y$ is given by $Y = Xw + Z$, where $Z \sim U[-0.5, 0.5]$ is a continuous random variable uniformly distributed between $-0.5$ and $0.5$. Assume $w$ is not random. What is the conditional distribution of $Y$ given $X$?
- (d) Given $n$ points of training data $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ generated in an iid fashion in the setting of the previous part, derive the maximum likelihood estimator of $w$. (Note that this estimate may not be unique; you may report a set of values, or feel free to break ties in whatever fashion and report one value within this set.)
- (e) Take the model $Y = Xw + b + Z$, where $Z \sim U[-0.5, 0.5]$ is a continuous random variable uniformly distributed between $-0.5$ and $0.5$. Use a computer to simulate $n$ training samples $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ and illustrate what the likelihood of the data looks like in the $(w, b)$ model parameter space after $n = 5, 25, 125, 625$ training samples. Qualitatively describe what happens as $n$ gets large.
(Note that you have considerable design freedom in this problem part. You get to choose how you draw the $X_i$ as well as what true value $(w, b)$ you want to illustrate. You have total freedom in using python libraries for this problem part. No restrictions. One possible starting point is sketched below.)
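The following is a minimal sketch of one way to visualize this likelihood, assuming (hypothetically) that the $X_i$ are drawn uniformly from $[-1, 1]$ and that the true parameters are $w = 1$, $b = 0.5$. Under the $U[-0.5, 0.5]$ noise model, the likelihood of a candidate $(w, b)$ is simply the indicator that every residual $Y_i - X_i w - b$ lies in $[-0.5, 0.5]$.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical choices for illustration: true parameters and how X is drawn.
w_true, b_true = 1.0, 0.5
rng = np.random.default_rng(0)

def likelihood_grid(X, Y, w_grid, b_grid):
    """Likelihood over a (w, b) grid for Y = Xw + b + Z, Z ~ Uniform[-0.5, 0.5].

    Each residual has density 1 if it lies in [-0.5, 0.5] and 0 otherwise, so
    the likelihood of the whole data set is 1 exactly when every residual
    Y_i - X_i*w - b falls inside [-0.5, 0.5], and 0 otherwise."""
    W, B = np.meshgrid(w_grid, b_grid)
    resid = Y[:, None, None] - X[:, None, None] * W[None] - B[None]
    return np.all(np.abs(resid) <= 0.5, axis=0).astype(float)

w_grid = np.linspace(0.0, 2.0, 300)
b_grid = np.linspace(-0.5, 1.5, 300)
for n in [5, 25, 125, 625]:
    X = rng.uniform(-1.0, 1.0, size=n)
    Y = X * w_true + b_true + rng.uniform(-0.5, 0.5, size=n)
    L = likelihood_grid(X, Y, w_grid, b_grid)
    plt.pcolormesh(w_grid, b_grid, L, shading="auto")
    plt.xlabel("w"); plt.ylabel("b"); plt.title(f"Likelihood support, n = {n}")
    plt.show()
```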
- (f) (One-dimensional Ridge Regression) Now, let us return to the case of Gaussian noise. Given $n$ points of training data $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ generated according to $Y_i = X_i W + Z_i$, where the $Z_i \sim \mathcal{N}(0, 1)$ are iid standard normal random variables. Assume $W \sim \mathcal{N}(0, \sigma^2)$ is also Gaussian and is independent of both the $Z_i$ and the $X_i$. Use Bayes' Theorem to derive the posterior distribution of $W$ given the training data. What is the mean of the posterior distribution of $W$ given the data?
- (g) Consider $n$ training data points $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ generated according to $Y_i = w^\top X_i + Z_i$, where $Y_i \in \mathbb{R}$, $w, X_i \in \mathbb{R}^d$ with $w$ fixed, and the $Z_i \sim \mathcal{N}(0, 1)$ are iid standard normal random variables. Derive why the maximum likelihood estimator for $w$ is the solution to a least squares problem.
- (h) (Multi-dimensional ridge regression) Consider the setup of the previous part: $Y_i = W^\top X_i + Z_i$, where $Y_i \in \mathbb{R}$, $W, X_i \in \mathbb{R}^d$, and the $Z_i \sim \mathcal{N}(0, 1)$ are iid standard normal random variables. Now, however, we have a prior for the vector $W$ (now a random variable): the $W_j \sim \mathcal{N}(0, \sigma^2)$ are iid for $j = 1, 2, \ldots, d$. Derive the posterior distribution of $W$ given all the $(X_i, Y_i)$ pairs. What is the mean of the posterior distribution of the vector $W$?
- (i) Consider $d = 2$ and the setting of the previous part. Use a computer to simulate and illustrate what the a-posteriori probability looks like over the $W$ model parameter space after $n = 5, 25, 125$ training samples, for different values of $\sigma^2$.
(Note that you have considerable design freedom in this problem part. You have total freedom in using python libraries for this problem part. No restrictions. Either contour plots or 3d plots of the a-posteriori probability are fine. You don't have to label the contours with values if the story is clear from the plots visually.)
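As a starting point, here is a minimal sketch under hypothetical choices (true parameter $W^* = (1, -0.5)$ and $X_i$ drawn uniformly from $[-1, 1]^2$). It assumes the result of part (h), namely that the posterior of $W$ is Gaussian with covariance $(X^\top X + I/\sigma^2)^{-1}$ and mean $(X^\top X + I/\sigma^2)^{-1} X^\top Y$; if you have not derived that yet, treat those formulas as an assumption to be checked.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Hypothetical true parameter; prior W_j ~ N(0, sigma2), noise Z_i ~ N(0, 1).
w_true = np.array([1.0, -0.5])
rng = np.random.default_rng(0)

def posterior(X, Y, sigma2):
    """Posterior of W for Y_i = W^T X_i + Z_i with Z_i ~ N(0, 1) and prior
    W ~ N(0, sigma2 * I): Gaussian with the covariance and mean below."""
    d = X.shape[1]
    cov = np.linalg.inv(X.T @ X + np.eye(d) / sigma2)
    mean = cov @ X.T @ Y
    return mean, cov

g = np.linspace(-2, 2, 200)
W1, W2 = np.meshgrid(g, g)
grid = np.dstack([W1, W2])

for sigma2 in [0.1, 1.0]:
    for n in [5, 25, 125]:
        X = rng.uniform(-1, 1, size=(n, 2))
        Y = X @ w_true + rng.standard_normal(n)
        mean, cov = posterior(X, Y, sigma2)
        density = multivariate_normal(mean, cov).pdf(grid)
        plt.contour(W1, W2, density)
        plt.xlabel("$w_1$"); plt.ylabel("$w_2$")
        plt.title(f"Posterior, n = {n}, $\\sigma^2$ = {sigma2}")
        plt.show()
```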
3 Simple Bias-Variance Tradeoff
Consider a random variable $X$, which has unknown mean $\mu$ and unknown variance $\sigma^2$. Given $n$ iid realizations of training samples $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$ from the random variable, we wish to estimate the mean of $X$. We will call our estimate the random variable $\hat{X}$ with mean $\hat{\mu}$. There are a few ways we can estimate $\mu$ given the realizations of the $n$ samples (a quick numerical sanity check of these options is sketched after the list):
1. Average the $n$ samples: $\frac{x_1 + x_2 + \cdots + x_n}{n}$.
2. Average the $n$ samples and one sample of 0: $\frac{x_1 + x_2 + \cdots + x_n}{n + 1}$.
3. Average the $n$ samples and $n_0$ samples of 0: $\frac{x_1 + x_2 + \cdots + x_n}{n + n_0}$.
4. Ignore the samples: just return 0.
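To build intuition before doing the algebra, the following is a quick Monte Carlo sanity check of the four estimators. The distribution for $X$ (Gaussian with $\mu = 2$, $\sigma = 3$) and all other numeric choices here are hypothetical and not part of the problem.

```python
import numpy as np

# Hypothetical distribution for X, used only to sanity-check the algebra.
rng = np.random.default_rng(0)
mu, sigma, n, n0, trials = 2.0, 3.0, 10, 5, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
estimators = {
    "1. mean of n samples": samples.mean(axis=1),
    "2. pad with one zero": samples.sum(axis=1) / (n + 1),
    "3. pad with n0 zeros": samples.sum(axis=1) / (n + n0),
    "4. always return 0":   np.zeros(trials),
}
for name, est in estimators.items():
    # Empirical bias and variance across the Monte Carlo trials.
    print(f"{name}: bias ~ {est.mean() - mu:+.3f}, variance ~ {est.var():.3f}")
```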
In the parts of this question, we will measure the bias and variance of each of our estimators. The bias is defined as
$$\mathbb{E}[\hat{X}] - \mu$$
and the variance is defined as
$$\operatorname{Var}[\hat{X}].$$
(a) What is the bias of each of the four estimators above?
- (b) What is the variance of each of the four estimators above?
- (c) Hopefully you see that there is a trade-off: as we add more copies of 0 to our estimator, the bias increases and the variance decreases. It is a common mistake to assume that an unbiased estimator is always best. Let's explore this a bit further. Define the expected total error as the sum of the variance and the square of the bias. Compute the expected total error for each of the estimators above.
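For reference, this definition of expected total error agrees with the usual mean-squared-error decomposition (a standard identity, stated here without proof):
$$\mathbb{E}\left[(\hat{X} - \mu)^2\right] = \operatorname{Var}[\hat{X}] + \left(\mathbb{E}[\hat{X}] - \mu\right)^2 = \text{variance} + \text{bias}^2.$$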
- (d) Argue that all four estimators are really just special cases of the third estimator (with the hyperparameter $n_0$).
- (e) Say that $n_0 = \alpha n$. Find the setting for $\alpha$ that would minimize the expected total error, assuming you secretly knew $\mu$ and $\sigma$. Your answer will depend on $\mu$, $\sigma$, and $n$.
- (f) For this part, let's assume that we had some reason to believe that $\mu$ should be small (close to 0) and that $\sigma$ should be large. In this case, what happens to the expression in the previous part?
- (g) In the previous part, we assumed there was reason to believe that $\mu$ should be small. Now let's assume that we have reason to believe that $\mu$ is not necessarily small, but should be close to some fixed value $\mu_0$. In terms of $X$ and $\mu_0$, how can we define a new random variable $X'$ such that $X'$ is expected to have a small mean?
- (h) Draw a connection between $\alpha$ in this problem and the regularization parameter $\lambda$ in the ridge-regression version of least-squares. What does this problem suggest about choosing a regularization coefficient and handling our data-sets so that regularization is most effective? This is an open-ended question, so do not get too hung up on it.
4 Estimation in linear regression
In linear regression, we estimate a vector $y \in \mathbb{R}^n$ by using the columns of a feature matrix $A \in \mathbb{R}^{n \times d}$. Assume that the number of training samples $n \geq d$ and that $A$ has full column rank. Let us now show how well we can estimate the underlying model as the number of training samples grows.
Assume that the true underlying model for our noisy training observations is given by
$$Y = \underbrace{Ax^*}_{y} + W,$$
with $W \in \mathbb{R}^n$ having iid entries $W_j \sim \mathcal{N}(0, \sigma^2)$ representing the random noise in the observation $Y$. Here, $x^* \in \mathbb{R}^d$ is something arbitrary and not random. After obtaining $\hat{X} = \arg\min_x \|Y - Ax\|^2$, we would like to bound the error $\|A\hat{X} - y\|^2$, which is the prediction error incurred based on the specific $n$ training samples we obtained.
Initially, it might seem that getting a good estimate of this prediction error is hopeless, since the training data's $A$ matrix is involved in some complicated way in the estimate. The point of this problem is to show you that this is not the case and that we can actually understand this.
(a) Using the standard closed form solution to the ordinary least squares problem, show that
$$\|A\hat{X} - y\|_2 = \|A(A^\top A)^{-1} A^\top W\|_2.$$
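For reference, the closed form being referred to is the usual one (valid here because $A$ is assumed to have full column rank, so $A^\top A$ is invertible):
$$\hat{X} = \arg\min_x \|Y - Ax\|_2^2 = (A^\top A)^{-1} A^\top Y.$$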
(b) Given the reduced singular value decomposition $A = U \Sigma V^\top$ (with $U \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^{d \times d}$, and $V \in \mathbb{R}^{d \times d}$; the reduced SVD just ignores all the singular vectors that correspond to directions that are in the nullspace of $A^\top$ since $A$ is a tall skinny matrix), show that we have
$$\|A\hat{X} - y\|_2^2 = \|U^\top W\|_2^2.$$
(Hint: Recall that the standard Euclidean $\ell_2$ norm is unitarily invariant.)
(c) If $W_0$ is a vector with i.i.d. standard Gaussian entries, recall that $a^\top W_0 \sim \mathcal{N}(0, \|a\|_2^2)$. Use this fact to show that
$$\frac{1}{n}\,\mathbb{E}\left[\|A\hat{X} - y\|_2^2\right] = \frac{\sigma^2 d}{n}.$$

(d) Notice that this kind of scaling is precisely how the average squared error for simple averaging scales when we have multiple iid samples of a random variable and we want to estimate its mean. This is why we often think about ordinary least squares as just being a kind of generalized averaging. However, the dimension of the parameters being estimated also matters.

Let us now try to answer another basic question. Let us say we have a true linear model, but we try to estimate it using a polynomial of degree $D$. Clearly, this is overkill, but what do we lose?

Given scalar samples $\{x_i, Y_i\}_{i=1}^n$, let us say that the underlying model is a noisy linear model, i.e. $Y_i = m x_i + c + W_i$. Let us, however, perform polynomial regression: we form the matrix $A$ by using $D + 1$ polynomial features (including the constant) of the distinct sampling points $\{x_i\}_{i=1}^n$. How many samples $n$ do we need to ensure that our average squared error is bounded by $\varepsilon$? Your answer should be expressed as a function of $D$, $\sigma^2$, and $\varepsilon$.

Conclude that as we increase model complexity, we require a proportionally larger number of samples for accurate prediction.

At this point, you are encouraged to take a fresh look at the discussion worksheet which is, in effect, a continuation of this problem, to understand the ramifications of the bias/variance tradeoff in the context of ordinary least-squares regression.

(e) Try the above problem out by yourself, by setting $m = 1$, $c = 1$, and sample $n$ points $\{x_i\}_{i=1}^n$ uniformly from the interval $[-1, 1]$. Generate $Y_i = m x_i + c + W_i$ with $W_i$ representing standard Gaussian noise. Fit a degree-$D$ polynomial to this data and show how the average error $\frac{1}{n}\|A\hat{X} - y\|_2^2$ scales as a function of both $D$ and $n$. You may show separate plots for the two scalings. It may also be helpful to average over multiple realizations of the noise (or to plot point clouds) so that you obtain smooth curves.

(For this part, any and all python libraries are allowed.)
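As a starting point for this simulation, here is a minimal sketch. The particular grid of $D$ and $n$ values, the number of noise realizations, and the use of numpy/matplotlib are arbitrary choices, not requirements.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
m, c = 1.0, 1.0

def avg_error(n, D, trials=100):
    """Average of (1/n)*||A x_hat - y||^2 over noise realizations, where
    y = m*x + c is the noiseless linear part and A holds degree-D polynomial
    features of the sample points."""
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, size=n)
        A = np.vander(x, D + 1, increasing=True)   # columns 1, x, ..., x^D
        y = m * x + c                               # noiseless truth
        Y = y + rng.standard_normal(n)              # noisy observations
        x_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
        errs.append(np.mean((A @ x_hat - y) ** 2))
    return np.mean(errs)

n_fixed, D_fixed = 120, 6
Ds = np.arange(0, 13)
ns = np.arange(20, 401, 20)

plt.plot(Ds, [avg_error(n_fixed, D) for D in Ds], marker="o")
plt.xlabel("degree D"); plt.ylabel("average error"); plt.show()

plt.plot(ns, [avg_error(n, D_fixed) for n in ns], marker="o")
plt.xlabel("n"); plt.ylabel("average error"); plt.show()
```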
5 Robotic Learning of Controls from Demonstrations and Images

Huey, a home robot, is learning to retrieve objects from a cupboard, as shown in Fig. 1. The goal is to push obstacle objects out of the way to expose a goal object. Huey's robot trainer, Anne, provides demonstrations via tele-operation.
During a demonstration, Huey records the RGB images of the scene for each timestep, $x_0, x_1, \ldots, x_T$, where $x_i \in \mathbb{R}^{30 \times 30 \times 3}$, and the controls for his mobile base, $u_0, u_1, \ldots, u_T$, where $u_i \in \mathbb{R}^3$. The controls correspond to making small changes in the 3D pose (i.e. translation and rotation) of his body. Examples of the data are shown in the figure.
Under an assumption (sometimes called the Markovian assumption) that all that matters for the current control is the current image, Huey can try to learn a linear policy $\pi$ (where $\pi \in \mathbb{R}^{2700 \times 3}$) which linearly maps image states to controls (i.e. $\pi^\top x = u$). We will now explore how Huey can recover this policy using linear regression. Note: please use numpy and numpy.linalg to complete this assignment.
Figure 1: A) Huey trying to retrieve a mustard bottle. An example RGB image of the workspace taken from his head-mounted camera is shown in the orange box. The angle of the view gives Huey an eye-in-hand perspective of the cupboard he is reaching into. B) A scatter plot of the 3D control vectors, or $u$ labels. Notice that each coordinate of the label lies within the range $[-1, 1]$ for the change in position. Example images, or states $x$, are shown for some of the corresponding control points. The correspondence is indicated by the blue lines.
(a) Load the $n$ training examples from x_train.p and compose the matrix $X$, where $X \in \mathbb{R}^{n \times 2700}$. Note, you will need to flatten the images to reduce them to a single vector. The flattened image vector will be denoted by $x$ (where $x \in \mathbb{R}^{2700 \times 1}$). Next, load the $n$ examples from y_train.p and compose the matrix $U$, where $U \in \mathbb{R}^{n \times 3}$. Try to perform ordinary least squares to solve
$$\min_\pi \|X\pi - U\|_F^2$$
to learn the policy $\pi \in \mathbb{R}^{2700 \times 3}$. Report what happens as you attempt to do this and explain why.
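A minimal sketch of this attempt is shown below. It assumes (hypothetically) that x_train.p unpickles to an array-like of $n$ images of shape $(30, 30, 3)$ and y_train.p to the matching $n$ control vectors; verify this against the actual files.

```python
import pickle
import numpy as np

# Assumed file layout (verify against the provided files): x_train.p holds n
# images of shape (30, 30, 3); y_train.p holds the matching n controls in R^3.
with open("x_train.p", "rb") as f:
    x_train = np.asarray(pickle.load(f), dtype=float)
with open("y_train.p", "rb") as f:
    U = np.asarray(pickle.load(f), dtype=float)   # shape (n, 3)

n = x_train.shape[0]
X = x_train.reshape(n, -1)                         # shape (n, 2700)

# Ordinary least squares via the normal equations X^T X pi = X^T U.  With
# n < 2700 the Gram matrix X^T X is rank-deficient, so expect this solve to
# raise a LinAlgError or return a numerically meaningless answer.
try:
    pi = np.linalg.solve(X.T @ X, X.T @ U)
    print("solve returned, residual:", np.linalg.norm(X @ pi - U))
except np.linalg.LinAlgError as err:
    print("Normal equations failed:", err)
```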
(b) Now try to perform ridge regression:
$$\min_\pi \|X\pi - U\|_F^2 + \lambda \|\pi\|_F^2$$
on the dataset for regularization values $\lambda \in \{0.1, 1.0, 10, 100, 1000\}$. Measure the average squared Euclidean distance for the accuracy of the policy on the training data:
$$\frac{1}{n} \sum_{i=0}^{n-1} \|\pi^\top x_i - u_i\|_2^2$$
Report the training error results for each value of $\lambda$.
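Continuing the sketch from part (a) (so $X$ and $U$ are assumed to already be loaded and flattened), the ridge solve and training error might look like this:

```python
# Ridge regression for each regularization value; X, U come from the part (a)
# sketch above.
lams = [0.1, 1.0, 10, 100, 1000]
d = X.shape[1]
for lam in lams:
    # Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T U.
    pi = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ U)
    train_err = np.mean(np.sum((X @ pi - U) ** 2, axis=1))
    print(f"lambda = {lam}: average squared training error = {train_err:.4f}")
```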
(c) Next, we are going to try and improve the performance by standardizing the states. For each pixel value in each data point, x, perform the following operation:
$$x \leftarrow \frac{x}{255} \cdot 2 - 1.$$
Since we know the maximum pixel value is 255, this rescales the data to be between $[-1, 1]$. Repeat the previous part and report the average squared training error for each value of $\lambda$.
(d) To better understand how standardizing improved the loss function, we are going to evaluate the condition number $\kappa$ of the optimization, which is defined as
$$\kappa = \frac{\sigma_{\max}(X^\top X + \lambda I)}{\sigma_{\min}(X^\top X + \lambda I)},$$
or the ratio of the maximum eigenvalue to the minimum eigenvalue of the relevant matrix. For the regularization value of $\lambda = 100$, report the condition number with the standardization technique applied and without.
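A short sketch of this computation, again reusing $X$ from the part (a) sketch; since $X^\top X + \lambda I$ is symmetric positive definite, its eigenvalues and singular values coincide.

```python
# Condition number of X^T X + lam*I with and without pixel standardization.
lam = 100.0
X_std = X / 255.0 * 2.0 - 1.0                        # pixels rescaled to [-1, 1]
for Xmat, label in [(X, "without standardization"), (X_std, "with standardization")]:
    M = Xmat.T @ Xmat + lam * np.eye(Xmat.shape[1])
    eigs = np.linalg.eigvalsh(M)                     # symmetric, so eigvalsh applies
    print(f"{label}: condition number = {eigs.max() / eigs.min():.3e}")
```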
(e) Finally, evaluate your learned policy on the new validation data x_test.p and y_test.p for the different values of $\lambda$. Report the average squared Euclidean loss and qualitatively explain how changing the values of $\lambda$ affects the performance in terms of bias and variance.
6 Your Own Question
Write your own question, and provide a thorough solution.
Writing your own problems is a very important way to really learn the material. The famous Bloom's Taxonomy that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g. make yourself flashcards). But we don't want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don't have to achieve this every week. But unless you try every week, it probably won't happen ever.