CS 189 Introduction to Machine Learning Fall 2017
This homework is due Monday, October 30 at 10pm. 1 Getting Started
HW9
You may typeset your homework in latex or submit neatly handwritten and scanned solutions. Please make sure to start each question on a new page, as grading (with Gradescope) is much easier that way! Deliverables:
- Submit a PDF of your writeup to assignment on Gradescope, HW[n] Write-Up
- Submit all code needed to reproduce your results, HW[n] Code.
- Submit your test set evaluation results, HW[n] Test Set.
After youve submitted your homework, be sure to watch out for the self-grade form.
- (a) Before you start your homework, write down your team. Who else did you work with on this homework? List names and email addresses. In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
- (b) Please copy the following statement and sign next to it:
I certify that all solutions are entirely in my words and that I have not looked at another students solutions. I have credited all external sources in this write up.
CS 189, Fall 2017, HW9 1
2 Classification Policy
Suppose we have a classification problem with classes labeled 1,,c and an additional doubt category labeled c+1. Let f : Rd {1,,c+1} be a decision rule. Define the loss function
0 if f(x) = y f(x) {1,,c},
L(f(x),y) = s if f(x) = y f(x) {1,,c}, (1) r iff(x)=c+1
where s 0 is the loss incurred for making a misclassification and r 0 is the loss incurred for choosing doubt. Hence, the risk of classifying a new data point x as class f (x) {1, 2, . . . , c + 1}
is
c
R(f(x)|x) = L(f(x),i)P(Y = i|x).
3
i=1
(a) Show that the following policy obtains the minimum risk:
(a) ChooseclassiifP(Y =i|x)P(Y = j|x)forall jandP(Y =i|x)1r. s
(b) Choose doubt otherwise.
(b) What happens if r = 0? What happens if r > s? Explain why this is consistent with
what one would expect intuitively.
LDA and CCA
Consider the following random variable X Rd , generated using a mixture of two Gaussians. Here, the vectors 1,2 Rd are arbitrary (mean) vectors, and Rdd represents a positive definite (covariance) matrix. For now, we will assume that we know all of these parameters.
Draw a label L {1, 2} such that the label 1 is chosen with probability 1 (and consequently, label 2 with probability 2 = 11), and generate X = N (L,).
(a) Now given a particular X Rd generated from the above model, we wish to find its label. Write out the decision rule corresponding to the following estimates of L:
MLE MAP
Your decision rule should take the form of a threshold: if some function f (X ) > T , then choose the label 1, otherwise choose the label 2. When are these two decision rules the same?
(b) You should have noticed that the function f is linear in its argument X, and takes the form w(X v). We will now show that CCA defined on a suitable set of random variables leads to precisely the same decision rule.
CS 189, Fall 2017, HW9 2
4
LetY R2 beaonehotvectordenotingtherealizationofthelabell,i.e.Yl =1ifL=l,and zero otherwise. Let 1 = 2 = 1/2. Compute the covariance matrices XX,XY and YY as a function of 1,2,. Recall that the random variables are not zero-mean.
(c) Let us now perform CCA on the two random variables. Recall that in order to find the first canonical directions, we look for vectors u Rd and v R2 such that (uX,vY) is maximized.
Show that the maximizing u is proportional to 1(1 2). Recall that u is that di- rection of X that contributes most to predicting Y . What is the relationship between u and the function f (X ) computed in part (a)?
Hint: The Sherman-Morrison formula for matrix inversion may be useful:
Suppose A Rdd is an invertible square matrix and a,b Rd are column vectors. Then,
T 1 1 A1abTA1 (A+ab ) =A 1+bTA1a.
- (d) Repeat the above part for arbitrary probabilities 1,2.
- (e) Now assume you dont know the parameters 1,2,, but are given samples of training data x1,x2,,xn drawn from the above distribution. You are then given a new Xtest, which you wish to classify. Using your intuition from CCA, write down a procedure that predicts the label of Xtest.
Sensors, Objects, and Localization (Part 2)
Let us say there are n objects and m sensors located in a 2d plane. The n objects are located at the points (x1,y1),,(xn,yn). The m sensors are located at the points (a1,b1),,(am,bm). We have measurements for the distances between the objects and the sensors: Di j is the measured distance from object i to sensor j. The distance measurement has noise in it. Specifically, we model
Dij = ||(xi,yi)(aj,bj)||+Zij,
where Zij N(0,1). The noise is independent across different measurements.
Starter code has been provided for data generation to aid your explorations. All Python libraries are permitted, though you must use your own implementation of a Neural Network if you want to get bonus points for implementing stochastic gradient descent. For others parts, you are free to use commercial libraries. You will need to plot ten figures in total in this problem.
Assume we observe Dij = dij with (Xi,Yi) = (xi,yi) for i = 1,,n and j = 1,,m. Here, m = 7. Our goal is to predict (Xi,Yi) from newly observed Di1,,Di7. For a data set with n test data, the error is measured by the average distance between the predicted object locations and the true object locations,
1n (xixi)2+(yiyi)2, n i=1
CS 189, Fall 2017, HW9
3
where (xi,yi) is the location of objects predicted by a model.
We are going to consider five models in this problem and compare their performance:
- In an earlier assignment, we first estimated sensor locations and then used the estimated sensor locations to estimate the new object locations. This is called a Generative Model.
- A Linear Model. Using the training set, the linear model attempts to learn (Xi,Yi) from (Di1,,Di7). Then it predicts (Xi,Yi) from (Di1,,Di7).
- A Second-Order Polynomial Regression Model. The set-up is similar to the linear model, but including second-order polynomial features.
- A Third-Order Polynomial Regression Model. The set-up is similar to the linear model, but including third-order polynomial features.
- A Neural Network Model. The Neural Network should have two hidden layers, each with 100 neurons, and use Relu as the non-linearity. (You are encouraged to explore on your own beyond this however. These parameters were chosen to teach you a hype-deflating lesson.)
- (a) Implement the five model listed above. Submit your code on gradescope and include it in your write-up.
- (b) Fix a set of 7 sensors and generate the following data sets:
- 15 training sets where ntrain varies from 10 to 290 in increments of 20.
- A regular testing data set where ntest = 1000.
- A shifted testing data set where ntest = 1000. You can do this by setting original dist to False in the function generate data in starter.py.
The difference between the regular testing data and the shifted testing data is that the regular testing data is drawn from the same distribution as the training data, whereas the shifted testing data is farther away. Train each of the five models on each of the fifteen training sets. Use your results to generate three figures. Each figure should include all of the models on the same plot so that you can compare them:
- A plot of training error versus ntrain (the amount of data used to train the model) for all of the models.
- A plot of testing error on the regular test set versus ntrain (the amount of data used to train the model) for all of the models.
- A plot of testing error on the shifted test set versus ntrain (the amount of data used to train the model) for all of the models.
Briefly describe your observations. What are the strengths and weaknesses of each model?
CS 189, Fall 2017, HW9 4
5
(c) We are now going to do some hyper-parameter tuning on our neural network. Fix the number of hidden layers to be two and let l be the number of neurons in each of these two layers. Try values for l between 100 and 500 in increments of 50. Use data sets with ntrain200 and ntest = 1, 000. What is the best choice for l (the number of neurons in the hidden layers)? Justify your answer with plots.
- (d) We are going to do some more hyper-parameter tuning on our neural network. Let k be the number of hidden layers and let l be the number of neurons in each hidden layer. Write a formula for the total number of weights in our network in terms of l and k. If we want to keep the total number of weights in the network approximately equal to 10000, find a formula for l in terms of k. Try values of k between 1 and 4 with the appropriate implied choice for l. Use data sets with ntrain = ntest = 200. What is the best choice for k (the number of layers)? Justify your answer with plots.
- (e) Repeat part ((b)), but for the neural network you should use the hyper-parameters you found through hyper-parameter tuning.
- (f) (Bonus) Instead of training your neural network model using gradient descent, use batch stochastic gradient descent. How does the neural network trained with SGD compare to the neural network trained with gradient descent? Justify your answer with plots. (For this part, you must use your own implementation.)
- (g) (Bonus) You might have seen that the neural network performance was dissapointing. Ex- plore increasing the training data as well as playing with the hyper-parameters on your own to improve neural network performance for samples drawn from the regular data set. Can you get anything to generalize to the shifted test data? Report your parameters and present your evidence.
Tensorflow Installation
Install Tensorflow and run the tutorial.
https://www.tensorflow.org/get_started/mnist/beginners
By now, hopefully you have a reasonably good understanding of how Backpropagation and SGD work to train Neural Networks. Implementing these on your own in a good learning experience, but when it comes time for practical applications you are probably going to use existing packages. Starting with the next homework, you will no longer be required to use your own implementation of Neural Networks. Instead, you may use Tensorflow, where SGD and Backpropagation are implemented for you. This week, you should get a head start by installing Tensorflow.
Submit your code (it is fine to copy/paste the tutorial, as long as you get it running) and the exact value you get for the training error on MNIST (it should be approximately 92%).
6 Your Own Question
Write your own question, and provide a thorough solution.
CS 189, Fall 2017, HW9 5
Writing your own problems is a very important way to really learn material. The famous Blooms Taxonomy that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top-level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself. (e.g. make yourself flashcards) But we dont want the same to be true about the highest level.
As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.
Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You dont have to achieve this every week. But unless you try every week, it probably wont happen ever.
CS 189, Fall 2017, HW9 6
Reviews
There are no reviews yet.