
CS 189 Introduction to Machine Learning, Fall 2017

This homework is due Friday, September 29 at 10pm.

1 Getting Started

You may typeset your homework in LaTeX or submit neatly handwritten and scanned solutions. Please make sure to start each question on a new page, as grading (with Gradescope) is much easier that way! Deliverables:

1. Submit a PDF of your write-up to the Gradescope assignment HW[n] Write-Up.
2. Submit all code needed to reproduce your results to HW[n] Code.
3. Submit your test set evaluation results to HW[n] Test Set.

After you've submitted your homework, be sure to watch out for the self-grade form.

(a) Before you start your homework, write down your team. Who else did you work with on this homework? List names and email addresses. In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?

(b) Please copy the following statement and sign next to it:

I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write-up.


2 Properties of Convex Functions

In lecture, we will introduce gradient descent as a powerful general optimization tool which, under most conditions, will converge to a local minimum of an objective function. However, a local minimum may not always be a good solution. When a function is convex, its local minima are global minima. Thus, gradient descent is an especially powerful tool for numerically optimizing convex functions. It is therefore very useful to be able to recognize whether or not a function is convex.

Before we get to convex functions, though, let's talk about convex sets. A convex set is a set $S$ where

$$x_1 \in S,\; x_2 \in S \implies \lambda x_1 + (1 - \lambda) x_2 \in S,$$

where $0 \le \lambda \le 1$. In words, a convex set is a set $S$ where $x_1 \in S$ and $x_2 \in S$ imply that everything in between $x_1$ and $x_2$ is also in $S$.
There are several equivalent ways to define a convex function. Here are three:

  • A function $f$ is convex if

    $$\lambda f(x_1) + (1 - \lambda) f(x_2) \ge f(\lambda x_1 + (1 - \lambda) x_2)$$

    for any $x_1$ and $x_2$ in the domain of $f$ and $0 \le \lambda \le 1$.

  • A function $f$ is convex if the set $S_f = \{(x, y) \mid x \in \mathbb{R}^n,\, y \in \mathbb{R},\, y \ge f(x)\}$ is convex. The set $S_f$ is called the epigraph of the function $f$. It is the set of all points that lie above the graph of $f$.

  • A function f is convex if its Hessian matrix (the matrix of second partial derivatives) is positive semi-definite everywhere on the domain of f . This definition assumes the Hessian matrix exists everywhere on the domain, which is not true for non-differentiable functions.
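For intuition, here is a minimal Python sketch (the helper chord_check is hypothetical, not part of the assignment) that numerically spot-checks the first definition on randomly sampled points. A check like this can reveal non-convexity, but it can only suggest, never prove, convexity:

import numpy as np

def chord_check(f, num_trials=10000, domain=(-10.0, 10.0)):
    # Spot-check lam*f(x1) + (1-lam)*f(x2) >= f(lam*x1 + (1-lam)*x2).
    rng = np.random.default_rng(0)
    for _ in range(num_trials):
        x1, x2 = rng.uniform(*domain, size=2)
        lam = rng.uniform()
        if lam * f(x1) + (1 - lam) * f(x2) < f(lam * x1 + (1 - lam) * x2) - 1e-9:
            return False  # found a violating chord: f is not convex
    return True  # no violation found: consistent with convexity

print(chord_check(lambda x: x ** 2))     # True: x^2 is convex
print(chord_check(lambda x: np.sin(x)))  # False: sin is not convex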
(a) Give an example of a function $f$ from $\mathbb{R}$ to $\mathbb{R}$ which is convex, but not differentiable at at least one point in the domain of $f$.
(b) Compute the Hessian of the function $f$ which maps a vector $x$ to its loss in the ordinary least squares problem. As a reminder, the function $f$ is

$$f(x) = \|Ax - y\|_2^2,$$

where $x \in \mathbb{R}^d$ and $A$ and $y$ are constants. This is the function we minimize to get the optimal solution $x$. Argue that the Hessian of $f$ is positive semi-definite.
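As a numerical sanity check for part (b) (a sketch only, not a derivation: it estimates the Hessian by central finite differences on a random instance rather than computing it in closed form):

import numpy as np

def numerical_hessian(f, x, h=1e-4):
    # Estimate the Hessian of f at x by central finite differences.
    d = x.size
    H = np.zeros((d, d))
    E = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + h*E[i] + h*E[j]) - f(x + h*E[i] - h*E[j])
                       - f(x - h*E[i] + h*E[j]) + f(x - h*E[i] - h*E[j])) / (4 * h * h)
    return H

rng = np.random.default_rng(0)
A, y = rng.normal(size=(8, 5)), rng.normal(size=8)
f = lambda x: np.sum((A @ x - y) ** 2)
H = numerical_hessian(f, rng.normal(size=5))
print(np.linalg.eigvalsh(H).min() >= -1e-6)  # expect True: the Hessian looks PSD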

(c) Prove that if $f$ is convex and if $x_0$ is a local minimum of $f$, then $x_0$ is also a global minimum. Remember that a local minimum is defined as follows: if $x_0$ is a local minimum, then there exists an $\epsilon > 0$ such that

$$\|x_0 - x\| < \epsilon \implies f(x) \ge f(x_0).$$

(Hint: first express clearly what it means for a function to have a global minimum at $x_0$.)


(d) If f and g are convex functions, consider the functions h defined below. Either prove that h is always convex (for any f and g) or provide a counter-example (where f and g are convex, but h is not):

(i) $h(x) = f(x) + g(x)$
(ii) $h(x) = \min\{f(x), g(x)\}$
(iii) $h(x) = \max\{f(x), g(x)\}$
(iv) $h(x) = f(g(x))$

3 Canonical Correlation Analysis

In this problem, we will work our way through the singular value decomposition, and show how it helps yield a solution to the problem of canonical correlation analysis.

(a) Let $n \ge d$. For a matrix $A \in \mathbb{R}^{n \times d}$ having full column rank and singular value decomposition $A = U \Sigma V^\top$, we know that the singular values are given by the diagonal entries of $\Sigma$, and the left (resp. right) singular vectors are the columns of the matrix $U$ (resp. $V$). Both $U$ and $V$ have orthonormal columns.

Show that $A = \sum_{i=1}^{d} \sigma_i u_i v_i^\top$, where the $i$th singular value is denoted by $\sigma_i = \Sigma_{ii}$ and $u_i$ and $v_i$ are the $i$th left and right singular vectors, respectively.
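A quick numerical illustration of this rank-one decomposition (a sketch for intuition only; note that numpy.linalg.svd returns $V^\top$ rather than $V$):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))        # n >= d; full column rank with probability 1
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as the sum of rank-one terms sigma_i * u_i v_i^T
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))   # expect True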

(b) With the setup above, show that

i) $A^\top A$ has $i$th eigenvalue $\lambda_i = \sigma_i^2$, with associated eigenvector $v_i$.
ii) $A A^\top$ has $i$th eigenvalue $\lambda_i = \sigma_i^2$, with associated eigenvector $u_i$.

Notice that both of the above matrices are symmetric.
(c) Use the first part to show that

$$\sigma_1(A) = \max_{\|u\|_2 = 1,\, \|v\|_2 = 1} u^\top A v,$$

where $\sigma_1(A)$ is the maximum singular value of $A$.

Additionally, show that if $A$ has a unique maximum singular value, then the maximizers $(u, v)$ above are given by the first left and right singular vectors, respectively.

Hint 1: You can express any $u$ with $\|u\|_2 = 1$ as a linear combination of the left singular vectors of the matrix $A$, and similarly $v$ as a linear combination of the right singular vectors. You may or may not find this hint useful.

Hint 2: You may find the following facts, which hold for any two vectors $a, b \in \mathbb{R}^d$, useful. Cauchy-Schwarz inequality: $|a^\top b| \le \|a\|_2 \|b\|_2$, with equality when $b$ is a scaled version of $a$. Holder's inequality: $|a^\top b| \le \|a\|_1 \|b\|_\infty$. Here, the $\ell_1$ and $\ell_\infty$ norms of a vector $v$ are defined by $\|v\|_1 = \sum_i |v_i|$ and $\|v\|_\infty = \max_i |v_i|$. Let us say the vector $b$ is fixed; then one way to achieve equality in Holder's inequality is to:


Let $i^*$ be such that $|b_{i^*}| = \|b\|_\infty$.
Set $a_{i^*} = \|a\|_1$, and $a_j = 0$ for all $j \ne i^*$.

(d) Define the correlation coefficient between two scalar zero-mean random variables $P$ and $Q$ as

$$\rho(P, Q) = \frac{E[PQ]}{\sqrt{E[P^2]\, E[Q^2]}}.$$
Let us now look at the canonical correlation analysis problem, where we are given two (say, zero-mean) random vectors $X, Y \in \mathbb{R}^d$, with covariance (and variance) given by

$$E[X X^\top] = \Sigma_{XX}, \quad E[Y Y^\top] = \Sigma_{YY}, \quad E[X Y^\top] = \Sigma_{XY}.$$

Note that a linear combination of the random variables in $X$ can be written as $a^\top X$, and similarly, a linear combination of RVs in $Y$ can be written as $b^\top Y$. Note that $a^\top X, b^\top Y \in \mathbb{R}$ are scalar random variables.

The goal of CCA is to find linear combinations that maximize the correlation. In other words, we want to solve the problem

$$\rho^* = \max_{a, b \in \mathbb{R}^d} \rho(a^\top X, b^\top Y).$$

Show that the problem can be rewritten as

$$\rho^* = \max_{a, b \in \mathbb{R}^d} \frac{a^\top \Sigma_{XY} b}{(a^\top \Sigma_{XX} a)^{1/2}\, (b^\top \Sigma_{YY} b)^{1/2}}.$$

Conclude that if $(a, b)$ is a maximizer above, then $(\alpha a, \beta b)$ is a maximizer for any $\alpha, \beta > 0$.
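A small numerical illustration of this scale-invariance claim (a sketch; the covariance matrices here are hypothetical, estimated from synthetic samples):

import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 6))
X, Y = Z[:, :3], 0.7 * Z[:, :3] + 0.5 * Z[:, 3:]   # correlated pair of R^3 vectors
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)
Sxx, Syy, Sxy = X.T @ X / 1000, Y.T @ Y / 1000, X.T @ Y / 1000

def cca_objective(a, b):
    # a^T S_xy b / ((a^T S_xx a)^{1/2} (b^T S_yy b)^{1/2})
    return (a @ Sxy @ b) / np.sqrt((a @ Sxx @ a) * (b @ Syy @ b))

a, b = rng.normal(size=3), rng.normal(size=3)
print(np.isclose(cca_objective(a, b), cca_objective(2.7 * a, 0.3 * b)))  # expect True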

(e) Assume that the covariance matrices $\Sigma_{XX}$ and $\Sigma_{YY}$ are full rank, and denote the maximizers in the above problem by $(a^*, b^*)$. Use the above parts to show that

i) $(\rho^*)^2$ is the maximum eigenvalue of the matrix $\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^\top \Sigma_{XX}^{-1/2}$.

ii) $\Sigma_{XX}^{1/2} a^*$ is the maximal eigenvector of the matrix $\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{XY}^\top \Sigma_{XX}^{-1/2}$.

iii) $\Sigma_{YY}^{1/2} b^*$ is the maximal eigenvector of the matrix $\Sigma_{YY}^{-1/2} \Sigma_{XY}^\top \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1/2}$.

Hint: An appropriate change of variables may make your life easier.
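For intuition, here is a small synthetic check of claim i) (a sketch only; it builds sample covariances, forms the whitened matrix, and compares its top eigenvalue to the squared correlation achieved by the recovered directions):

import numpy as np
from numpy.linalg import inv, eigh
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
Z = rng.normal(size=(5000, 4))
X, Y = Z[:, :2], 0.8 * Z[:, :2] + 0.6 * Z[:, 2:]
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)
Sxx, Syy, Sxy = X.T @ X / 5000, Y.T @ Y / 5000, X.T @ Y / 5000

Sxx_isqrt = inv(sqrtm(Sxx).real)
M = Sxx_isqrt @ Sxy @ inv(Syy) @ Sxy.T @ Sxx_isqrt
evals, evecs = eigh(M)                 # eigh sorts eigenvalues in ascending order
rho_sq, w = evals[-1], evecs[:, -1]

a = Sxx_isqrt @ w                      # undo the change of variables
b = inv(Syy) @ Sxy.T @ a               # optimal b for this fixed a (closed form)
corr = (a @ Sxy @ b) / np.sqrt((a @ Sxx @ a) * (b @ Syy @ b))
print(np.isclose(rho_sq, corr ** 2))   # expect True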

(f) Argue why vanilla CCA is useless when the random vectors $X$ and $Y$ are uncorrelated, where by this we mean that $\mathrm{cov}(X_i, Y_j) = 0$ for all $i, j$. If you happen to know that $X$ and $Y^2$ (where $Y^2$ is defined by squaring each entry of $Y$) share a linear relationship, how might you modify the CCA procedure to account for this?

4 Mooney Reconstruction

In this problem, we will try to restore photos of celebrities from Mooney photos, which are binarized faces. In order to do this, we will leverage a large training set of grayscale faces and Mooney faces.

Producing a face reconstruction from a binarized counterpart is a challenging high dimensional problem, but we will show that we can learn to do so from data. In particular, using the power of Canonical Correlation Analysis (CCA), we will reduce the dimensionality of the problem by projecting the data into a subspace where the images are most correlated.

Images are famously redundant and well representable in lower-dimensional subspaces, as the eigenfaces example in class showed. However, here our goal is to relate two different kinds of images to each other. Let's see what happens.

Figure 1: A binarized Mooney image of a face being restored to its original grayscale image.

The following datasets will be used for this project: X_train.p, Y_train.p, X_test.p and Y_test.p. The training data X_train.p contains 956 binarized images, where $x_i \in \mathbb{R}^{15 \times 15 \times 3}$. The test data X_test.p contains 255 binarized images with the same dimensionality. Y_train.p contains 956 corresponding grayscale images, where $y_i \in \mathbb{R}^{15 \times 15 \times 3}$. Y_test.p contains 255 grayscale images that correspond to X_test.p.

Throughout the problem we will assume that all data points are flattened and standardized as follows:

$$x = (x / 255) \times 2.0 - 1.0$$


(Note, however, that this standardization at load time does not remove the need to do actual mean removal during processing.)

Please use only the following libraries to solve the problem in Python 3:

import pickle
from scipy.linalg import eig
from scipy.linalg import sqrtm
from numpy.linalg import inv
from numpy.linalg import svd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
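A minimal loading-and-standardization sketch (an assumption-laden illustration: it presumes each pickle file holds an array of images shaped (num_images, 15, 15, 3), and the helper name is ours):

import pickle
import numpy as np

def load_and_standardize(path):
    # Load a pickle of images, flatten each image, and map pixels to [-1, 1].
    with open(path, 'rb') as f:
        data = np.asarray(pickle.load(f), dtype=np.float64)
    data = data.reshape(len(data), -1)   # flatten to (num_images, 675)
    return (data / 255.0) * 2.0 - 1.0    # the standardization given above

X_train = load_and_standardize('X_train.p')
Y_train = load_and_standardize('Y_train.p')
X_test = load_and_standardize('X_test.p')
Y_test = load_and_standardize('Y_test.p')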

(a) We use CCA to find pairs of directions $a$ and $b$ that maximize the correlation between the projections $\tilde{x} = a^\top x$ and $\tilde{y} = b^\top y$. From the previous problem, we know that this can be done by solving the problem

$$\rho = \max_{a, b} \frac{a^\top \Sigma_{XY} b}{(a^\top \Sigma_{XX} a)^{1/2}\, (b^\top \Sigma_{YY} b)^{1/2}},$$

where $\Sigma_{XX}$ and $\Sigma_{YY}$ denote the covariance matrices of the $X$ and $Y$ images, and $\Sigma_{XY}$ denotes the covariance between $X$ and $Y$. Note that unlike in the previous problem, we are not given the covariance matrices and must estimate them from $X, Y$ image samples.

Write down how to estimate the three covariance matrices from finite samples of data and implement the code for it.
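One possible implementation of the standard empirical estimates (a sketch, reusing the X_train and Y_train arrays from the loading snippet above; the mean removal is needed because the pixel standardization does not center the data):

import numpy as np

def empirical_covariances(X, Y):
    # Empirical covariance estimates from n paired samples (one sample per row).
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)   # mean removal
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    return Sxx, Syy, Sxy

Sxx, Syy, Sxy = empirical_covariances(X_train, Y_train)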

(b) We know from the previous problem that we are interested in the maximum singular value of the matrix $\Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$, and that this corresponds to the maximum correlation coefficient $\rho$.

Now, however, plot the full spectrum of singular values of the matrix $(\Sigma_{XX} + \lambda I)^{-1/2} \Sigma_{XY} (\Sigma_{YY} + \lambda I)^{-1/2}$. Note that for numerical reasons we need to add a very small scalar times the identity matrix to the covariance terms. Set $\lambda = 0.00001$.
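A sketch of the regularized spectrum computation, reusing the covariance estimates above (the .real guards against the tiny imaginary parts sqrtm can return for near-singular inputs):

import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import sqrtm
from numpy.linalg import inv, svd

lam = 0.00001
Sxx_isqrt = inv(sqrtm(Sxx + lam * np.eye(Sxx.shape[0]))).real
Syy_isqrt = inv(sqrtm(Syy + lam * np.eye(Syy.shape[0]))).real
T = Sxx_isqrt @ Sxy @ Syy_isqrt   # regularized whitened cross-covariance

U, s, Vt = svd(T)                 # s is the full spectrum of singular values
plt.plot(s)
plt.xlabel('index')
plt.ylabel('singular value')
plt.savefig('spectrum.png')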

(c) You should have noticed from the previous part that we have some singular value decay. It therefore makes sense to only consider the top singular value(s), as in CCA. Let us now try to project our images $x$ onto the subspace spanned by the top $k$ singular vectors. Given the SVD $U \Sigma V^\top = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$, we can use the left singular vectors to form a projection

$$P_k = [u_0, u_1, \ldots, u_k].$$

Show/visualize the face corresponding to the first left singular vector $u_0$. Use the following code for the visualization.

import numpy as np
import cv2

def plot_image(self, vector):
    vector = ((vector + 1.0) / 2.0) * 255.0   # undo the [-1, 1] standardization
    vector = np.reshape(vector, (15, 15, 3))
    p = vector.astype('uint8')
    p = cv2.resize(p, (100, 100))             # upsample for easier viewing
    cv2.imwrite('eigen_face.png', p)
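For instance, with U from the part (b) sketch, the first column can be visualized as follows (plot_image is written as a method, so the self argument is a placeholder here; k is an illustrative choice):

k = 50
P_k = U[:, :k]              # projection onto the top-k left singular vectors
plot_image(None, U[:, 0])   # writes the first eigenface to eigen_face.png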

(d) We will now examine how well the projected data helps generalization when performing regression. You can think of CCA as a technique to help learn better features for the problem. We will use ridge regression to learn a mapping, $w \in \mathbb{R}^{k \times 675}$, from the projected binarized data to the grayscale images. The binarized images are placed in a matrix $X \in \mathbb{R}^{956 \times 675}$:

$$\min_w \|(X P_k) w - Y\|^2 + \lambda \|w\|^2$$

Implement ridge regression with $\lambda = 0.00001$. Plot the squared Euclidean test error for the following values of $k$ (the dimensions you reduce to):

$$k = \{0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 650\}.$$

(A sketch of one possible implementation appears after Figure 2 below.)

(e) Try running the learned model on 4 of the images in the test set and report the results. Give the binarized input, the true grayscale image, and the output of your model. Note: you can use the code from above to visualize the images.

Figure 2: Example results, with the Mooney input on the left, the ground truth in the middle, and the prediction on the right.
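Here is a minimal ridge regression sketch under the setup above (an illustration, not the required implementation: it assumes X_train, Y_train, X_test, Y_test, and U from the earlier sketches, and uses the standard closed-form solve; k = 0, a featureless baseline, is handled separately):

import numpy as np
from numpy.linalg import solve

lam = 0.00001

def ridge_fit(Z, Y, lam):
    # Closed-form ridge regression: w = (Z^T Z + lam * I)^{-1} Z^T Y.
    return solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ Y)

test_errors = []
for k in [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 650]:
    P_k = U[:, :k]
    w = ridge_fit(X_train @ P_k, Y_train, lam)
    pred = (X_test @ P_k) @ w
    test_errors.append(np.sum((pred - Y_test) ** 2))   # squared Euclidean test error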

5 Fruits!

The goal of this problem is to help the class build a dataset for image classification, which will be used later in the course to classify fruits and vegetables. Please pick ten of the following fruits:

1. Apple
2. Banana
3. Oranges
4. Grapes
5. Strawberry
6. Peach
7. Cherry
8. Nectarine
9. Mango
10. Pear
11. Plum
12. Watermelon
13. Pineapple

Take two pictures of each specific fruit, for a total of 20 fruit pictures, against any background such that the fruit is centered in the image and takes up approximately a third of the image in area; see below for examples. Save these pictures as .png files. Do not save the pictures as .jpg files, since these are lower quality and will interfere with the results of your future coding project. Place all the images in a folder titled data. Each image should be titled [fruit_name]_[number].png, where [number] ∈ {0, 1}. Ex: apple_0.png, apple_1.png, banana_0.png, etc. (the ordering is irrelevant). Please also include a file titled rich_labels.txt which contains entries on new lines, prefixed by the file name [fruit_name]_[number], followed by a description of the image (maximum of eight words) with only a space delimiter in between. Ex: apple_0 one fuji red apple on wood table. To turn in the folder, compress it to a .zip and upload it to Gradescope.

Figure 3: Example of properly centered images of four bananas against a table (left) and one orange on a tree (right).


Please keep in mind that data is an integral part of Machine Learning. A large part of research in this field relies heavily on the integrity of data in order to run algorithms. It is, therefore, vital that your data is in the proper format and is accompanied by the correct labeling not only for your grade on this section, but for the integrity of your data.

Note that if your compressed file is over 100 MB, you will need to downsample the images. You can use the following functions from skimage to help.

from skimage.io import imread, imsave
from skimage.transform import resize
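For example, a possible downsampling sketch using those helpers (the target size and filename are illustrative; choose whatever keeps the archive under 100 MB):

from skimage.io import imread, imsave
from skimage.transform import resize

img = imread('data/apple_0.png')
small = resize(img, (256, 256), anti_aliasing=True)        # returns floats in [0, 1]
imsave('data/apple_0.png', (small * 255).astype('uint8'))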

6 Your Own Question

Write your own question, and provide a thorough solution.

Writing your own problems is a very important way to really learn the material. The famous Bloom's Taxonomy that lists the levels of learning is: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g. make yourself flashcards). But we don't want the same to be true about the highest level.

As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.

Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don't have to achieve this every week. But unless you try every week, it probably won't happen ever.

