CS 189 Introduction to Machine Learning Fall 2017
HW14
This homework is due Thursday, December 7 at 10pm.

1 Getting Started

You may typeset your homework in LaTeX or submit neatly handwritten and scanned solutions. Please make sure to start each question on a new page, as grading (with Gradescope) is much easier that way!

Deliverables:
- Submit a PDF of your writeup to the assignment on Gradescope, HW[n] Write-Up.
- Submit all code needed to reproduce your results, HW[n] Code.
- Submit your test set evaluation results, HW[n] Test Set.
After you've submitted your homework, be sure to watch out for the self-grade form.
- (a) Before you start your homework, write down your team. Who else did you work with on this homework? List names and email addresses. In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
- (b) Please copy the following statement and sign next to it:
I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write-up.
2 K-SVD
As you have seen in earlier homework problems, sparse representations are powerful. When we know the right dictionary of features within which our input is likely to have a sparse representation (possibly with some additional small non-sparse noise), we can use techniques like the LASSO or OMP (or even just rounding/thresholding) to recover the coefficients. As we have seen, this has a fundamentally better bias/variance tradeoff.
But what if we don't know the dictionary? How can the dictionary itself be found from the training data? This is the problem of dictionary learning. In this problem, we will introduce the K-SVD algorithm for unsupervised dictionary learning. (The naive perspective on directly supervised dictionary learning would reveal the dictionary to us, so that is not very interesting. However, it should be clear that the dictionary learning algorithm here can be adapted to deal with the case where we already know some dictionary elements to start with and want to learn more.) The K-SVD setting is one where we would like to find a dictionary that enables good sparse fits to our data in the standard least-squares sense.
Mathematically, given an input matrix $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of data points and $d$ is the dimension of each data point, we want to solve the following optimization problem:

$$\underset{D \in \mathbb{R}^{K \times d},\, Z \in \mathbb{R}^{N \times K}}{\text{minimize}} \;\; \|X - ZD\|_F^2 \quad \text{subject to} \quad \|z_i\|_0 \le s \text{ for all } i. \tag{1}$$
In the above problem, $D$ is called a dictionary, which is a matrix in $\mathbb{R}^{K \times d}$. We denote the $k$th row of $D$ by $D_k^T$, where $D_k \in \mathbb{R}^d$ is called an atom. $Z$ is a matrix in $\mathbb{R}^{N \times K}$ whose $i$th row is $z_i^T$. The $i$th row of $Z$, $z_i^T$, contains the coefficients for the linear combination of atoms for the $i$th sample. $\|v\|_0$ is the so-called zero-norm of $v$, which is defined to be the number of nonzero entries in the vector $v$. Throughout the homework, we write $D$, $Z$ as

$$D = \begin{pmatrix} D_1^T \\ D_2^T \\ \vdots \\ D_K^T \end{pmatrix}, \qquad Z = \begin{pmatrix} z_1^T \\ z_2^T \\ \vdots \\ z_N^T \end{pmatrix} = \begin{pmatrix} Z_1, Z_2, \ldots, Z_K \end{pmatrix},$$

with $D_k \in \mathbb{R}^d$, $z_i \in \mathbb{R}^K$, and $Z_k \in \mathbb{R}^N$.
A more human-language interpretation of the optimization problem is: suppose the $i$th row of $X$ represents a data sample $x_i^T$. Our goal is to approximate $x_i$ by a sparse linear combination of atoms in $D$:

$$x_i \approx \sum_{k=1}^{K} z_{ik} D_k,$$

where the coefficients of the linear combination are given in $Z$ and should have at most $s$ nonzero entries for each sample.
A greedy algorithm called K-SVD is proposed for the above optimization problem. The algorithm optimizes the objective over $D$ and $Z$ iteratively. The pseudocode of the algorithm is given in Algorithm 1. It alternates between two stages: Sparse Coding (Algorithm 2) and Update Codebook (Algorithm 4). The Sparse Coding procedure finds a sparse representation for each sample with Algorithm 3 based on a given codebook/dictionary. The procedure for updating the codebook/dictionary updates each row of the codebook with Algorithm 5 in a for loop, while modifying the affected representation coefficients correspondingly.
Algorithm 1 K-SVD
Require: Data matrix $X \in \mathbb{R}^{N \times d}$; number of atoms $K$; sparsity constraint $s$.
Ensure: A dictionary $D \in \mathbb{R}^{K \times d}$ and the coefficient matrix $Z \in \mathbb{R}^{N \times K}$.
function K-SVD($X, K, s$)
  Initialize $D = D_0$ by randomly choosing $K$ samples without replacement from the dataset and aligning them as the rows of $D_0$.
  Initialize $Z$ by $Z$ = Sparse-Coding($D, X, s, 0$).
  while not converged do
    $Z$ = Sparse-Coding($D, X, s, Z$)
    Update-Codebook($D, Z, X$)
  end while
  Return $D, Z$.
end function
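For concreteness, the outer loop of Algorithm 1 might look roughly like the sketch below. This is only an illustration: the function and helper names (k_svd, sparse_coding, update_codebook) are placeholders standing in for Algorithms 2 and 4, not the starter code's actual interface.

import numpy as np

def k_svd(X, K, s, sparse_coding, update_codebook, max_iters=100, tol=1e-1, seed=0):
    """Sketch of the K-SVD outer loop; the two helpers implement Algorithms 2 and 4."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialize D_0 with K distinct data points as its rows.
    D = X[rng.choice(N, size=K, replace=False)].astype(float).copy()
    Z = np.zeros((N, K))
    prev_err = np.inf
    for _ in range(max_iters):
        Z = sparse_coding(D, X, s, Z)          # sparse coding stage (Algorithm 2)
        update_codebook(D, Z, X)               # codebook update stage (Algorithm 4), in place
        err = np.linalg.norm(X - Z @ D, 'fro') ** 2
        if prev_err - err < tol:               # stop once the decrease is small
            break
        prev_err = err
    return D, Z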
Algorithm 2 Sparse-Coding
Require: Dictionary $D$; data matrix $X \in \mathbb{R}^{N \times d}$; sparsity constraint $s$; current coefficient matrix $Z$.
Ensure: The coefficient matrix $Z \in \mathbb{R}^{N \times K}$.
function SPARSE-CODING($D, X, s, Z$)
  for $n = 1, \ldots, N$ do
    $\tilde{z}_n$ = Sparse-Coding-Single($D, x_n, s$)
    if $\|x_n^T - z_n^T D\|_F > \|x_n^T - \tilde{z}_n^T D\|_F$ then
      Update $z_n = \tilde{z}_n$
    end if
  end for
  Return $Z = (z_1, z_2, \ldots, z_N)^T$.
end function
Algorithm 3 Sparse-Coding-Single
Require: Dictionary $D \in \mathbb{R}^{K \times d}$; data vector $x \in \mathbb{R}^d$; sparsity constraint $s$.
Ensure: A coefficient vector $z \in \mathbb{R}^K$.
function SPARSE-CODING-SINGLE($D, x, s$)
  Initialize $z = (0, 0, \ldots, 0) \in \mathbb{R}^K$.
  Initialize residue $R_1 = x$. Initialize the basis set $B_1 = \{\}$.
  for $l = 1, \ldots, s$ do
    Let $D_k$ be the $k$th row of $D$.
    Find an atom $D_{k_l} \in \mathbb{R}^d$ from the dictionary $D$ that minimizes the residue error of linear regression incorporating the extra atom:
    $$\beta, k_l = \operatorname*{argmin}_{\beta \in \mathbb{R}^{l},\, k} \Big\| x - \sum_{D_j \in B_l \cup \{D_k\}} \beta_j D_j \Big\|^2.$$
    Update the basis set $B_{l+1} = \{D_{k_l}\} \cup B_l$.
  end for
  For $D_j \in B_{s+1}$, set $z_j = \beta_j$.
  Return $z$.
end function
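One way to realize Algorithm 3 in numpy is to grow the active set greedily and refit an ordinary least-squares problem for each candidate atom. This is only a sketch of the idea, not the provided sparse_coding helper.

import numpy as np

def sparse_coding_single(D, x, s):
    """Greedy sparse coding of one sample x against a dictionary D whose rows are atoms."""
    K, d = D.shape
    basis = []                                    # indices of atoms chosen so far
    coeffs = None
    for _ in range(s):
        best = (np.inf, None, None)               # (error, atom index, coefficients)
        for k in range(K):
            if k in basis:
                continue
            A = D[basis + [k]].T                  # d x (|B_l| + 1) design matrix
            beta, *_ = np.linalg.lstsq(A, x, rcond=None)
            err = np.linalg.norm(x - A @ beta) ** 2
            if err < best[0]:
                best = (err, k, beta)
        basis.append(best[1])
        coeffs = best[2]
    z = np.zeros(K)
    z[basis] = coeffs                             # coefficients of the chosen atoms
    return z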
Algorithm 4 Update-Codebook
Require: Coefficient matrix $Z \in \mathbb{R}^{N \times K}$; data matrix $X \in \mathbb{R}^{N \times d}$; the dictionary $D$.
Ensure: Dictionary $D \in \mathbb{R}^{K \times d}$.
function UPDATE-CODEBOOK($Z, X, D$)
  for $k = 1, \ldots, K$ do
    Update the $k$th row of $D$: Update-Codebook-Single($Z, X, D, k$)
  end for
end function
Algorithm 5 Update-Codebook-Single
Require: Coefficient matrix $Z \in \mathbb{R}^{N \times K}$; data matrix $X \in \mathbb{R}^{N \times d}$; the dictionary $D$; index $k$.
function UPDATE-CODEBOOK-SINGLE($Z, X, D, k$)
  Find the indices of data samples that use this atom in their current sparse representation: $\omega_k = \{\, n \mid 1 \le n \le N,\ z_{nk} \ne 0 \,\}$.
  Compute the error matrix $E_k \in \mathbb{R}^{N \times d}$ that represents how well things are currently represented without this atom:
  $$E_k = X - \sum_{j \ne k} Z_j D_j^T \in \mathbb{R}^{N \times d}.$$
  Obtain $E_k^R \in \mathbb{R}^{|\omega_k| \times d}$ by choosing the rows of $E_k$ whose index is in $\omega_k$.
  Compute the SVD decomposition $E_k^R = U \Sigma V^T$, with $U \in \mathbb{R}^{|\omega_k| \times |\omega_k|}$, $\Sigma \in \mathbb{R}^{|\omega_k| \times d}$ and $V \in \mathbb{R}^{d \times d}$. We assume $\Sigma$ is a diagonal matrix with non-increasing diagonal elements.
  Update $D_k$ to be the first column of $V$.
  Update the nonzero elements of $Z_k$ to be the first column of $U$ multiplied by $\Sigma_{11}$, the largest singular value of $E_k^R$.
end function
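A numpy sketch of this single-atom update (a rank-one refit via the SVD); it mutates D and Z in place and is only meant to illustrate the computation:

import numpy as np

def update_codebook_single(Z, X, D, k):
    """Update atom D[k] and the nonzero coefficients Z[:, k] as in Algorithm 5."""
    omega = np.flatnonzero(Z[:, k])               # samples whose representation uses atom k
    if omega.size == 0:
        return                                     # atom currently unused; nothing to update
    # Residual of the reconstruction with atom k's contribution removed.
    E_k = X - Z @ D + np.outer(Z[:, k], D[k])
    E_kR = E_k[omega]                              # keep only the rows indexed by omega
    U, sigma, Vt = np.linalg.svd(E_kR, full_matrices=False)
    D[k] = Vt[0]                                   # first right singular vector (unit norm)
    Z[omega, k] = sigma[0] * U[:, 0]               # first left singular vector times sigma_1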
Our goal in this problem is to establish a relationship between K-SVD and the K-means algorithm you already know, to show that this K-SVD algorithm converges to a local minimum, and to verify the usefulness of K-SVD on a toy data set.
(a) (Relationship to K-means) In class, we've seen K-means. Recall that given a data matrix $X$, K-means divides the data into $K$ partitions $\Omega = \{\Omega_1, \Omega_2, \ldots, \Omega_K\}$. Each sample $x_i$ is contained in one and only one partition $\Omega_j$. K-means aims to solve the following optimization problem:
$$\underset{\Omega,\, \mu \in \mathbb{R}^{K \times d}}{\text{minimize}} \;\; \sum_{k=1}^{K} \sum_{i \in \Omega_k} \|x_i - \mu_k\|^2 \tag{2}$$

Here $\mu$ is a $K \times d$ matrix whose $k$th row is $\mu_k^T$. Show that the above objective function is equivalent to

$$\underset{\mu \in \mathbb{R}^{K \times d},\, Y \in \mathbb{R}^{N \times K}}{\text{minimize}} \;\; \|X - Y\mu\|_F^2 \tag{3}$$

subject to $y_i \in \{0, 1\}^K$ and $\|y_i\|_0 = 1$ for all $i$. In the above, $Y \in \mathbb{R}^{N \times K}$ with the $i$th row being $y_i^T$.

Thus, K-means is equivalent to K-SVD with $s = 1$ with the additional constraint of forcing $Z$ to only contain the values 0 or 1.
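As a quick numerical illustration of the claimed equivalence (not a proof): if memberships are encoded as one-hot rows of Y, the Frobenius objective in (3) matches the K-means objective in (2) for any partition and any centroids.

import numpy as np

rng = np.random.default_rng(0)
N, d, K = 12, 3, 4
X = rng.normal(size=(N, d))
mu = rng.normal(size=(K, d))                      # arbitrary centroids
labels = rng.integers(0, K, size=N)               # arbitrary assignment of samples to partitions

Y = np.zeros((N, K))
Y[np.arange(N), labels] = 1.0                     # one-hot membership rows

kmeans_obj = sum(np.linalg.norm(X[i] - mu[labels[i]]) ** 2 for i in range(N))
frobenius_obj = np.linalg.norm(X - Y @ mu, 'fro') ** 2
assert np.isclose(kmeans_obj, frobenius_obj)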
(b) Show that in the greedy Algorithm 3 for finding a sparse solution, each step $l = 1, 2, \ldots, s$ cannot increase the residue error of linear regression.
(c) (Sparse coding decreases the objective) Conclude from the above part that Algorithm 2 cannot increase the value of the objective function (1). (In practice, people use straight matching pursuit, Algorithm 6, to replace Algorithm 3, which is much more efficient but gets comparable performance in most practical settings where there isn't a huge dynamic range of coefficients and the target sparsity $s$ is low. But showing that matching pursuit usually works is omitted here.)
Algorithm 6 Matching Pursuit
Require: Dictionary $D \in \mathbb{R}^{K \times d}$; data vector $x \in \mathbb{R}^d$; sparsity constraint $s$.
Ensure: A coefficient vector $z \in \mathbb{R}^K$.
function SPARSE-CODING-SINGLE($D, x, s$)
  Initialize $z = (0, 0, \ldots, 0) \in \mathbb{R}^K$.
  Initialize residue $R_1 = x$.
  for $l = 1, \ldots, s$ do
    Let $D_k$ be the $k$th row of $D$.
    Find $k_l = \operatorname*{argmax}_k |D_k^T R_l|$.
    Set $z_{k_l} = D_{k_l}^T R_l / \|D_{k_l}\|^2$.
    Update residue $R_{l+1} = R_l - z_{k_l} D_{k_l}$.
  end for
  Return $z$.
end function
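A minimal numpy sketch of matching pursuit (it accumulates into z[k] in case the same atom is selected more than once):

import numpy as np

def matching_pursuit(D, x, s):
    """Matching pursuit: repeatedly pick the atom most correlated with the current residue."""
    K, d = D.shape
    z = np.zeros(K)
    R = x.astype(float).copy()
    for _ in range(s):
        corr = D @ R                               # inner products D_k^T R_l for every atom
        k_l = int(np.argmax(np.abs(corr)))
        c = corr[k_l] / np.linalg.norm(D[k_l]) ** 2
        z[k_l] += c                                # accumulate in case atom k_l is re-selected
        R = R - c * D[k_l]                         # update the residue
    return z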
- (d) Show that the single-atom dictionary update given by Algorithm 5 does not increase the objective function while preserving the sparsity constraint. (You may or may not find the following hint useful.)
Hint: Recall the Eckart-Young theorem.
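For reference, one standard statement of the Eckart-Young theorem: if $E = \sum_i \sigma_i u_i v_i^T$ is the SVD of $E$ with $\sigma_1 \ge \sigma_2 \ge \cdots$, then the best rank-one approximation of $E$ in Frobenius norm is $\sigma_1 u_1 v_1^T$, i.e.,
$$\min_{\operatorname{rank}(B) \le 1} \|E - B\|_F = \|E - \sigma_1 u_1 v_1^T\|_F.$$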
- (e) (Updating codebook decreases the objective) Conclude that the iterations of the K-SVD algorithm put together cannot increase the objective function and hence the objective function must converge.
- (f) (Implementation with test data set) Implement K-SVD using numpy, scipy and sklearn. In particular, use the provided function sparse_coding(D, X, s) in the starter code for Sparse-Coding. Set the stopping criterion to be: stop if the difference between the old and new errors is smaller than a threshold ($1 \times 10^{-1}$).
Then generate the following data set for testing your code:
- Generate a zero-one matrix $Z \in \mathbb{R}^{N \times K}$. Each row of $Z$ has $s$ randomly-chosen entries equal to one, with the rest of the entries being zero.
- Generate a dictionary $D \in \mathbb{R}^{K \times d}$ with entry-wise independent Gaussians $N(0, 1)$.
- Generate $X = ZD$.
For testing, we set N = 1000, d = 10, K = 10 and s = 2. Use the true K and s for testing your K-SVD code.
Remember to run the K-SVD algorithm with differently randomized initial conditions since only a local minimum is found and you want to find a good global minimum.
Report the reconstruction MSE, $\frac{1}{N}\|X - \hat{X}\|_F^2$, where $\hat{X} = \hat{Z}\hat{D}$, with $\hat{Z}$ and $\hat{D}$ found by K-SVD.
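One possible way to generate this test data (a sketch; variable names are illustrative, and a k_svd driver like the one sketched earlier is assumed):

import numpy as np

rng = np.random.default_rng(0)
N, d, K, s = 1000, 10, 10, 2

# Zero-one coefficient matrix: s randomly chosen ones per row.
Z_true = np.zeros((N, K))
for i in range(N):
    Z_true[i, rng.choice(K, size=s, replace=False)] = 1.0

# Dictionary with i.i.d. standard Gaussian entries; each row is an atom.
D_true = rng.normal(size=(K, d))

X = Z_true @ D_true

# D_hat, Z_hat = k_svd(X, K, s, ...)   # run several random initializations, keep the best
# mse = np.linalg.norm(X - Z_hat @ D_hat, 'fro') ** 2 / N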
- (g) (Frequency data) We use the following procedures to generate a toy data set:
Generate a dictionary $D \in \mathbb{R}^{K \times d}$ by letting
$$D_{kj} = \sin\!\Big(\frac{2\pi f(k)\, j}{K}\Big)$$
for $j = 1, 2, \ldots, d$ and $k = 1, 2, \ldots, K$. Thus each atom in the dictionary is
$$D_k = \Big( \sin\!\Big(\frac{2\pi f(k)}{K}\Big),\ \sin\!\Big(\frac{2\pi f(k) \cdot 2}{K}\Big),\ \ldots,\ \sin\!\Big(\frac{2\pi f(k) \cdot d}{K}\Big) \Big)^T,$$
followed by normalization of each atom. Here we choose $f(k) = k - 1$ for $k = 1, 2, \ldots, K$.
Randomly generate the coefficients $Z \in \mathbb{R}^{N \times K}$ by generating each row $z_i \in \mathbb{R}^K$ independently with the following scheme: randomly choose $s$ entries without replacement from the $K$ entries of $z_i$ and set them to be 1. Set the rest of the $K - s$ entries to be zero.
Construct the data matrix $X$ by $X = ZD + cE$, where each entry of $E \in \mathbb{R}^{N \times d}$ is generated independently from $N(0, 1)$.
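A sketch of this construction under the reconstruction $D_{kj} = \sin(2\pi f(k) j / K)$ above (note that $f(1) = 0$ makes the first atom identically zero, so its normalization is guarded):

import numpy as np

rng = np.random.default_rng(0)
N, s, K, d, c = 200, 3, 5, 20, 0.0

j_idx = np.arange(1, d + 1)                            # entry indices j = 1, ..., d
f = np.arange(K)                                       # f(k) = k - 1 for k = 1, ..., K
D_true = np.sin(2 * np.pi * np.outer(f, j_idx) / K)    # K x d, row k is atom k
norms = np.linalg.norm(D_true, axis=1, keepdims=True)
D_true = D_true / np.where(norms == 0, 1.0, norms)     # normalize each atom (guard the zero atom)

# Coefficients: s ones per row, chosen without replacement.
Z_true = np.zeros((N, K))
for i in range(N):
    Z_true[i, rng.choice(K, size=s, replace=False)] = 1.0

X = Z_true @ D_true + c * rng.normal(size=(N, d))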
In the first experiment, we generate a data set with $N = 200$, $s = 3$, $K = 5$, $d = 20$, $c = 0$. Apply K-SVD($X, K, s$) with $s = 1, 2, \ldots, 5$ to find the estimated dictionary $\hat{D}$ and the estimated coefficient matrix $\hat{Z}$, and plot the reconstruction MSE $\frac{1}{N}\|X - \hat{X}\|_F^2$ against $s$, with $\hat{X} = \hat{Z}\hat{D}$. (You don't need to include the values of $\hat{Z}, \hat{D}$ in your submission.) Don't forget to try different random initializations.
- (h) In the second experiment, we generate a data set with $N = 200$, $s = 2$, $K = 20$, $d = 5$, $c = 0$. Apply K-SVD($X, K, s$) with $s = 1, 2, \ldots, 5$ and plot the reconstruction MSE $\frac{1}{N}\|X - \hat{X}\|_F^2$ against $s$. (You don't need to include the values of $\hat{Z}, \hat{D}$ in your submission.)
- (i) In the third experiment, we generate a data set with $N = 200$, $s = 3$, $K = 20$, $d = 5$, $c = 0.001 \times 10^{-1}$. Apply K-SVD($X, K, s$) with $s = 1, 2, \ldots, 5$ to find the estimated dictionary $\hat{D}$ and
the estimated coefficient matrix $\hat{Z}$, and plot the reconstruction MSE $\frac{1}{N}\|X - \hat{X}\|_F^2$ against $s$, with $\hat{X} = \hat{Z}\hat{D}$. (You don't need to include the values of $\hat{Z}, \hat{D}$ in your submission.)
- (j) In the fourth experiment, we generate a data set with $N = 200$, $s = 2$, $K = 200$, $d = 100$, $c = 0.001$. Apply K-SVD($X, K, s$) and get the dictionary $\hat{D}$. Plot any three rows (atoms) of the dictionary $\hat{D}$ on the same coordinate system, with the x-axis ranging over $1, 2, \ldots, d = 100$ (the index of the entry) and the y-axis being the value of the corresponding entry. If you are on the right track, you will see approximately sine curves.
3 GANs (Optional)
In this problem, we will explore Generative Adversarial Networks (GANs) for learning Generative Models. Generative models try to learn the probability distribution over the state x (i.e. p(x)). The advantage of this is that now new data points can be sampled from the generative model. For instance, if a generative model was trained on the distribution of Picasso paintings and then sampled from, it should produce a new unseen Picasso-like painting.
Not surprisingly, generative models have been notoriously difficult to train because it is not entirely clear what loss they should be optimizing. For instance, is minimizing the squared Euclidean distance over pixel values appropriate for paintings? A recent result by Goodfellow et al. proposed an alternative to specifying a loss function, known as a GAN. The idea behind a GAN comes from the following intuition.
Pretend there exists an art forger who is attempting to fake a Picasso. The art forger successfully forges the painting if they trick the Art Gallery into believing it is real. The Art Gallery is actively trying to detect if the painting is real or fake, and thus they are competing against each other. The hope with GANs is that by creating this game, the art forger will learn to make better paintings to trick the Art Gallery. Thus, we do not need to ever explicitly specify a loss function of what a Picasso painting should look like, because this loss function is going to implicitly be learned by the Art Gallery as it compares true Picassos to known fakes.
The problem can be run on a MacBook Pro in 10 minutes and should not require EC2. Please do not import any new libraries.
- (a) We begin this problem by examining the MNIST dataset. Recall from previous homework that MNIST is a large dataset of handwritten digits ranging from 0-9, with each image being of size 28 by 28. Considering each image as a data point $x$, we want to learn the distribution over possible images from the dataset, $p(x)$. One classical way¹ to do this is known as Kernel Density Estimation (KDE). KDE can be written as follows:

$$\hat{p}(x) = \frac{1}{Z} \sum_{i=0}^{N-1} k(x_i, x),$$

where $Z$ is chosen to normalize this to be a proper probability density function. Implicitly, the estimated density is determined by measuring how close the point $x$ is to the training data in some kernel space. In this problem, we will consider the Gaussian Kernel (Radial Basis Function, or RBF) for the kernel. This can be written as

$$k(x, x_i) = \exp\!\Big(-\frac{\|x - x_i\|^2}{2\sigma^2}\Big),$$

where $\sigma$ is the bandwidth of the kernel. The bandwidth measures how much weighting each training point has in the KDE. What this is essentially saying is that the learned distribution consists of noisy versions of the training data, where $\sigma$ controls that noise. Implement the class kde.py and run train_on_mnist.py to sample MNIST digits from the learned distribution. Report the resulting image. Note that in the code we will project the data down to a lower dimension with PCA to reduce computation.
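For intuition, sampling from a Gaussian KDE amounts to picking a training point uniformly at random and adding Gaussian noise at the bandwidth's scale. A rough sketch (assuming the $2\sigma^2$ parameterization written above; the interface of the actual kde.py skeleton may differ):

import numpy as np

def kde_sample(X_train, sigma, n_samples, seed=0):
    """Draw samples from a Gaussian KDE fit to the rows of X_train."""
    rng = np.random.default_rng(seed)
    N, d = X_train.shape
    idx = rng.integers(0, N, size=n_samples)            # choose a training point per sample
    noise = rng.normal(scale=sigma, size=(n_samples, d))
    return X_train[idx] + noise                          # "noisy versions of the training data"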
- (b) The above KDE approach typically produces images that are quite blurry. This is because the distance measure we are using to measure similarity interpolates between data points. In the RBF kernel, the distance measure is the squared L2 distance. While this is a common distance measure, it may not be appropriate for measuring how similar images are when applied at the pixel level.

Instead of finding a more appropriate distance measure, we can train a Generator to simply try and trick a Discriminator. Please implement the discriminator, D, and the generator, G, in gan.py. They should both have the same high-level architecture:

(a) Layer 1: Fully Connected Layer
(b) Non-Linear Response: Rectified Linear Units
(c) Layer 2: Fully Connected Layer
(d) Non-Linear Response: Sigmoidal Layer
¹ There are plenty of others. For example, we can do something like QDA married to K-means-style ideas, where cluster centers are learned together with local covariances in an iterative fashion.
The dimensions of each layer are provided in the skeleton code. The discriminator will map an image to a probability score of whether it is real or not. The generator will map random noise in a latent space to a generated image.
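As a sketch of the described two-layer architecture (written in PyTorch purely for illustration; the provided gan.py skeleton may use a different framework, and the layer widths below are placeholders rather than the dimensions given in the skeleton):

import torch.nn as nn

latent_dim, hidden_dim, image_dim = 64, 128, 28 * 28    # placeholder sizes

# Discriminator: image -> probability that the image is real.
D = nn.Sequential(
    nn.Linear(image_dim, hidden_dim),   # Layer 1: fully connected
    nn.ReLU(),                          # non-linear response: ReLU
    nn.Linear(hidden_dim, 1),           # Layer 2: fully connected
    nn.Sigmoid(),                       # non-linear response: sigmoid, output in (0, 1)
)

# Generator: latent noise -> generated image, same high-level structure.
G = nn.Sequential(
    nn.Linear(latent_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, image_dim),
    nn.Sigmoid(),                       # squash pixel values into (0, 1)
)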
(c) We are now ready to train our generator and discriminator. The objective function for the adversarial game can be written as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The objective states that we want our discriminator to correctly classify the real data sampled from $p_{\text{data}}(x)$ versus the fake data sampled from $p_z(z)$. However, we want the generator to try and fool the discriminator, thus forming an adversarial optimization.
In the original GAN paper, they proposed the following optimization to solve for this objec- tive.
Implement the rest of gan.py and report the resulting image after training the GAN.
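The original paper trains by alternating a discriminator update on a mixed real/generated batch with a generator update that tries to fool the discriminator. A hedged PyTorch sketch continuing the module definitions above (real_images is a stand-in for a batch of real MNIST images from the starter code's data pipeline; the skeleton's actual loop and hyperparameters may differ):

import torch

real_images = torch.rand(64, image_dim)            # placeholder batch of "real" images in [0, 1]

bce = torch.nn.BCELoss()
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)

for step in range(1000):
    batch = real_images.shape[0]
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()                           # do not backpropagate into G on this step
    d_loss = bce(D(real_images), ones) + bce(D(fake), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: push D(G(z)) toward 1 (the usual non-saturating generator loss).
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()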
(d) GANs yield a visible improvement in terms of matching the MNIST distribution. However, it is important to understand that the game-theoretic formulation is not guaranteed to be stable and can lead to undesirable results. For example, one potential outcome in the GAN formulation is a phenomenon known as mode collapse, in which the generator decides to only play a single value from the distribution. The discriminator is unable to distinguish this from the real data; however, the true distribution is not captured by the generator.
We can see this by training our GAN on a 1D Gaussian distribution. Run train_on_gaussian.py and compare KDE to GANs for predicting a 1D Gaussian. Report your findings.