[Solved] CS229 Problem #3-Theory & Unsupervised learning

$25

File Name: CS229_Problem_#3-Theory_&_Unsupervised_learning.zip
File Size: 442.74 KB

SKU: [Solved] CS229 Problem #3-Theory & Unsupervised learning Category: Tag:
5/5 - (1 vote)

1. Uniform convergence

You are hired by CNN to help design the sampling procedure for making their electoral predictions for the next presidential election in the (fictitious) country of Elbania.

The country of Elbania is organized into states, and there are only two candidates running in this election: One from the Elbanian Democratic party, and another from the Labor Party of Elbania. The plan for making our electorial predictions is as follows: Well sample m voters from each state, and ask whether theyre voting democrat. Well then publish, for each state, the estimated fraction of democrat voters. In this problem, well work out how many voters we need to sample in order to ensure that we get good predictions with high probability.

One reasonable goal might be to set m large enough that, with high probability, we obtain uniformly accurate estimates of the fraction of democrat voters in every state. But this might require surveying very many people, which would be prohibitively expensive. So, were instead going to demand only a slightly lower degree of accuracy.

Specifically, well say that our prediction for a state is highly inaccurate if the estimated fraction of democrat voters differs from the actual fraction of democrat voters within that state by more than a tolerance factor . CNN knows that their viewers will tolerate some small number of states estimates being highly inaccurate; however, their credibility would be damaged if they reported highly inaccurate estimates for too many states. So, rather than trying to ensure that all states estimates are within of the true values (which would correspond to no states estimate being highly inaccurate), we will instead try only to ensure that the number of states with highly inaccurate estimates is small.

To formalize the problem, let there be n states, and let m voters be drawn IID from each state. Let the actual fraction of voters in state i that voted democrat be i. Also let Xij (1 i n,1 j m) be a binary random variable indicating whether the j-th randomly chosen voter from state i voted democrat:

1 if the jth example from the ith state voted democrat

ij =

0 otherwise

We assume that the voters correctly disclose their vote during the survey. Thus, for each value of i, we have that Xij are drawn IID from a Bernoulli(i) distribution. Moreover, the Xijs (for all i,j) are all mutually independent.

After the survey, the fraction of democrat votes in state i is estimated as:

Also, let Zi = 1{|i i| > } be a binary random variable that indicates whether the prediction in state i was highly inaccurate.

  • Let i be the probability that Zi = 1. Using the Hoeffding inequality, find an upper bound on i.
  • In this part, we prove a general result which will be useful for this problem. Let Vi and Wi (1 i k) be Bernoulli random variables, and suppose

E[Vi] = P(Vi = 1) P(Wi = 1) = E[Wi] i {1,2,k}

Let the Vis be mutually independent, and similarly let the Wis also be mutually independent. Prove that, for any value of t, the following holds: !

[Hint: One way to do this is via induction on k. If you use a proof by induction, for the base case (k = 1), you must show that the inequality holds for t < 0, 0 t < 1, and t 1.]

  • The fraction of states on which our predictions are highly inaccurate is given by

. Prove a reasonable closed form upper bound on the probability P(Z > ) of being highly inaccurate on more than a fraction of the states.

[Note: There are many possible answers, but to be considered reasonable, your bound must decrease to zero as m (for fixed n and > 0). Also, your bound should either remain constant or decrease as n (for fixed m and > 0). It is also fine

if, for some values of , m and n, your bound just tells us that P(Z > ) 1 (the trivial bound).]

2.More VC dimension

Let the domain of the inputs for a learning problem be X = R. Consider using hypotheses of the following form:

h(x) = 1{0 + 1x + 2x2 + + dxd 0},

and let H = {h : Rd+[1]} be the corresponding hypothesis class. What is the VC dimension of H? Justify your answer.

[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When doing this problem, you should not assume any other non-trivial result (such as that the VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved in class.]

3. LOOCV and SVM

  • Linear Case. Consider training an SVM using a linear Kernel K(x,z) = xTz on a training set {(x(i),y(i)) : i = 1,,m} that is linearly separable, and suppose we do not use 1 Let |SV | be the number of support vectors obtained when training on the entire training set. (Recall x(i) is a support vector if and only if i > 0.) Let LOOCV denote the leave one out cross validation error of our SVM. Prove that

LOOCV.

  • General Case. Consider a setting similar to in part (a), except that we now run an SVM using a general (Mercer) kernel. Assume that the data is linearly separable in the high dimensional feature space corresponding to the kernel. Does the bound in part (a) on LOOCV still hold? Justify your answer.

4. [12 points] MAP estimates and weight decay

Consider using a logistic regression model h(x) = g(Tx) where g is the sigmoid function, and let a training set {(x(i),y(i));i = 1,,m} be given as usual. The maximum likelihood estimate of the parameters is given by

m Y

ML = argmax p(y(i)|x(i);).

i=1

If we wanted to regularize logistic regression, then we might put a Bayesian prior on the parameters. Suppose we chose the prior N(0,2I) (here, > 0, and I is the n+1-byn + 1 identity matrix), and then found the MAP estimate of as:

m Y

MAP = argmaxp() p(y(i)|x(i),)

i=1

Prove that

||MAP||2 ||ML||2

[Hint: Consider using a proof by contradiction.]

Remark. For this reason, this form of regularization is sometimes also called weight decay, since it encourages the weights (meaning parameters) to take on generally smaller values.

5. KL divergence and Maximum Likelihood

The Kullback-Leibler (KL) divergence between two discrete-valued distributions P(X),Q(X) is defined as follows:1

K

For notational convenience, we assume P(x) > 0,x. (Otherwise, one standard thing to do is to adopt the convention that 0log0 = 0.) Sometimes, we also write the KL divergence as KL(P||Q) = KL(P(X)||Q(X)).

The KL divergence is an assymmetric measure of the distance between 2 probability distributions. In this problem we will prove some basic properties of KL divergence, and work out a relationship between minimizing KL divergence and the maximum likelihood estimation that were familiar with.

(a) Nonnegativity. Prove the following:

P,Q KL(PkQ) 0

and

KL(PkQ) = 0 if and only if P = Q.

[Hint: You may use the following result, called Jensens inequality. If f is a convex function, and X is a random variable, then E[f(X)] f(E[X]). Moreover, if f is strictly convex (f is convex if its Hessian satisfies H 0; it is strictly convex if H > 0; for instance f(x) = logx is strictly convex), then E[f(X)] = f(E[X]) implies that X = E[X] with probability 1; i.e., X is actually a constant.]

  • Chain rule for KL divergence. The KL divergence between 2 conditional distributions P(X|Y ),Q(X|Y ) is defined as follows: K!

This can be thought of as the expected KL divergence between the corresponding conditional distributions on x (that is, between P(X|Y = y) and Q(X|Y = y)), where the expectation is taken over the random y. Prove the following chain rule for KL divergence:

KL(P(X,Y )kQ(X,Y )) = KL(P(X)kQ(X)) + KL(P(Y |X)kQ(Y |X)).

  • KL and maximum likelihood.

Consider a density estimation problem, and suppose we are given a training set

{x(i);i = 1,,m}. Let the empirical distribution be.

(P is just the uniform distribution over the training set; i.e., sampling from the empirical distribution is the same as picking a random example from the training set.) Suppose we have some family of distributions P parameterized by . (If you like, think of P(x) as an alternative notation for P(x;).) Prove that finding the maximum likelihood estimate for the parameter is equivalent to finding P with minimal KL divergence from P. I.e. prove:

m

argminKL(PkP) = argmax logP(x(i)) X

i=1

Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive Bayes parameter estimation. In the Naive Bayes model we assumed P is of the following form: ). By the chain rule for KL divergence, we therefore have:

n

KL(PkP) = KL(P(y)kp(y)) + XKL(P(xi|y)kp(xi|y)).

i=1

This shows that finding the maximum likelihood/minimum KL-divergence estimate of the parameters decomposes into 2n + 1 independent optimization problems: One for the class priors p(y), and one for each of the conditional distributions p(xi|y) for each feature xi given each of the two possible labels for y. Specifically, finding the maximum likelihood estimates for each of these problems individually results in also maximizing the likelihood of the joint distribution. (If you know what Bayesian networks are, a similar remark applies to parameter estimation for them.)

6. K-means for compression

In this problem, we will apply the K-means algorithm to lossy image compression, by reducing the number of colors used in an image.

The directory /afs/ir.stanford.edu/class/cs229/ps/ps3/ contains a 512512 image of a mandrill represented in 24-bit color. This means that, for each of the 262144 pixels in the image, there are three 8-bit numbers (each ranging from 0 to 255) that represent the red, green, and blue intensity values for that pixel. The straightforward representation of this image therefore takes about 262144 3 = 786432 bytes (a byte being 8 bits). To compress the image, we will use K-means to reduce the image to k = 16 colors. More specifically, each pixel in the image is considered a point in the three-dimensional (r,g,b)space. To compress the image, we will cluster these points in color-space into 16 clusters, and replace each pixel with the closest cluster centroid.

Follow the instructions below. Be warned that some of these operations can take a while (several minutes even on a fast computer)![2]

  • Copy mandrill-large.tiff from /afs/ir.stanford.edu/class/cs229/ps/ps3 on the leland system. Start up MATLAB, and type A = double(imread(mandrill-large.tiff)); to read in the image. Now, A is a three dimensional matrix, and A(:,:,1), A(:,:,2) and A(:,:,3) are 512512 arrays that respectively contain the red, green, and blue values for each pixel. Enter imshow(uint8(round(A))); to display the image.
  • Since the large image has 262144 pixels and would take a while to cluster, we will instead run vector quantization on a smaller image. Repeat (a) with mandrill-small.tiff. Treating each pixels (r,g,b) values as an element of R3, run K-means3 with 16 clusters on the pixel data from this smaller image, iterating (preferably) to convergence, but in no case for less than 30 iterations. For initialization, set each cluster centroid to the (r,g,b)-values of a randomly chosen pixel in the image.
  • Take the matrix A from mandrill-large.tiff, and replace each pixels (r,g,b) values with the value of the closest cluster centroid. Display the new image, and compare it visually to the original image. Hand in all your code and a printout of your compressed image (printing on a black-and-white printer is fine).

6

  • If we represent the image with these reduced (16) colors, by (approximately) whatfactor have we compressed the image?

[1] If P and Q are densities for continuous-valued random variables, then the sum is replaced by an integral, and everything stated in this problem works fine as well. But for the sake of simplicity, in this problem well just work with this form of KL divergence for probability mass functions/discrete-valued distributions.

[2] In order to use the imread and imshow commands in octave, you have to install the Image package from octave-forge. This package and installation instructions are available at: http://octave.sourceforge.net 3Please implement K-means yourself, rather than using built-in functions from, e.g., MATLAB or octave.

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[Solved] CS229 Problem #3-Theory & Unsupervised learning
$25