1. Constructing kernels
In class, we saw that by choosing a kernel $K(x,z) = \phi(x)^T \phi(z)$, we can implicitly map data to a high-dimensional space, and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping $\phi$ to a higher dimensional space, and then work out the corresponding $K$.
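For example (a standard construction recalled here for concreteness, not part of the question): for $x, z \in \mathbb{R}^2$, the explicit mapping
$$\phi(x) = (x_1^2, \ \sqrt{2}\,x_1 x_2, \ x_2^2)^T$$
satisfies $\phi(x)^T \phi(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x^T z)^2$, so the corresponding kernel is $K(x,z) = (x^T z)^2$.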
However, in this question we are interested in the direct construction of kernels. I.e., suppose we have a function $K(x,z)$ that we think gives an appropriate similarity measure for our learning problem, and we are considering plugging $K$ into the SVM as the kernel function. However, for $K(x,z)$ to be a valid kernel, it must correspond to an inner product in some higher dimensional space resulting from some feature mapping $\phi$. Mercer's theorem tells us that $K(x,z)$ is a (Mercer) kernel if and only if for any finite set $\{x^{(1)}, \dots, x^{(m)}\}$, the matrix $K$ is symmetric and positive semidefinite, where the square matrix $K \in \mathbb{R}^{m \times m}$ is given by $K_{ij} = K(x^{(i)}, x^{(j)})$.
Now here comes the question: Let $K_1$, $K_2$ be kernels over $\mathbb{R}^n \times \mathbb{R}^n$, let $a \in \mathbb{R}^+$ be a positive real number, let $f : \mathbb{R}^n \to \mathbb{R}$ be a real-valued function, let $\phi : \mathbb{R}^n \to \mathbb{R}^d$ be a function mapping from $\mathbb{R}^n$ to $\mathbb{R}^d$, let $K_3$ be a kernel over $\mathbb{R}^d \times \mathbb{R}^d$, and let $p(x)$ be a polynomial over $x$ with positive coefficients.
For each of the functions $K$ below, state whether it is necessarily a kernel. If you think it is, prove it; if you think it isn't, give a counter-example.
(a) $K(x,z) = K_1(x,z) + K_2(x,z)$
(b) $K(x,z) = K_1(x,z) - K_2(x,z)$
(c) $K(x,z) = aK_1(x,z)$
(d) $K(x,z) = -aK_1(x,z)$
(e) $K(x,z) = K_1(x,z)K_2(x,z)$
(f) $K(x,z) = f(x)f(z)$
(g) $K(x,z) = K_3(\phi(x), \phi(z))$
(h) $K(x,z) = p(K_1(x,z))$
[Hint: For part (e), the answer is that the K there is indeed a kernel. You still have to prove it, though. (This one may be harder than the rest.) This result may also be useful for another part of the problem.]
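As an informal numerical aid (a sketch only, with our own variable names; it can refute a candidate $K$, but it can never prove one valid): Mercer's theorem says every Gram matrix of a valid kernel is symmetric positive semidefinite, so finding a single negative eigenvalue on random data already gives a counter-example.

    % Numerical probe of Mercer's condition (illustration, not a proof).
    m = 20; n = 5;
    X = randn(m, n);                 % rows are random points x^(1), ..., x^(m)
    K1 = @(x, z) (x' * z)^2;         % a known kernel (degree-2 polynomial)
    K  = @(x, z) -K1(x, z);          % candidate from part (d), with a = 1
    G = zeros(m, m);
    for i = 1:m
      for j = 1:m
        G(i, j) = K(X(i, :)', X(j, :)');   % Gram matrix G_ij = K(x^(i), x^(j))
      end
    end
    min(eig((G + G') / 2))           % a negative value => K is not a kernel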
2. Kernelizing the Perceptron
Let there be a binary classification problem with $y \in \{0, 1\}$. The perceptron uses hypotheses of the form $h_\theta(x) = g(\theta^T x)$, where $g(z) = \mathbf{1}\{z \ge 0\}$. In this problem we will consider a stochastic gradient descent-like implementation of the perceptron algorithm where each update to the parameters is made using only one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make one pass through the entire training set. The update rule for this version of the perceptron algorithm is given by
$$\theta^{(i+1)} := \theta^{(i)} + \alpha \big[ y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)}) \big] x^{(i+1)}$$
where $\theta^{(i)}$ is the value of the parameters after the algorithm has seen the first $i$ training examples. Prior to seeing any training examples, $\theta^{(0)}$ is initialized to $\vec{0}$.
Let $K$ be a Mercer kernel corresponding to some very high-dimensional feature mapping $\phi$. Suppose $\phi$ is so high-dimensional (say, $\infty$-dimensional) that it's infeasible to ever represent $\phi(x)$ explicitly. Describe how you would apply the kernel trick to the perceptron to make it work in the high-dimensional feature space $\phi$, but without ever explicitly computing $\phi(x)$. [Note: You don't have to worry about the intercept term. If you like, think of $\phi$ as having the property that $\phi_0(x) = 1$, so that this is taken care of.] Your description should specify
(a) How you will (implicitly) represent the high-dimensional parameter vector $\theta^{(i)}$, including how the initial value $\theta^{(0)} = \vec{0}$ is represented (note that $\theta^{(i)}$ is now a vector whose dimension is the same as the feature vectors $\phi(x)$);
(b) How you will efficiently make a prediction on a new input $x^{(i+1)}$. I.e., how you will compute $h_{\theta^{(i)}}(x^{(i+1)}) = g(\theta^{(i)T} \phi(x^{(i+1)}))$, using your representation of $\theta^{(i)}$; and
(c) How you will modify the update rule given above to perform an update to $\theta$ on a new training example $(x^{(i+1)}, y^{(i+1)})$; i.e., using the update rule corresponding to the feature mapping $\phi$:
$$\theta^{(i+1)} := \theta^{(i)} + \alpha \big[ y^{(i+1)} - h_{\theta^{(i)}}(\phi(x^{(i+1)})) \big] \phi(x^{(i+1)})$$
[Note: If you prefer, you are also welcome to do this problem using the convention of labels $y \in \{-1, 1\}$, and $g(z) = \mathrm{sign}(z) = 1$ if $z \ge 0$, $-1$ otherwise.]
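To fix ideas, here is a minimal sketch of how the kernel trick can look for a perceptron-style learner (the names Xtrain, ytrain, and the learning rate alpha are our own assumptions, and this is one possible representation, not a model answer): $\theta$ is never formed explicitly; it is maintained as a weighted sum of feature vectors of the examples seen so far, so inner products with $\phi(x)$ reduce to kernel evaluations.

    % Kernelized perceptron sketch. theta = sum_j beta(j) * phi(x^(j)), so
    % beta = [] encodes theta^(0) = 0, and theta' * phi(x) collapses to
    % sum_j beta(j) * K(x^(j), x).  Xtrain: n x m inputs; ytrain in {0,1}.
    K = @(x, z) (1 + x' * z)^3;      % any Mercer kernel stands in for phi
    g = @(z) double(z >= 0);
    alpha = 1;
    beta = [];                       % implicit representation of theta
    S = [];                          % inputs seen so far, one per column
    for i = 1:size(Xtrain, 2)
      x = Xtrain(:, i);
      s = 0;
      for j = 1:numel(beta)          % prediction: theta' * phi(x) via K only
        s = s + beta(j) * K(S(:, j), x);
      end
      beta(end + 1) = alpha * (ytrain(i) - g(s));  % update appends a coefficient
      S = [S, x];
    end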
3. Spam classification
In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classifier.
In recent years, spam on electronic newsgroups has been an increasing problem. Here, we'll build a classifier to distinguish between real newsgroup messages and spam messages. For this experiment, we obtained a set of spam emails and a set of genuine newsgroup messages.[1] Using only the subject line and body of each message, we'll learn to distinguish between the spam and non-spam.
All the files for the problem are in /afs/ir/class/cs229/ps/ps2/. Note: Please do not circulate this data outside this class. In order to get the text emails into a form usable by naive Bayes, we've already done some preprocessing on the messages. You can look at two sample spam emails in the files spam_sample_original*, and their preprocessed forms in the files spam_sample_preprocessed*. The first line in the preprocessed format is just the label and is not part of the message. The preprocessing ensures that only the message body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR) and numbers (NUMBER) were also replaced by these special tokens so they can be considered properly in the classification process. (In this problem, we'll call the features "tokens" rather than "words," since some of the features will correspond to special values like EMAILADDR. You don't have to worry about the distinction.) The files news_sample_original and news_sample_preprocessed also give an example of a non-spam mail.
The work to extract feature vectors out of the documents has also been done for you, so you can just load in the design matrices (called document-word matrices in text classification) containing all the data. In a document-word matrix, the ith row represents the ith document/email, and the jth column represents the jth distinct token. Thus, the (i,j)-entry of this matrix represents the number of occurrences of the jth token in the ith document.
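As a toy illustration (made-up numbers, and the variable name trainMatrix is our assumption; see the documentation in readMatrix.m for the actual names):

    % If the 7th email contains the 42nd token in the vocabulary three times:
    trainMatrix(7, 42)   % == 3: entries are occurrence counts, not indicators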
For this problem, we've chosen as our set of tokens considered (that is, as our vocabulary) only the medium-frequency tokens. The intuition is that tokens that occur too often or too rarely do not have much classification value. (Examples of tokens that occur very often are words like "the," "and," and "of," which occur in so many emails and are sufficiently content-free that they aren't worth modeling.) Also, words were stemmed using a standard stemming algorithm; basically, this means that "price," "prices" and "priced" have all been replaced with "price," so that they can be treated as the same word. For a list of the tokens used, see the file TOKENS_LIST.
Since the document-word matrix is extremely sparse (has lots of zero entries), we have stored it in our own efficient format to save space. You don't have to worry about this format.[2] The file readMatrix.m provides the readMatrix function that reads in the document-word matrix and the correct class labels for the various documents. Code in nb_train.m and nb_test.m shows how readMatrix should be called. The documentation at the top of these two files will tell you all you need to know about the setup.
(a) Implement a naive Bayes classifier for spam classification, using the multinomial event model and Laplace smoothing.
You should use the code outline provided in nb_train.m to train your parameters, and then use these parameters to classify the test set data by filling in the code in nb_test.m. You may assume that any parameters computed in nb_train.m are in memory when nb_test.m is executed, and do not need to be recomputed (i.e., that nb_test.m is executed immediately after nb_train.m).[3]
Train your parameters using the document-word matrix in MATRIX.TRAIN, and then report the test set error on MATRIX.TEST.
Remark. If you implement naive Bayes the straightforward way, you'll find that the computed $p(x|y) = \prod_i p(x_i|y)$ often equals zero. This is because $p(x|y)$, which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called underflow.) You'll have to find a way to compute naive Bayes' predicted class labels without explicitly representing very small numbers such as $p(x|y)$. [Hint: Think about using logarithms.]
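To illustrate the hint (a sketch under assumed names, not the provided nb_test.m skeleton: phi_spam(j) and phi_ham(j) stand for the fitted multinomial probabilities of token j in each class, and p_spam for the class prior):

    % Compare unnormalized log-posteriors instead of products of tiny numbers.
    % x is a 1 x V row vector of token counts for one test document.
    log_post_spam = log(p_spam)     + x * log(phi_spam');
    log_post_ham  = log(1 - p_spam) + x * log(phi_ham');
    prediction = double(log_post_spam > log_post_ham);   % 1 = predict SPAM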
(b) Intuitively, some tokens may be particularly indicative of an email being in a particular class. We can try to get an informal sense of how indicative token $i$ is for the SPAM class by looking at:
$$\log \frac{p(x_j = i \mid y = 1)}{p(x_j = i \mid y = 0)} = \log \left( \frac{P(\text{token } i \mid \text{email is SPAM})}{P(\text{token } i \mid \text{email is NOTSPAM})} \right).$$
Using the parameters fit in part (a), find the 5 tokens that are most indicative of the SPAM class (i.e., have the highest positive value on the measure above). The numbered list of tokens in the file TOKENS_LIST should be useful for identifying the words/tokens.
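With the same assumed names as in the sketch above, the ranking could be computed along these lines (illustration only, not a required output format):

    % Rank tokens by log(p(token | spam) / p(token | ham)) and take the top 5.
    [~, idx] = sort(log(phi_spam ./ phi_ham), 'descend');
    idx(1:5)   % indices into TOKENS_LIST of the 5 most SPAM-indicative tokens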
(c) Repeat part (a), but with training sets of size ranging from 50, 100, 200, ..., up to 1400, by using the files MATRIX.TRAIN.*. Plot the test error each time (use MATRIX.TEST as the test data) to obtain a learning curve (test set error vs. training set size). You may need to change the call to readMatrix in nb_train.m to read the correct file each time. Which training-set size gives the best test set error?
(d) Train an SVM on this dataset using the LIBLINEAR SVM library, available for download from http://www.csie.ntu.edu.tw/~cjlin/liblinear/. This implements an SVM using a linear kernel. Like the naive Bayes implementation, an outline for your code is provided in svm_train.m and svm_test.m.
See ps2/README.txt for instructions for downloading and installing LIBLINEAR. Similar to part (c), train an SVM with training set sizes 50, 100, 200, ..., 1400, by using the file MATRIX.TRAIN.50 and so on. Plot the test error each time, using MATRIX.TEST as the test data. Use the LIBLINEAR default options when training and testing. You don't need to try different parameter values.
Running LIBLINEAR in Matlab on Windows or Octave can be buggy, depending on which version of Windows you run. We recommend that you use Matlab on the corn machines (e.g., ssh to corn.stanford.edu). However, there are command line programs you can run (without using MATLAB) which are located in liblinear-1.7/windows for Windows and liblinear-1.7/ for Linux/Unix. If you do it this way, please include the commands that you run from the command line in your solution.
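For orientation, a call through LIBLINEAR's Matlab interface with default options looks roughly like this (variable names assumed; the svm_train.m/svm_test.m outlines may organize the calls differently):

    % train/predict are LIBLINEAR's Matlab wrappers; inputs must be sparse.
    model = train(ytrain, sparse(Xtrain), '');   % '' = default options
    yhat  = predict(ytest, sparse(Xtest), model);
    test_error = mean(yhat ~= ytest)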
(e) How do naive Bayes and support vector machines compare (in terms of generalization error) as a function of the training set size?
4. [20 points] Properties of VC dimension
In this problem, we investigate a few properties of the Vapnik-Chervonenkis dimension, mostly relating to how VC(H) increases as the set H increases. For each part of this problem, you should state whether the given statement is true, and justify your answer with either a formal proof or a counter-example.
(a) Let two hypothesis classes $H_1$ and $H_2$ satisfy $H_1 \subseteq H_2$. Prove or disprove: $VC(H_1) \le VC(H_2)$.
(b) Let $H_1 = H_2 \cup \{h_1, \dots, h_k\}$. (I.e., $H_1$ is the union of $H_2$ and some set of $k$ additional hypotheses.) Prove or disprove: $VC(H_1) \le VC(H_2) + k$. [Hint: You might want to start by considering the case of $k = 1$.]
(c) Let $H_1 = H_2 \cup H_3$. Prove or disprove: $VC(H_1) \le VC(H_2) + VC(H_3)$.
5. Training and testing on different distributions
In the discussion in class about learning theory, a key assumption was that we trained and tested our learning algorithms on the same distribution $D$. In this problem, we'll investigate one special case of training and testing on different distributions. Specifically, we will consider what happens when the training labels are noisy, but the test labels are not.
Consider a binary classification problem with labels $y \in \{0, 1\}$, and let $D$ be a distribution over $(x, y)$ that we'll think of as the original, "clean" or "uncorrupted" distribution. Define $D_\tau$ to be a "corrupted" distribution over $(x, y)$ which is the same as $D$, except that the labels $y$ have some probability $0 \le \tau < 0.5$ of being flipped. Thus, to sample from $D_\tau$, we would first sample $(x, y)$ from $D$, and then with probability $\tau$ (independently of the observed $x$ and $y$) replace $y$ with $1 - y$. Note that $D_0 = D$.
The distribution $D_\tau$ models a setting in which an unreliable human (or other source) is labeling your training data for you, and on each example he/she has a probability $\tau$ of mislabeling it. Even though our training data is corrupted, we are still interested in evaluating our hypotheses with respect to the original, uncorrupted distribution $D$. We define the generalization error with respect to $D_\tau$ to be
$$\varepsilon_\tau(h) = P_{(x,y) \sim D_\tau}[h(x) \ne y].$$
Note that $\varepsilon_0(h)$ is the generalization error with respect to the "clean" distribution; it is with respect to $\varepsilon_0$ that we wish to evaluate our hypotheses.
(a) For any hypothesis $h$, the quantity $\varepsilon_0(h)$ can be calculated as a function of $\varepsilon_\tau(h)$ and $\tau$. Write down a formula for $\varepsilon_0(h)$ in terms of $\varepsilon_\tau(h)$ and $\tau$, and justify your answer.
(b) Let $|H|$ be finite, and suppose our training set $S = \{(x^{(i)}, y^{(i)}); i = 1, \dots, m\}$ is obtained by drawing $m$ examples IID from the corrupted distribution $D_\tau$. Suppose we pick $h \in H$ using empirical risk minimization: $\hat{h} = \arg\min_{h \in H} \hat{\varepsilon}_S(h)$. Also, let $h^* = \arg\min_{h \in H} \varepsilon_0(h)$.
Let any $\delta, \gamma > 0$ be given. Prove that for
$$\varepsilon_0(\hat{h}) \le \varepsilon_0(h^*) + 2\gamma$$
to hold with probability $1 - \delta$, it suffices that
$$m \ge \frac{1}{2(1 - 2\tau)^2 \gamma^2} \log \frac{2|H|}{\delta}.$$
Remark. This result suggests that, roughly, $m$ examples that have been corrupted at noise level $\tau$ are worth about as much as $(1 - 2\tau)^2 m$ uncorrupted training examples. This is a useful rule-of-thumb to know if you ever need to decide whether/how much to pay for a more reliable source of training data. (If you've taken a class in information theory, you may also have heard that $(1 - H(\tau))m$ is a good estimate of the information in the $m$ corrupted examples, where $H(\tau) = -(\tau \log_2 \tau + (1 - \tau)\log_2(1 - \tau))$ is the binary entropy function. And indeed, the functions $(1 - 2\tau)^2$ and $1 - H(\tau)$ are quite close to each other.)
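For a concrete feel of the remark (numbers chosen purely for illustration): at noise level $\tau = 0.1$,
$$(1 - 2\tau)^2 = 0.64, \qquad 1 - H(0.1) \approx 1 - 0.469 = 0.531,$$
so by either estimate, 1000 corrupted examples are worth roughly 530-640 clean ones.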
(c) Comment briefly on what happens as $\tau$ approaches 0.5.
[1] Thanks to Christian Shelton for providing the spam email. The non-spam messages are from the 20 newsgroups data at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html .
[2] Unless you're not using Matlab/Octave, in which case feel free to ask us about it.
[3] Matlab note: If a .m file doesn't begin with a function declaration, the file is a script. Variables in a script are put into the global namespace, unlike with functions.