k(x, x′) = { 1, if x = x′; 0, otherwise },  where x, x′ ∈ R^D.  (1)
Q1.1 Prove that this is a valid kernel. You can apply Mercer's theorem mentioned in the lectures (lec10.pdf) and assume that the N points {x1, ..., xN} you pick are distinct (i.e., xi ≠ xj if i ≠ j).
What to submit: No more than 10 lines of proof.
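As a quick numerical sanity check (separate from the requested proof), the short sketch below builds the Gram matrix of this kernel for a few distinct points and inspects its eigenvalues; for distinct points the matrix is simply the identity. The helper name delta_kernel and the sample points are our own choices.

import numpy as np

def delta_kernel(x, xp):
    """k(x, x') = 1 if x == x', else 0, as in eq. (1)."""
    return 1.0 if np.array_equal(x, xp) else 0.0

# A few distinct points in R^D (D = 2 here).
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
N = X.shape[0]

K = np.array([[delta_kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
print(K)                          # the identity matrix, since the points are distinct
print(np.linalg.eigvalsh(K))      # all eigenvalues equal 1 >= 0, so K is PSD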
Q1.2 Suppose now you are given a training set for a regression problem, where xi ≠ xj if i ≠ j. Show that, by using this kernel, training a kernel ridge regressor (lec10.pdf) with λ = 0 will always lead to a training objective of 0, meaning that all the training examples are predicted accurately by the learned regressor. The training objective of kernel ridge regression is as follows:
J(α) = ||y − Kα||^2 + λ α^T K α,  (2)
where K ∈ R^{N×N} is the kernel matrix, y = [y1, y2, ..., yN]^T, and α ∈ R^N. The learned regressor is
f(x) = [k(x, x1), k(x, x2), ..., k(x, xN)] α*,  (3)
where α* = argmin_α J(α).
What to submit: No more than 5 lines of derivations, plus 1 line to show what α* is.
Q1.3 Although the learned regressor can accurately predict the value of each training example, it does not generalize to the test data. Specifically, show that for any x with x ≠ xn, ∀ n = 1, 2, ..., N, the predicted value is always 0.
What to submit: No more than 5 lines of derivations.
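If you would like to sanity-check your Q1.2/Q1.3 derivations numerically, a minimal sketch follows; it assumes distinct training points so that K from eq. (1) is the identity, and all variable names are our own. This is an illustration, not the requested derivation.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # 5 distinct training points in R^3
y = rng.normal(size=5)

# With the kernel in eq. (1) and distinct points, K is the identity matrix.
K = np.eye(5)
alpha = np.linalg.solve(K, y)     # minimizer of eq. (2) with lambda = 0; here alpha = y

def predict(x_new):
    k_vec = np.array([1.0 if np.array_equal(x_new, xn) else 0.0 for xn in X])
    return k_vec @ alpha

print([np.isclose(predict(xn), yn) for xn, yn in zip(X, y)])  # all True on training points
print(predict(rng.normal(size=3)))                            # 0.0 for an unseen point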
2 Support Vector Machines
[Recommended maximum time spent: 2 hours]
Consider the dataset consisting of points (x, y), where x is a real value and y ∈ {+1, −1} is the class label. Let's start with three points: (x1, y1) = (−1, −1), (x2, y2) = (1, −1), (x3, y3) = (0, 1).
Figure 1: Three data points considered in Q2
Q2.1 Can the three points shown in Figure 1, in their current one-dimensional feature space, be perfectly separated with a linear separator? Why or why not?
What to submit: No more than 3 lines of reasoning.
Q2.2 Now we define a simple feature mapping φ(x) = [x, x^2]^T to transform the three points from the one-dimensional to a two-dimensional feature space. Plot the transformed points in the new two-dimensional feature space (use any package you prefer for the plot, e.g., Matplotlib, PowerPoint). Is there a linear decision boundary that can separate the points in this new feature space? Why or why not?
What to submit: The plot and no more than 3 lines of reasoning.
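A minimal Matplotlib sketch for the requested plot is below; it hard-codes the three points and labels as written above, so adjust the values if your reading of Figure 1 differs.

import matplotlib.pyplot as plt

# The three 1-D points and their labels, as listed in the problem statement.
points = [(-1.0, -1), (1.0, -1), (0.0, +1)]   # (x, y) pairs

for x, y in points:
    marker, color = ('o', 'tab:blue') if y == +1 else ('x', 'tab:red')
    plt.scatter(x, x ** 2, marker=marker, color=color, s=80)

plt.xlabel('x')
plt.ylabel('x^2')
plt.title('Transformed points under phi(x) = [x, x^2]')
plt.show()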
Q2.3 Given the feature mapping φ(x) = [x, x^2]^T, write down the kernel function k(x, x′). Moreover, write down the 3 × 3 kernel (or Gram) matrix K based on k(xi, xj) of the three data points. Verify that K is a positive semi-definite (PSD) matrix. You may want to show this by the definition of a PSD matrix: a symmetric N × N real matrix M is said to be positive semi-definite if the scalar z^T M z is non-negative for every non-zero column vector z of N real numbers.
What to submit: The form of kernel function k, the kernel matrix K, and no more than 10 lines of proof to show that K is PSD.
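If you want to double-check your hand-computed Gram matrix, the short sketch below (our own, using the points as written above) builds K from k(x, x′) = φ(x)^T φ(x′) and inspects its eigenvalues; non-negative eigenvalues of a symmetric matrix are equivalent to the z^T K z ≥ 0 definition.

import numpy as np

def k(x, xp):
    """Kernel induced by phi(x) = [x, x^2]^T."""
    return x * xp + (x ** 2) * (xp ** 2)

xs = [-1.0, 0.0, 1.0]   # the three 1-D inputs (ordering does not affect PSD-ness)
K = np.array([[k(a, b) for b in xs] for a in xs])

print(K)
print(np.linalg.eigvalsh(K))   # all eigenvalues >= 0  =>  K is PSD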
Q2.4 Now you are going to learn a hard-margin SVM classifier on this dataset with k(xi, xj). Recall the primal and dual formulations you learned in lecture 11 (lec11.pdf). Write down the primal and dual formulations of this problem.
What to submit: Primal and dual formulations. Each with no more than 5 lines.
Q2.5 Next, you are going to solve this problem using its dual formulation. Recall the toy example we walked through in lecture 12 (lec12.pdf). You may want to exploit a similar symmetry property to simplify the dual formulation.
What to submit: Solve the dual formulation and report the vector α ∈ R^3, the weight vector w ∈ R^2, and the bias b ∈ R. No more than 10 lines of derivation.
Q2.6 Let ŷ = w^T φ(x) + b, where w and b are the weights and the bias you got from the previous question. Draw the decision boundary in the new two-dimensional feature space and circle all support vectors. (Set ŷ to 0 to get the decision boundary.) Then, draw the decision boundary in the original one-dimensional setting.
What to submit: Two plots: one in the new two-dimensional feature space, and the other in the original one-dimensional feature space.
3 Adaboost for building up a nonlinear classifier
[Recommended maximum time spent: 2 hours]
In the lectures (lec13.pdf), you learned an algorithm, Adaboost, which constructs a strong (binary) classifier by iteratively adding one weak (binary) classifier at a time. A weak classifier is learned to maximize the weighted training accuracy at the corresponding iteration, and the weight of each training example is updated after every iteration. Algorithm 1 summarizes the Adaboost algorithm.
Given
- H: a set of functions, where h ∈ H takes a D-dimensional vector as input and outputs either +1 or −1.
- A training set {(xn, yn)}, n = 1, 2, ..., N, with yn ∈ {+1, −1}.
Goal Learn F(x) = sign[Σt βt ft(x)], where ft ∈ H and βt ∈ R.
Initialization w1(n) = 1/N, ∀ n ∈ {1, 2, ..., N}.
for t = 1, 2, ..., T do
  (a) find ft = argmin_{h∈H} Σn wt(n) I[yn ≠ h(xn)]
  (b) compute εt = Σn wt(n) I[yn ≠ ft(xn)]
  (c) compute βt = (1/2) ln((1 − εt)/εt)
  (d) compute wt+1(n) = { wt(n) exp(−βt), if yn = ft(xn); wt(n) exp(βt), otherwise }, ∀ n ∈ {1, 2, ..., N}
  (e) normalize wt+1(n) ← wt+1(n) / Σ_{n′} wt+1(n′)
end
Algorithm 1: The Adaboost algorithm
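For concreteness, a compact Python sketch of the loop in Algorithm 1 with decision stumps as H is given below. It is an illustration only: the stump encoding (s, b, d), the clamping of εt away from 0 and 1, and all function names are our own choices, and labels are assumed to be in {+1, −1}.

import numpy as np

def stump_predict(s, b, d, X):
    """h_(s,b,d)(x) from eq. (4): predict s where x_d > b, and -s otherwise (d is 1-indexed)."""
    return np.where(X[:, d - 1] > b, s, -s)

def adaboost(X, y, stumps, T):
    """Run Algorithm 1 for T rounds over a finite stump set; returns a list of (beta_t, stump_t)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                               # initialization: w_1(n) = 1/N
    model = []
    for _ in range(T):
        # step (a): pick the stump minimizing the weighted training error
        errs = [w @ (stump_predict(s, b, d, X) != y) for (s, b, d) in stumps]
        s, b, d = stumps[int(np.argmin(errs))]
        # steps (b), (c): weighted error and classifier weight (clamped to avoid log(0))
        eps = float(np.clip(min(errs), 1e-12, 1 - 1e-12))
        beta = 0.5 * np.log((1 - eps) / eps)
        # steps (d), (e): re-weight the examples and normalize
        w = w * np.exp(-beta * y * stump_predict(s, b, d, X))
        w = w / w.sum()
        model.append((beta, (s, b, d)))
    return model

def F(X, model):
    """F(x) = sign( sum_t beta_t * h_t(x) ), applied row-wise to X."""
    return np.sign(sum(beta * stump_predict(s, b, d, X) for beta, (s, b, d) in model))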
Figure 2: The two training sets considered in Q3: (A) for Q3.1, Q3.2, and (B) for Q3.3, Q3.4, Q3.5, Q3.6.
In this question, you are going to experiment with the learning process of Adaboost, with decision stumps as H. A decision stump h ∈ H is a classifier characterized by a triplet (s ∈ {+1, −1}, b ∈ R, d ∈ {1, 2, ..., D}) such that
h(s,b,d)(x) = { s, if xd > b; −s, otherwise }.  (4)
That is, each decision stump function only looks at a single dimension/entry xd of the input vector x, and checks whether xd is larger than b or not (s decides which label to assign if xd > b).
Specifically, you are given a simple 4-example training set, as illustrated in Fig. 2 (A).
For simplicity, let's consider
H = {h(s,b,d) | s ∈ {+1, −1}, b ∈ {−2, −0.5, 0.5, 2}, d ∈ {1, 2}}.  (5)
That is, H contains only 16 distinct decision stump functions (8 horizontal boundaries and 8 vertical boundaries).
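A small Python sketch of eq. (4) and eq. (5), assuming the threshold set as written above; the function name h and the use of a 1-indexed d are our own choices.

from itertools import product

def h(s, b, d):
    """Decision stump from eq. (4); returns a classifier on x in R^2 (d is 1-indexed)."""
    return lambda x: s if x[d - 1] > b else -s

# The hypothesis set H from eq. (5): 2 signs x 4 thresholds x 2 dimensions = 16 stumps.
H = [h(s, b, d) for s, b, d in product((+1, -1), (-2.0, -0.5, 0.5, 2.0), (1, 2))]
print(len(H))   # 16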
Q3.1 Let's get started and run Adaboost for T = 3. At t = 1, please find the best decision stump f1 (i.e., solve step (a) in Algorithm 1). If there are multiple equally good stump functions, just randomly pick ONE of them to be f1. What are the corresponding ε1 and β1 based on f1?
What to submit: Write down the triplet (s, b, d) of f1, and the values of ε1 and β1. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.2 From your results in Q3.1 and Algorithm 1, you will observe something interesting about Adaboost: if βt = 0, the corresponding iteration t does not contribute anything to the final decision function F(x). This might be fine, since we can move on to the next iteration and probably get βt+1 ≠ 0 there. Let's have a try. Please write down w2(n), ∀ n ∈ {1, 2, ..., N}, and check the objective function to solve at t = 2 (i.e., step (a) in Algorithm 1). You will observe that the objective function at t = 2 remains exactly the SAME as at t = 1. The Adaboost algorithm thus will likely get stuck after iteration t = 1.
What to submit: Write down w2(n), ∀ n ∈ {1, 2, 3, 4}. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
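For reference, one line of the underlying algebra, using steps (d) and (e) of Algorithm 1:

w2(n) = w1(n) exp(∓β1) / Σ_{n′} w1(n′) exp(∓β1) = w1(n),   since β1 = 0 gives exp(∓β1) = exp(0) = 1,

so the weighted objective in step (a) at t = 2 is identical to the one at t = 1.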
Q3.3 From the observations in Q3.2, it seems that Adaboost is not really learning anything strong. We argue that this is because of the form of decision stumps: each of them can only draw a horizontal or vertical decision boundary in our 2-dimensional case. Can we apply any simple trick to solve this issue?
Let's try one trick that you have learned: feature transformation. Specifically, we apply a simple linear transformation (a 2-dimensional rotation R ∈ R^{2×2}) to every example,
xn ← R xn,  (6)
which results in a transformed training set {(R xn, yn)}, n = 1, 2, 3, 4.
From now on, let's only consider this new training set (as illustrated in Fig. 2 (B)), but use the same H defined in eq. (5).
Let's get started again and run Adaboost for T = 3, using the new training set. At t = 1, please find the best decision stump f1 (i.e., solve step (a) in Algorithm 1). If there are multiple equally good stump functions, just randomly pick ONE of them to be f1. What are the corresponding ε1 and β1 based on f1?
What to submit: Write down the triplet (s, b, d) of f1, and the values of ε1 and β1. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.4 Now you should see that Adaboost does learn something at t = 1 (i.e., β1 ≠ 0). Let's move on to t = 2. Please first write down w2(n), ∀ n ∈ {1, 2, 3, 4}, for each example. Then, please find the best decision stump f2 (i.e., solve step (a) in Algorithm 1). If there are multiple equally good stump functions, just randomly pick ONE of them to be f2. What are the corresponding ε2 and β2 based on f2?
What to submit: Write down w2(n), ∀ n ∈ {1, 2, 3, 4}, the triplet (s, b, d) of f2, and the values of ε2 and β2. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.5 Let's move on to t = 3. Please first write down w3(n), ∀ n ∈ {1, 2, 3, 4}, for each example. Then, please find the best decision stump f3 (i.e., solve step (a) in Algorithm 1). If there are multiple equally good stump functions, just randomly pick ONE of them to be f3. What are the corresponding ε3 and β3 based on f3?
What to submit: Write down w3(n), ∀ n ∈ {1, 2, 3, 4}, the triplet (s, b, d) of f3, and the values of ε3 and β3. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.6 Please combine your results of Q3.3, Q3.4, and Q3.5, and write down your learned F(x) in terms of h(s,b,d), β1, β2, and β3. You should plug VALUES into s, b, d, β1, β2, β3. For example,
F(x) = sign[0.5 h(+1,0.5,1)(x) + 0.5 h(−1,0.5,1)(x) + 0.5 h(−1,2,2)(x)].
Then please write down the predicted labels of x1, x2, x3, x4 using F(x). How many training examples are correctly labeled by F(x)?
What to submit: Write down F(x), compute F(x1), F(x2), F(x3), F(x4), and report the number of correctly labeled examples. Please round your results to two decimal places (e.g., 1.499 becomes 1.50).
Q3.7 From Q3.6, you should see that Adaboost with decision stumps can solve a problem similar to XOR in 3 iterations, with the help of a simple linear transformation (more precisely, a 2-dimensional rotation). Note that the decision boundary produced by F(x) is nonlinear, even though each h ∈ H can only produce a linear boundary. Also note that, even with the same feature transformation, a linear classifier such as logistic regression cannot correctly label every training example in our case.
What to submit: Nothing.
Programming component
4 High-level descriptions
4.1 Dataset
We will use mnist_subset (images of handwritten digits from 0 to 9). This is the same subset of the full MNIST dataset that we used for Homework 1 and Homework 2. As before, the dataset is stored in a JSON-formatted file mnist_subset.json. You can access its training, validation, and test splits using the keys 'train', 'valid', and 'test', respectively. For example, suppose we load mnist_subset.json into the variable x. Then, x['train'] refers to the training set of mnist_subset. This set is a list with two elements: x['train'][0] contains the features of size N (samples) × D (feature dimension), and x['train'][1] contains the corresponding labels of size N.
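For reference, loading the file and inspecting the training split might look like the sketch below; the variable names are our own choices.

import json
import numpy as np

with open('mnist_subset.json', 'r') as f:
    x = json.load(f)

Xtrain = np.asarray(x['train'][0])   # N x D feature matrix
ytrain = np.asarray(x['train'][1])   # N labels
print(Xtrain.shape, ytrain.shape)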
4.2 Tasks
You will be asked to implement the linear support vector machine (SVM) for binary classification
(Sect. 5). Specifically, you will
- finish implementing the following three functions in pegasos.py: objective_function, pegasos_train, and pegasos_test. Refer to pegasos.py and Sect. 5 for more information.
- Run the script pegasos.sh after you finish your implementation. This will output pegasos.json.
- add, commit, and push (1) the pegasos.py script, and (2) the pegasos.json file that you have created.
As in the previous homework, you are not responsible for loading/pre-processing the data; we have done that for you (e.g., we have pre-processed the data into two classes: digits 0–4 and digits 5–9). For specific instructions, please refer to the text in Sect. 5 and the instructions in pegasos.py.
4.3 Cautions
Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and generated output files.
5 Pegasos: a stochastic-gradient-based solver for linear SVM
In this question, you will build a linear SVM classifier using the Pegasos algorithm [1]. Given a training set {(xn ∈ R^D, yn ∈ {+1, −1})}, n = 1, ..., N, the primal formulation of linear SVM is as follows:

min_w  (λ/2) ||w||^2 + (1/N) Σn max{0, 1 − yn w^T xn}.  (7)
Instead of turning it into the dual formulation, we are going to solve the primal formulation directly with a gradient-based algorithm. Note that here we include the bias term b in the parameter w by appending a constant feature 1 to x.
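As an illustration of eq. (7) (not a drop-in replacement for the required objective_function, whose exact signature is specified in pegasos.py), a NumPy sketch of the objective could look like this, assuming labels in {+1, −1} and the constant feature already appended:

import numpy as np

def svm_objective(w, X, y, lam):
    """Eq. (7): (lam/2) * ||w||^2 + mean hinge loss over the training set."""
    margins = y * (X @ w)                       # y_n * w^T x_n for every n
    hinge = np.maximum(0.0, 1.0 - margins)      # max{0, 1 - y_n w^T x_n}
    return 0.5 * lam * np.dot(w, w) + hinge.mean()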
In (batch) gradient descent, at each iteration of parameter update, we compute the gradients for all data points and take the average (or sum). When the training set is large (i.e., N is a large number), this is often too computationally expensive (for example, when the data cannot be held in memory all at once). Stochastic gradient descent with mini-batches alleviates this issue by computing the gradient on only a subset of the data at each iteration.
One key issue of using (stochastic) gradient descent to solve eq. (7) is that max{0, z} is not differentiable at z = 0. You have seen this issue in Homework 2, where we used the Heaviside function to deal with it. In this question, you are going to learn and implement Pegasos, a representative solver of eq. (7) that applies stochastic gradient descent with mini-batches and explicitly takes care of the non-differentiability issue.
The pseudocode of Pegasos is given in Algorithm 2. At the t-th iteration of parameter update, we first take a mini-batch of data At of size K [step (a)], and define At+ ⊆ At to be the subset of samples in At for which wt suffers a non-zero hinge loss [step (b)]. Next we set the learning rate ηt = 1/(λt) [step (c)]. We then perform a two-step parameter update as follows. We first compute (1 − ηtλ)wt and, for all samples (x, y) ∈ At+, we add the vector (ηt/K) y x to (1 − ηtλ)wt. We denote the resulting vector by wt+1/2 [step (d)]. This step can also be written as wt+1/2 = wt − ηt ∇t, where

∇t = λ wt − (1/K) Σ_{(x,y)∈At+} y x.

The definition of the hinge loss, together with the Heaviside function, implies that ∇t is the gradient of the objective function on the mini-batch At at wt. Last, we set wt+1 to be the projection of wt+1/2 onto the set

B = {w : ||w|| ≤ 1/√λ}.

This is obtained by scaling wt+1/2 by min{1, (1/√λ)/||wt+1/2||} [step (e)]. For details of the Pegasos algorithm, you may refer to the original paper [1].
Input A training set, the total number of iterations T, the batch size K, and the regularization parameter λ.
Output The last weight wT+1.
Initialization Choose w1 s.t. ||w1|| ≤ 1/√λ.
for t = 1, 2, ..., T do
  (a) Randomly pick a mini-batch At of size K from the training set.
  (b) Set At+ = {(x, y) ∈ At : y wt^T x < 1}.
  (c) Set ηt = 1/(λt).
  (d) Set wt+1/2 = (1 − ηtλ)wt + (ηt/K) Σ_{(x,y)∈At+} y x.
  (e) Set wt+1 = min{1, (1/√λ)/||wt+1/2||} · wt+1/2.
end
Algorithm 2: The Pegasos algorithm
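To make the steps concrete, here is a self-contained NumPy sketch of Algorithm 2. It is an illustration only: the function name, argument list, sampling without replacement, and the small constant added to the norm are our own choices, and it is not a drop-in replacement for the pegasos_train interface required in pegasos.py.

import numpy as np

def pegasos_sketch(X, y, lam, T, K, seed=0):
    """Run T Pegasos iterations on (X, y) with mini-batch size K; labels in {+1, -1}; K <= N."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)                                    # satisfies ||w_1|| <= 1/sqrt(lam)
    for t in range(1, T + 1):
        idx = rng.choice(N, size=K, replace=False)     # (a) mini-batch A_t
        Xb, yb = X[idx], y[idx]
        plus = yb * (Xb @ w) < 1                       # (b) A_t^+: samples with non-zero loss
        eta = 1.0 / (lam * t)                          # (c) learning rate
        w_half = (1 - eta * lam) * w + (eta / K) * (yb[plus, None] * Xb[plus]).sum(axis=0)  # (d)
        scale = min(1.0, (1.0 / np.sqrt(lam)) / (np.linalg.norm(w_half) + 1e-12))           # (e)
        w = scale * w_half
    return w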
Now you are going to implement Pegasos and train a binary linear SVM classifier on the dataset mnist_subset.json. You are going to implement three functions, objective_function, pegasos_train, and pegasos_test, in a script named pegasos.py. You will find detailed programming instructions in the script.
Q5.1 Finish the implementation of the function objective_function, which corresponds to the objective function in the primal formulation of SVM.
Q5.2 Finish the implementation of the function pegasos_train, which corresponds to Algorithm 2.
Q5.3 After you train your model, run your classifier on the test set and report the accuracy, which is defined as:

accuracy = (# of correctly classified test samples) / (# of test samples).

Finish the implementation of the function pegasos_test.
Q5.4 After you complete the above steps, run pegasos.sh, which will run the Pegasos algorithm for 500 iterations under 6 settings (mini-batch size K = 100 with different λ ∈ {0.01, 0.1, 1}, and λ = 0.1 with different K ∈ {1, 10, 1000}), and output a pegasos.json file that records the test accuracy and the value of the objective function at each iteration during training.
What to do and submit: run the script pegasos.sh. It will generate pegasos.json. Add, commit, and push both pegasos.py and pegasos.json before the due date.
Q5.5 Based on pegasos.json, you are encouraged to plot the objective function value versus the number of iterations (i.e., a convergence curve) under the different settings of λ and K, but you are not required to submit these plots.
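If you do make such plots, a rough sketch is below; it assumes, hypothetically, that pegasos.json stores a dictionary mapping each setting name to a list of per-iteration objective values, which may not match the actual format produced by pegasos.sh, so adapt the keys to what your file contains.

import json
import matplotlib.pyplot as plt

with open('pegasos.json', 'r') as f:
    results = json.load(f)

# Hypothetical structure: {setting_name: [objective value per iteration], ...}
for name, obj_values in results.items():
    plt.plot(range(1, len(obj_values) + 1), obj_values, label=name)

plt.xlabel('Iteration')
plt.ylabel('Objective value')
plt.legend()
plt.show()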