1 Bias-Variance Tradeoff
Consider a dataset with $n$ data points $(x_i, y_i)$, $x_i \in \mathbb{R}^{p \times 1}$, drawn from the following linear model:
$$y = x^\top \beta^* + \epsilon,$$
where $\epsilon$ is Gaussian noise and the star sign is used to differentiate the true parameter from the estimators that will be introduced later. Consider the $\ell_2$-regularized linear regression as follows:
$$\hat{\beta}_\lambda = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2,$$
where $\lambda \geq 0$ is the regularization parameter and $X \in \mathbb{R}^{n \times p}$ denotes the matrix obtained by stacking $x_i^\top$ in each row. Properties of an affine transformation of a Gaussian random variable will be useful throughout this problem.
- Find the closed-form solution for $\hat{\beta}_\lambda$ and its distribution.
- Calculate the bias term $\mathbb{E}[x^\top \hat{\beta}_\lambda] - x^\top \beta^*$ as a function of $\lambda$ and some fixed test point $x$.
- Calculate the variance term $\mathbb{E}\big[\big(x^\top \hat{\beta}_\lambda - \mathbb{E}[x^\top \hat{\beta}_\lambda]\big)^2\big]$ as a function of $\lambda$ and some fixed test point $x$.
- Use the results from parts (b) and (c) and the bias-variance theorem to analyze the impact of $\lambda$ on the squared error. Specifically, which term dominates when $\lambda$ is small or large? (A numerical sketch illustrating the tradeoff follows this list.)
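The following is a minimal Matlab sketch of this tradeoff, assuming the closed form $\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$ (which part (a) asks you to derive); the sizes, noise level, and number of trials are illustrative choices, not part of the problem.

```matlab
% Empirical illustration of the bias-variance tradeoff in ridge regression.
% Assumes the closed form beta_hat = (X'X + lambda*I)^(-1) * X'y from part (a).
n = 100; p = 10; sigma = 1;           % illustrative sizes and noise level
trials = 2000;                        % Monte Carlo repetitions
beta_star = randn(p, 1);              % true parameter
X = randn(n, p);                      % fixed design matrix
x_test = randn(p, 1);                 % fixed test point
lambdas = 10.^(-3:0.5:3);
bias2 = zeros(size(lambdas)); vars = zeros(size(lambdas));
for k = 1:numel(lambdas)
    A = (X'*X + lambdas(k)*eye(p)) \ X';   % maps y to beta_hat
    preds = zeros(trials, 1);
    for t = 1:trials
        y = X*beta_star + sigma*randn(n, 1);
        preds(t) = x_test' * (A*y);        % x'*beta_hat for this draw
    end
    bias2(k) = (mean(preds) - x_test'*beta_star)^2;
    vars(k) = var(preds);
end
% Bias^2 grows with lambda while variance shrinks; their sum is U-shaped.
loglog(lambdas, bias2, '-o', lambdas, vars, '-s');
xlabel('\lambda'); legend('bias^2', 'variance');
```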
2 Kernelized Perceptron
Given a set of training samples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ where $y_i \in \{-1, 1\}$, the Perceptron algorithm learns a weight vector $w$ by iterating through all training samples. For each $x_i$, if the prediction is incorrect, we update $w$ by $w \leftarrow w + y_i x_i$. Now we would like to kernelize the Perceptron algorithm. Assume we map $x$ to $\phi(x)$ through a nonlinear feature mapping, and we want to learn a new weight vector $w$ that makes predictions by $y = \operatorname{sign}(w^\top \phi(x))$. Further assume that we initialize the algorithm with $w = 0$.
- Show that $w$ is always a linear combination of the feature vectors, i.e., $w = \sum_{i=1}^{N} \alpha_i \phi(x_i)$.
- Show that while the update rule for $w$ for a kernelized Perceptron does depend on the explicit feature mapping $\phi(x)$, the prediction can be re-expressed so that it depends only on the inner products between nonlinearly transformed features.
- Show that we do not need to explicitly store $w$ at training or test time. Instead, we can use it implicitly by maintaining all the $\alpha_i$. Please give an outline of the algorithm that would allow us to not store $w$. You should indicate how $\alpha_i$ is initialized, when to update $\alpha_i$, and how it is updated. (A sketch of such an algorithm follows.)
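The following Matlab sketch outlines one such algorithm, assuming a precomputed kernel matrix K with $K_{ij} = k(x_i, x_j)$ and a fixed number of passes over the data; the function name and interface are illustrative.

```matlab
% Kernelized Perceptron: never form w = sum_i alpha_i * phi(x_i) explicitly;
% maintain the coefficients alpha instead. K(i,j) = k(x_i, x_j).
function alpha = kernel_perceptron(K, y, epochs)
    N = numel(y);
    alpha = zeros(N, 1);               % alpha_i = 0 for all i (i.e., w = 0)
    for e = 1:epochs
        for i = 1:N
            % w'*phi(x_i) = sum_j alpha_j * K(j,i): inner products only
            if sign(alpha' * K(:, i)) ~= y(i)   % misclassified
                alpha(i) = alpha(i) + y(i);     % w <- w + y_i*phi(x_i)
            end
        end
    end
end
```

At test time, the prediction at a new point $x$ is $\operatorname{sign}\big(\sum_j \alpha_j k(x_j, x)\big)$, so $w$ itself never needs to be formed.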
3 Kernels
Mercer's theorem implies that a bivariate function $k(\cdot,\cdot)$ is a positive definite kernel function iff, for any $N$ and any $x_1, x_2, \ldots, x_N$, the corresponding kernel matrix $K$ is positive semidefinite, where $K_{ij} = k(x_i, x_j)$. Recall that a matrix $A \in \mathbb{R}^{n \times n}$ is positive semidefinite if all of its eigenvalues are non-negative, or equivalently, if $x^\top A x \geq 0$ for an arbitrary vector $x \in \mathbb{R}^n$ [1].
Suppose $k_1(\cdot,\cdot)$ and $k_2(\cdot,\cdot)$ are positive definite kernel functions with corresponding kernel matrices $K_1$ and $K_2$. Use Mercer's theorem to show that the following kernel functions are positive definite. (A numerical sanity check follows the list.)
- $K_3 = a_1 K_1 + a_2 K_2$, for $a_1, a_2 \geq 0$.
- $K_4$ defined by $k_4(x, x') = f(x) f(x')$, where $f(\cdot)$ is an arbitrary real-valued function.
- $K_5$ defined by $k_5(x, x') = k_1(x, x')\, k_2(x, x')$.
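As a numerical sanity check (not a proof), one can verify on random data that the constructed kernel matrices have non-negative eigenvalues. In this Matlab sketch, the choices of $k_1$, $k_2$, and $f$ are arbitrary illustrations.

```matlab
% Numerical sanity check that K3, K4, K5 are PSD on random data.
N = 50; X = randn(N, 5);
k1 = @(a, b) exp(-norm(a - b)^2);      % an RBF kernel (illustrative k1)
k2 = @(a, b) a' * b;                   % the linear kernel (illustrative k2)
f  = @(a) sin(a(1)) + a(2)^2;          % an arbitrary real-valued function
K1 = zeros(N); K2 = zeros(N); K4 = zeros(N);
for i = 1:N
    for j = 1:N
        K1(i,j) = k1(X(i,:)', X(j,:)');
        K2(i,j) = k2(X(i,:)', X(j,:)');
        K4(i,j) = f(X(i,:)') * f(X(j,:)');
    end
end
K3 = 2*K1 + 3*K2;                      % a1 = 2 >= 0, a2 = 3 >= 0
K5 = K1 .* K2;                         % elementwise (Schur) product
% Minimum eigenvalues should be >= 0 up to numerical tolerance.
fprintf('min eig: K3 = %.2e, K4 = %.2e, K5 = %.2e\n', ...
        min(eig(K3)), min(eig(K4)), min(eig(K5)));
```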
4 Soft Margin Hyperplanes
The function of the slack variables used in the optimization problem for soft margin hyperplanes has the form $\sum_i \xi_i$. Instead, we could use $\sum_i \xi_i^p$, with $p > 1$.
- Give the dual formulation of the problem in this general case. (For reference, the standard $p = 1$ dual is recalled after this list.)
- How does this more general formulation (p > 1) compare to the standard setting (p = 1) discussed in lecture? Is the general formulation more or less complex? Justify your answer.
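For reference, the dual of the standard $p = 1$ soft margin problem derived in lecture is:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C \;\; \forall i.$$

Your derivation for $p > 1$ can be compared against this form; note in particular how the box constraint $\alpha_i \leq C$ arises from the linear slack penalty.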
5 Programming
In this problem, you will experiment with SVMs on a real-world dataset. You will implement a linear SVM (i.e., an SVM using the original features). You will also use a widely used SVM toolbox called LibSVM to experiment with kernel SVMs.
Dataset: We have provided the Splice Dataset from UCI's machine learning data repository. The provided binary classification dataset has 60 input features, and the training and test sets contain 1,000 and 2,175 samples, respectively (the files are called splice_train.mat and splice_test.mat).
5.1 Data preprocessing
Preprocess the training and test data by
- computing the mean of each dimension and subtracting it from each dimension
- dividing each dimension by its standard deviation
Notice that the mean and standard deviation should be estimated from the training data and then applied to both datasets. Explain why this is the case. Also, report the mean and the standard deviation of the third and 10th features on the test data.
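A minimal Matlab sketch of this preprocessing, assuming the loaded feature matrices have one sample per row (the variable names X_train and X_test are illustrative):

```matlab
% Standardize with statistics estimated on the TRAINING data only, then
% apply the same transformation to both sets.
mu = mean(X_train, 1);                 % per-dimension mean (1 x 60)
sd = std(X_train, 0, 1);               % per-dimension std  (1 x 60)
X_train = (X_train - repmat(mu, size(X_train, 1), 1)) ...
          ./ repmat(sd, size(X_train, 1), 1);
X_test  = (X_test  - repmat(mu, size(X_test, 1), 1)) ...
          ./ repmat(sd, size(X_test, 1), 1);
% Reusing training statistics avoids leaking test-set information and keeps
% the two sets on a common scale.
```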
5.2 Implement linear SVM
Please fill in the Matlab functions trainsvm in trainsvm.m and testsvm in testsvm.m.
The input of trainsvm contains training feature vectors and labels, as well as the tradeoff parameter C. The output of trainsvm contains the SVM parameters (weight vector and bias). In your implementation, you need to solve the SVM in its primal form:
$$\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \;\; y_i (w^\top x_i + b) \geq 1 - \xi_i, \;\; \forall i$$
$$\xi_i \geq 0, \;\; \forall i$$
Please use the quadprog function in Matlab to solve the above quadratic problem.
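One way to map the primal above onto quadprog's canonical form $\min_z \frac{1}{2} z^\top H z + f^\top z$ s.t. $Az \leq b$, $lb \leq z$, using the stacked variable $z = [w; b; \xi]$, is sketched below; treat it as an illustrative starting point rather than the required implementation.

```matlab
function [w, b] = trainsvm(X, y, C)
% X: n x d features, y: n x 1 labels in {-1,+1}, C: tradeoff parameter.
% Stack the decision variable as z = [w (d); b (1); xi (n)].
[n, d] = size(X);
H = blkdiag(eye(d), 0, zeros(n));          % quadratic term: (1/2)*||w||^2
f = [zeros(d + 1, 1); C*ones(n, 1)];       % linear term: C*sum(xi)
% y_i*(w'*x_i + b) >= 1 - xi_i  <=>  -y_i*x_i'*w - y_i*b - xi_i <= -1
A = [-bsxfun(@times, y, X), -y, -eye(n)];
rhs = -ones(n, 1);
lb = [-inf(d + 1, 1); zeros(n, 1)];        % xi >= 0; w, b unconstrained
z = quadprog(H, f, A, rhs, [], [], lb, []);
w = z(1:d);
b = z(d + 1);
end
```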
For testsvm, the input contains testing feature vectors and labels, as well as SVM parameters. The output contains the test accuracy.
5.3 Cross validation for linear SVM
Use 5-fold cross validation to select the optimal C for your implementation of the linear SVM. (A sketch of the fold logic follows the questions below.)
- Report the cross-validation accuracy (averaged accuracy over each validation set) and average training time (averaged over each training subset) for different C taken from $\{4^{-6}, 4^{-5}, \ldots, 4, 4^2\}$. How does the value of C affect the cross validation accuracy and average training time? Explain your observations.
- Which C do you choose based on the cross validation results?
- For the selected C, report the test accuracy.
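A sketch of the fold logic, assuming the trainsvm interface above and labels in $\{-1, +1\}$; the deterministic fold assignment is one arbitrary choice, and a random permutation works as well.

```matlab
% 5-fold cross validation over candidate C values for the primal solver.
Cs = 4.^(-6:2);                        % {4^-6, ..., 4, 4^2}
n = size(X_train, 1);
fold = mod((1:n)' - 1, 5) + 1;         % fold id in 1..5 for every sample
cvacc = zeros(numel(Cs), 1);
for k = 1:numel(Cs)
    accs = zeros(5, 1);
    for v = 1:5
        tr = (fold ~= v); va = (fold == v);
        [w, b] = trainsvm(X_train(tr, :), y_train(tr), Cs(k));
        pred = sign(X_train(va, :)*w + b);
        accs(v) = mean(pred == y_train(va));
    end
    cvacc(k) = mean(accs);             % averaged validation accuracy
end
[~, best] = max(cvacc);
fprintf('selected C = %g\n', Cs(best));
```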
5.4 Use linear SVM in LibSVM
LibSVM is a widely used toolbox for SVMs, and it has a Matlab interface. Download LibSVM from http://www.csie.ntu.edu.tw/~cjlin/libsvm/ and install it according to the README file (make sure to use the Matlab interface provided in the LibSVM toolbox). For each C from $\{4^{-6}, 4^{-5}, \ldots, 4, 4^2\}$, apply 5-fold cross validation (use the -v option in LibSVM) and report the cross validation accuracy and average training time. (A usage sketch follows the questions below.)
- Is the cross validation accuracy the same as that in Section 5.3? Note that LibSVM solves the linear SVM in its dual form, while your implementation does it in the primal form.
- How does LibSVM compare with your implementation in terms of training time?
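With LibSVM's Matlab interface, the cross validation loop looks as follows (a sketch; svmtrain is the name used by the stock LibSVM Matlab interface, -t 0 selects the linear kernel, and -q suppresses output).

```matlab
% Linear kernel (-t 0) with built-in 5-fold cross validation (-v 5).
Cs = 4.^(-6:2);
for k = 1:numel(Cs)
    opts = sprintf('-t 0 -c %g -v 5 -q', Cs(k));
    tic;
    acc = svmtrain(y_train, X_train, opts);   % with -v, returns CV accuracy
    fprintf('C = %-10g CV acc = %5.2f%%  time = %.2fs\n', Cs(k), acc, toc);
end
```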
5.5 Use kernel SVM in LibSVM
LibSVM supports a number of kernel types. Here you need to experiment with the polynomial kernel and RBF (Radial Basis Function) kernel.
- Polynomial kernel. Please tune C and the degree of the kernel. For each combination of (C, degree), where $C \in \{4^{3}, 4^{4}, \ldots, 4^{6}, 4^{7}\}$ and degree $\in \{1, 2, 3\}$, report the 5-fold cross validation accuracy and average training time.
- RBF kernel. Please tune C and gamma in the kernel. For each combination of (C, gamma), where $C \in \{4^{3}, 4^{4}, \ldots, 4^{6}, 4^{7}\}$ and gamma $\in \{4^{-7}, 4^{-6}, \ldots, 4^{1}, 4^{2}\}$, report the 5-fold cross validation accuracy and average training time.
Based on the cross validation results of the polynomial and RBF kernels, which kernel type and kernel parameters will you choose? Report the corresponding test accuracy for the configuration with the highest cross validation accuracy. (A grid-search sketch follows.)
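A sketch of the grid search using LibSVM's kernel options (-t 1 with -d for the polynomial kernel, -t 2 with -g for the RBF kernel), over the value ranges listed above:

```matlab
% Polynomial kernel: -t 1, degree set via -d.
for C = 4.^(3:7)
    for deg = 1:3
        acc = svmtrain(y_train, X_train, ...
                       sprintf('-t 1 -d %d -c %g -v 5 -q', deg, C));
        fprintf('poly: C = %g, degree = %d, CV acc = %.2f%%\n', C, deg, acc);
    end
end
% RBF kernel: -t 2, gamma set via -g.
for C = 4.^(3:7)
    for g = 4.^(-7:2)
        acc = svmtrain(y_train, X_train, ...
                       sprintf('-t 2 -g %g -c %g -v 5 -q', g, C));
        fprintf('rbf:  C = %g, gamma = %g, CV acc = %.2f%%\n', C, g, acc);
    end
end
```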