- Let $X = \{x_1, \ldots, x_N\}$ with $x_t \in \mathbb{R}^D$, $t = 1, \ldots, N$, be a given training set. Assume that the dataset is centered, i.e., $\frac{1}{N}\sum_{t=1}^N x_t = 0$. We focus on performing linear dimensionality reduction on the dataset using PCA (principal component analysis). With PCA, for each $x_t \in \mathbb{R}^D$, we get $z_t = W x_t$, where $z_t \in \mathbb{R}^d$, $d < D$, is the low-dimensional projection, and $W \in \mathbb{R}^{d \times D}$ is the PCA projection matrix. Let $\Sigma = \frac{1}{N}\sum_{t=1}^N x_t x_t^T$ be the sample covariance matrix. Further, let $v_t = W^T z_t$ so that $v_t \in \mathbb{R}^D$.
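For concreteness, here is a minimal NumPy sketch of this setup on synthetic data, taking $W$ (as is standard for PCA) to have the top-$d$ eigenvectors of $\Sigma$ as its rows; variable names mirror the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d = 100, 5, 2

# Centered training set: row t is x_t.
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)

# Sample covariance matrix: Sigma = (1/N) * sum_t x_t x_t^T.
Sigma = (X.T @ X) / N

# PCA projection matrix W (d x D): rows are the top-d eigenvectors of Sigma.
eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
W = eigvecs[:, np.argsort(eigvals)[::-1][:d]].T   # top-d eigenvectors, as rows

Z = X @ W.T   # row t is z_t = W x_t   (in R^d)
V = Z @ W     # row t is v_t = W^T z_t (back in R^D)
```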
- Professor HighLowHigh claims: $v_t = x_t$ for all $t = 1, \ldots, N$. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.
- Professor HighLowHigh also claims:
$$\sum_{t=1}^N \|x_t\|_2^2 - \sum_{t=1}^N \|v_t\|_2^2 = \sum_{t=1}^N \|x_t - v_t\|_2^2 \,,$$
where for a vector $a$, $\|a\|_2$ denotes its Euclidean ($L_2$) norm. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.
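Continuing the sketch above, the claim can be probed numerically before proving (or refuting) it:

```python
# X and V as computed in the PCA sketch above.
lhs = np.sum(X**2) - np.sum(V**2)
rhs = np.sum((X - V)**2)
print(np.isclose(lhs, rhs))  # your proof should explain this outcome
```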
- Let $Z = \{(x_1, r_1), \ldots, (x_N, r_N)\}$, $x_t \in \mathbb{R}^d$, $r_t \in \mathbb{R}^k$, be a set of $N$ training samples. We consider training a multilayer perceptron as shown in Figure 1. We consider a general setting where the transfer functions at each stage are denoted by $g$, i.e.,
$$z_h^t = g(a_h^t) \quad \text{and} \quad y_i^t = g(a_i^t) \,,$$
where $a_h^t, a_i^t$ respectively denote the input activation for hidden node $h$ and output node $i$. Further, let $L(\cdot, \cdot)$ be the loss function, so that the learning focuses on minimizing:
$$E(W, V \mid Z) = \sum_{t=1}^N \sum_{i=1}^k L(r_i^t, y_i^t) \,.$$
- Show that the stochastic gradient descent update for $v_{i,h}$ is of the form $v_{i,h}^{\text{new}} = v_{i,h}^{\text{old}} + \Delta v_{i,h}$, with the update
$$\Delta v_{i,h} = \eta \, \delta_i^t \, z_h^t \,, \quad \text{where} \quad \delta_i^t = -\frac{\partial L(r_i^t, y_i^t)}{\partial a_i^t} \,, \qquad (1)$$
and $\eta > 0$ is the step size (learning rate).
- Show that the stochastic gradient descent update for $w_{h,j}$ is of the form $w_{h,j}^{\text{new}} = w_{h,j}^{\text{old}} + \Delta w_{h,j}$, with the update
$$\Delta w_{h,j} = \eta \, \delta_h^t \, x_j^t \,, \quad \text{where} \quad \delta_h^t = \left( \sum_{i=1}^k \delta_i^t \, v_{i,h} \right) g'(a_h^t) \,. \qquad (2)$$
Figure 1: Two layer perceptron.
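Before doing the general derivation, it may help to see the updates in (1) and (2) instantiated. The sketch below performs one stochastic gradient step for a single sample $(x^t, r^t)$ under the illustrative assumptions $L(r, y) = \frac{1}{2}(r - y)^2$ and $g = \text{sigmoid}$; your derivation should keep $L$ and $g$ general.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(W, V, x, r, eta=0.1):
    """One SGD step for the two-layer perceptron, instantiating (1) and (2).

    W: (H, d) input-to-hidden weights; V: (k, H) hidden-to-output weights.
    Assumes L(r, y) = 0.5 * (r - y)^2 and g = sigmoid (illustrative choices).
    """
    # Forward pass.
    a_h = W @ x           # input activations of the hidden nodes
    z = sigmoid(a_h)      # z_h^t = g(a_h^t)
    a_i = V @ z           # input activations of the output nodes
    y = sigmoid(a_i)      # y_i^t = g(a_i^t)

    # Backward pass: for this L and g, delta_i^t = (r_i - y_i) * g'(a_i^t).
    delta_i = (r - y) * y * (1.0 - y)
    # delta_h^t = (sum_i delta_i^t * v_{i,h}) * g'(a_h^t), as in (2).
    delta_h = (V.T @ delta_i) * z * (1.0 - z)

    V += eta * np.outer(delta_i, z)  # Delta v_{i,h} = eta * delta_i^t * z_h^t
    W += eta * np.outer(delta_h, x)  # Delta w_{h,j} = eta * delta_h^t * x_j^t
    return W, V
```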
Programming assignment:
The next problem involves programming. For Question 3, we will be using the 2-class classification datasets Boston50 and Boston75. In particular, we will develop code for 2-class Support Vector Machines (SVMs) trained using gradient descent. The goal will be to modify your code for MyLogisticReg2 from HW3.
- We will develop code for 2-class SVMs with parameters $(w, w_0)$, where $w \in \mathbb{R}^d$ and $w_0 \in \mathbb{R}$. Assume a given dataset $\{(x_t, y_t), t = 1, \ldots, N\}$, where $x_t \in \mathbb{R}^d$ and $y_t \in \{-1, +1\}$. Recall from our discussion in class that training SVMs involves minimizing the following objective function:
$$J(w, w_0) = \frac{\lambda}{2} \|w\|_2^2 + \sum_{t=1}^N \max\left(0, 1 - y_t \left(w^T x_t + w_0\right)\right) \,. \qquad (3)$$
We will use $\lambda = 5$ in this assignment.
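For concreteness, a minimal sketch of evaluating (3) as reconstructed above (svm_objective is an illustrative helper name, not a required function):

```python
import numpy as np

def svm_objective(w, w0, X, y, lam=5.0):
    """Objective (3): (lam/2) * ||w||^2 + sum_t max(0, 1 - y_t * (w.x_t + w0))."""
    margins = y * (X @ w + w0)
    return 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - margins).sum()
```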
For reference, compare the objective function to that of regularized logistic regression, which you recently worked on as part of HW3:
$$J(w, w_0) = \frac{\lambda}{2} \|w\|_2^2 + \sum_{t=1}^N \log\left(1 + \exp\left(-y_t \left(w^T x_t + w_0\right)\right)\right) \,, \qquad (4)$$
where we had used $\lambda = 0$ for the HW3 code.
We will develop code for MySVM2 with corresponding MySVM2.fit(X,y) and MySVM2.predict(X) functions. Parameters for the model can be initialized following what you had done for MyLogisticReg2. In the fit function, the parameters will be estimated using mini-batch stochastic gradient descent with different mini-batch sizes $m \le n$. In particular, you will modify your MyLogisticReg2 code to use gradients for the SVM objective in (3) instead of the logistic regression objective in (4). Further, you will have to add the mini-batch stochastic gradient descent (SGD) functionality which, for a pre-specified mini-batch size $m$, picks $m$ unique points at random for the gradient step in each iteration. We will run experiments with different values of $m$.
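One way a single mini-batch step could look, sketched under the form of (3) above (the step size eta, the sampling scheme, and the name minibatch_sgd_step are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd_step(w, w0, X, y, m, lam=5.0, eta=0.01, rng=None):
    """One mini-batch SGD step on objective (3) using the hinge-loss subgradient."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(X.shape[0], size=m, replace=False)  # m unique random points
    Xb, yb = X[idx], y[idx]

    # Subgradient of the hinge term: only points with margin < 1 contribute.
    active = yb * (Xb @ w + w0) < 1.0
    grad_w = lam * w - (yb[active, None] * Xb[active]).sum(axis=0)
    grad_w0 = -yb[active].sum()
    # (How the batch term is scaled relative to the regularizer is a design choice.)

    return w - eta * grad_w, w0 - eta * grad_w0
```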
We will compare the performance of MySVM2 for different values of the mini-batch size $m$ with LogisticRegression[1] on two datasets: Boston50 and Boston75. Recall that Boston has 506 data points, and 5-fold cross-validation leaves $n \approx 400$ points for training in each fold.[2] For mini-batch SGD, we will consider three different values of $m$:
(i) m = 40, which is 10% of the dataset in each fold for 5-fold cross-validation,
(ii) m = 200, which is 50% of the dataset in each fold for 5-fold cross-validation, and
(iii) m = n, which is the full dataset in each fold for 5-fold cross-validation.
Note that m = n uses the full dataset (available for that fold) in each iteration and hence corresponds to the usual gradient descent.[3]
Using my_cross_val with 5-fold cross-validation, report the error rates in each fold as well as the mean and standard deviation of the error rates across all folds for the four methods: MySVM2 with m = 40, m = 200, and m = n, and LogisticRegression, applied to the two 2-class classification datasets: Boston50 and Boston75.
You will have to submit (a) code and (b) summary of results:
- Code: You will have to submit code for MySVM2() as well as a wrapper code q3().
For developing MySVM2(), you are encouraged to consult the code for MyLogisticReg2() from HW3. You need to make sure you have __init__, fit, and predict implemented in MySVM2. __init__(d, m) will initialize the parameters and will take the data dimensionality d and mini-batch size m as input. You can add additional inputs, such as the step size or the convergence threshold. fit(X, y) will take the data features X and labels y and will use mini-batch SGD to estimate the parameters w, w0. predict(X) will take a feature matrix corresponding to the test set and return the predicted labels. Your class MySVM2() will NOT inherit any base class in sklearn.
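One possible skeleton consistent with this interface (the hyperparameter defaults and the fixed iteration budget in place of a convergence test are assumptions):

```python
import numpy as np

class MySVM2:
    """2-class SVM trained with mini-batch SGD on objective (3); no sklearn base class."""

    def __init__(self, d, m, eta=0.01, lam=5.0, max_iter=1000):
        self.m, self.eta, self.lam, self.max_iter = m, eta, lam, max_iter
        self.w = np.zeros(d)   # or small random values, as in MyLogisticReg2
        self.w0 = 0.0

    def fit(self, X, y):
        rng = np.random.default_rng()
        n = X.shape[0]
        m = min(self.m, n)                 # a sentinel m >= n recovers full gradient descent
        for _ in range(self.max_iter):     # a convergence threshold could replace this
            idx = rng.choice(n, size=m, replace=False)
            Xb, yb = X[idx], y[idx]
            active = yb * (Xb @ self.w + self.w0) < 1.0
            self.w -= self.eta * (self.lam * self.w
                                  - (yb[active, None] * Xb[active]).sum(axis=0))
            self.w0 -= self.eta * (-yb[active].sum())
        return self

    def predict(self, X):
        return np.where(X @ self.w + self.w0 >= 0.0, 1, -1)
```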
The wrapper code (main file) has no input and is used to prepare the datasets and make calls to my_cross_val(method, X, y, k) to generate the error rate results for each dataset and each method. The code for my_cross_val(method, X, y, k) must be yours (e.g., code you wrote in HW1, with modifications as needed), and you cannot use cross_val_score() from sklearn. The results should be printed to the terminal (not written to an additional file in the folder). Make sure the calls to my_cross_val(method, X, y, k) are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used (a sketch of one possible wrapper follows the list):
- MySVM2 with m = 40 for Boston50;
- MySVM2 with m = 200 for Boston50;
- MySVM2 with m = n for Boston50;
- LogisticRegression for Boston50;
- MySVM2 with m = 40 for Boston75;
- MySVM2 with m = 200 for Boston75;
- MySVM2 with m = n for Boston75;
- LogisticRegression for Boston75.
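A sketch of one possible wrapper, with loud assumptions: Boston50 and Boston75 are built by thresholding the Boston housing target at its 50th and 75th percentiles (as in earlier homeworks); load_boston requires an older sklearn (it was removed in sklearn 1.2); and my_cross_val and MySVM2 are assumed importable from your own modules under the names shown.

```python
import numpy as np
from sklearn.datasets import load_boston            # removed in sklearn >= 1.2
from sklearn.linear_model import LogisticRegression
from my_cross_val import my_cross_val               # your HW1 code (assumed module name)
from MySVM2 import MySVM2                           # assumed module name

def q3():
    X, t = load_boston(return_X_y=True)
    # Assumption: Boston50/Boston75 threshold the target at the 50th/75th percentile.
    y50 = np.where(t >= np.percentile(t, 50), 1, -1)
    y75 = np.where(t >= np.percentile(t, 75), 1, -1)

    d = X.shape[1]
    for name, y in [("Boston50", y50), ("Boston75", y75)]:
        for m in [40, 200, "n"]:                    # "n" signals m = n (see footnote 3)
            print(f"MySVM2 with m = {m} for {name}")
            size = 10**6 if m == "n" else m         # sentinel: fit() caps m at n
            my_cross_val(MySVM2(d, size), X, y, 5)  # assumed to accept a model instance
        print(f"LogisticRegression for {name}")
        my_cross_val(LogisticRegression(max_iter=5000), X, y, 5)

if __name__ == "__main__":
    q3()
```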
*For the wrapper code, you need to make a q3.py file, and one should be able to run your code by calling python q3.py in a command-line window.
- Summary of results: For each dataset and each method, report the test set error rates for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (8 tables in total). Each column of the table represents a fold; add two columns at the end to show the overall mean error rate and standard deviation over the k folds. For example:
Error rates for MySVM2 with m = 40 for Boston50:

| Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean | SD |
|--------|--------|--------|--------|--------|------|----|
| #      | #      | #      | #      | #      | #    | #  |
[1] You should use LogisticRegression from sklearn, as we did for HW3. Note that linear SVMs are implemented in sklearn as LinearSVC, but we will not use it since we have not discussed it in class. We will stick to LogisticRegression for comparisons.
[2] Note that we are denoting the number of points available for training in each fold as n, which is smaller than the size of the full dataset.
[3] The exact value of n may differ mildly across the 5 folds, since 506 is not exactly divisible by 5. Your code for HW3 is already doing these splits, so this aspect should not need additional effort. In the code, m = n needs to be passed as a special option (say, m = $10^6$ or m = 'all') so the code knows it has to use the full dataset for that fold.