[Solved] CS771A Homework2

$25

File Name: CS771A_Homework2.zip
File Size: 150.72 KB


Problem 1

(Second-Order Optimization for Logistic Regression) Show that, for the logistic regression model (assuming each label $y_n \in \{0,1\}$, and no regularization) with loss function

$L(\mathbf{w}) = -\sum_{n=1}^{N} \left( y_n \mathbf{w}^\top \mathbf{x}_n - \log\left(1 + \exp(\mathbf{w}^\top \mathbf{x}_n)\right) \right),$

iteration $t$ of a second-order optimization based update $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (\mathbf{H}^{(t)})^{-1}\mathbf{g}^{(t)}$, where $\mathbf{H}$ denotes the Hessian and $\mathbf{g}$ denotes the gradient, reduces to solving an importance-weighted regression problem of the form $\mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} \gamma_n \left(\hat{y}_n - \mathbf{w}^\top \mathbf{x}_n\right)^2$, where $\gamma_n$ denotes the importance of the $n$-th training example and $\hat{y}_n$ denotes a modified real-valued label. Also, clearly write down the expression for both, and provide a brief justification as to why the expression of $\gamma_n$ makes intuitive sense here.
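For intuition only, a minimal numerical sketch of one such second-order (Newton) step for this loss is given below; the function name newton_step, the dense Hessian solve, and the NumPy-based setup are illustrative assumptions, not the derivation being asked for.

```python
import numpy as np

def newton_step(w, X, y):
    """One Newton update w <- w - H^{-1} g for the logistic loss (labels y_n in {0,1}).

    Illustrative sketch: X is the N x D design matrix, y the length-N label vector.
    """
    mu = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities sigma(w^T x_n)
    g = X.T @ (mu - y)                   # gradient of the negative log-likelihood
    S = np.diag(mu * (1.0 - mu))         # per-example curvature weights mu_n (1 - mu_n)
    H = X.T @ S @ X                      # Hessian
    return w - np.linalg.solve(H, g)     # equivalent to solving a weighted least-squares problem
```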

Problem 2

(Perceptron with Kernels) We have seen that, due to the form of the Perceptron update $\mathbf{w} = \mathbf{w} + y_n\mathbf{x}_n$ (ignore the bias $b$), the weight vector learned by the Perceptron can be written as $\mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n$, where $\alpha_n$ is the number of times the Perceptron makes a mistake on example $n$. Suppose our goal is to make the Perceptron learn nonlinear boundaries, using a kernel $k$ with feature map $\phi$. Modify the standard Perceptron algorithm to do this. In particular, for this kernelized variant of the Perceptron algorithm, (1) give the initialization, (2) give the mistake condition, and (3) give the update equation.
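For reference only (the question asks you to state these pieces yourself), below is a rough sketch of what a mistake-count-based kernelized Perceptron could look like; the function name, the fixed number of epochs, and the precomputed kernel matrix are illustrative assumptions.

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    """Sketch of a kernelized Perceptron that stores per-example mistake counts alpha_n
    instead of an explicit weight vector w."""
    N = len(y)
    alpha = np.zeros(N)                                   # initialization: no mistakes yet
    K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])  # Gram matrix
    for _ in range(epochs):
        for n in range(N):
            score = np.sum(alpha * y * K[:, n])           # implicit <w, phi(x_n)>
            if y[n] * score <= 0:                         # mistake condition
                alpha[n] += 1                             # update: alpha_n <- alpha_n + 1
    return alpha
```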

Problem 3

(SVM with Unequal Class Importance) Sometimes it costs us a lot more to classify negative points as positive than to classify positive points as negative (for instance, if we are predicting whether someone has cancer, we would rather err on the side of caution (predicting yes when the answer is no) than vice versa). One way of expressing this in the support vector machine model is to assign different costs to the two kinds of misclassification. The primal formulation of this is:

$\min_{\mathbf{w},b,\boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C_{+1}\sum_{n:\, y_n=+1}\xi_n + C_{-1}\sum_{n:\, y_n=-1}\xi_n$

subject to $y_n(\mathbf{w}^\top \mathbf{x}_n + b) \geq 1 - \xi_n$ and $\xi_n \geq 0$, $\forall n$.

The only difference is that instead of one cost parameter $C$, there are two, $C_{+1}$ and $C_{-1}$, representing the costs of misclassifying positive examples and misclassifying negative examples, respectively.

Write down the Lagrangian of this modified SVM problem. Take derivatives w.r.t. the primal variables and construct the dual, namely, the maximization problem that depends only on the dual variables $\alpha_n$, rather than the primal variables. In your final PDF write-up, you need not give each and every step of these derivations (e.g., standard steps of substituting and eliminating some variables), but do write down the key steps. Explain (intuitively) how this differs from the standard SVM dual problem; in particular, how the roles of $C_{+1}$ and $C_{-1}$ in this dual differ from that of the single $C$ in the standard dual.
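In practice, unequal misclassification costs can also be explored with off-the-shelf solvers; for example, scikit-learn's SVC accepts a class_weight argument that rescales C separately for each class. The sketch below is only for experimenting with this idea (the weights 1 and 10 are arbitrary assumptions), not a substitute for the requested derivation.

```python
from sklearn.svm import SVC

# class_weight multiplies C per class: here C_{+1} = 1.0 * C and C_{-1} = 10.0 * C,
# so misclassifying a negative example is penalized ten times more heavily
# (the specific weights are arbitrary, chosen only for illustration).
clf = SVC(kernel="linear", C=1.0, class_weight={+1: 1.0, -1: 10.0})
# clf.fit(X_train, y_train)   # X_train, y_train are assumed to be your labeled data
```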

Problem 4

(SGD for K-means Objective) Recall the K-means objective function: $L(\boldsymbol{\mu}, \mathbf{Z}) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\,\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$.

As we have seen, the K-means algorithm minimizes this objective by taking a greedy iterative approach of assigning each point to its closest center (finding the $z_{nk}$'s) and updating the cluster means $\boldsymbol{\mu}_k$. The standard K-means algorithm is a batch algorithm and uses all the data in every iteration. However, it can be made online by taking a random example $\mathbf{x}_n$ at a time, and then (1) assigning $\mathbf{x}_n$ greedily to the best cluster, and (2) updating the cluster means using SGD on the objective $L$. Assuming you have initialized the cluster means randomly and are reading one data point $\mathbf{x}_n$ at a time,

  • How would you solve step 1?
  • What will be the SGD-based cluster mean update equations for step 2? Intuitively, why does the update equation make sense?
  • Note that the SGD update requires a step size. For your derived SGD update, suggest a good choice of the step size (and mention why you think it is a good choice).
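Purely to make the two online steps concrete (and not as the requested derivation), a minimal sketch of processing a single point is given below; the count-based step size shown is one common heuristic and is an assumption of this sketch, not necessarily the answer expected in the write-up.

```python
import numpy as np

def online_kmeans_step(x_n, mu, counts):
    """Process one data point x_n: greedy assignment, then one SGD step on that cluster mean.

    mu is a K x D array of current cluster means; counts tracks how many points
    each cluster has absorbed so far (used here to set the step size).
    """
    k = int(np.argmin(np.linalg.norm(mu - x_n, axis=1)))  # step 1: closest cluster mean
    counts[k] += 1
    eta = 1.0 / counts[k]                                  # step size (one common choice)
    mu[k] = mu[k] + eta * (x_n - mu[k])                    # step 2: move the mean towards x_n
    return k
```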

Problem 5

(Kernel K-means) Assuming a kernel $k$ with an infinite-dimensional feature map $\phi$ (e.g., an RBF kernel), we can neither store the kernel-induced feature map representation of the data points nor store the cluster means in the kernel-induced feature space. How can we still implement the kernel K-means algorithm in practice? Justify your answer by sketching the algorithm, showing all the steps (initialization, cluster assignment, mean computation) and clearly giving the mathematical operations in each. In particular, what is the difference between how the cluster means would need to be stored in kernel K-means versus how they are stored in standard K-means? Finally, assuming each input to be $D$-dimensional in the original feature space, and $N$ to be the number of inputs, how does kernel K-means compare with standard K-means in terms of the cost of the input-to-cluster-mean distance calculation (please answer this using big-O notation)?
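As a hint of what "never storing the means explicitly" can look like, the squared distance from $\phi(\mathbf{x}_n)$ to a cluster mean can be expanded purely in terms of kernel evaluations; the sketch below assumes a precomputed $N \times N$ kernel matrix K and a vector z of current hard assignments, both of which are illustrative choices rather than part of the question.

```python
import numpy as np

def kernel_distances_to_cluster(K, z, k):
    """Squared feature-space distances ||phi(x_n) - mu_k||^2 for all n, using only the kernel matrix.

    Expansion used:
      ||phi(x_n) - mu_k||^2 = K[n, n]
                              - (2 / |C_k|) * sum_{m in C_k} K[n, m]
                              + (1 / |C_k|^2) * sum_{m, m' in C_k} K[m, m']
    where C_k is the set of points currently assigned to cluster k.
    """
    members = np.where(z == k)[0]
    n_k = len(members)
    term1 = np.diag(K)
    term2 = K[:, members].sum(axis=1) / n_k
    term3 = K[np.ix_(members, members)].sum() / (n_k ** 2)
    return term1 - 2.0 * term2 + term3
```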

Problem 6 (Programming Problem)

Part 1: You are provided a dataset in the file binclass.txt. In this file, the first two numbers on each line denote the two features of the input $\mathbf{x}_n$, and the third number is the binary label $y_n \in \{-1,+1\}$.

Implement a generative classification model for this data, assuming Gaussian class-conditional distributions of the positive and negative class examples to be $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_+, \sigma_+^2 \mathbf{I}_2)$ and $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_-, \sigma_-^2 \mathbf{I}_2)$, respectively. Note that here $\mathbf{I}_2$ denotes a $2 \times 2$ identity matrix. Assume the class-marginal to be $p(y_n = 1) = 0.5$, and use MLE estimates for the unknown parameters. Your implementation need not be specific to two-dimensional inputs; it should be almost equally easy to implement it such that it works for any number of features (but it is okay if your implementation is specific to two-dimensional inputs only).
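A minimal sketch of the MLE step and the resulting prediction rule for this model is shown below; the function names and the spherical-variance formula per class are assumptions of the sketch (the equal class priors cancel out of the comparison).

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """MLE for class-conditional Gaussians N(mu_c, sigma_c^2 I) with equal class priors."""
    params = {}
    for c in (+1, -1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                         # MLE of the class mean
        var = ((Xc - mu) ** 2).sum() / Xc.size       # MLE of the spherical variance sigma_c^2
        params[c] = (mu, var)
    return params

def predict_gaussian_generative(params, X):
    """Assign each input to the class with the higher class-conditional log-density."""
    D = X.shape[1]
    scores = {}
    for c, (mu, var) in params.items():
        sq_dist = ((X - mu) ** 2).sum(axis=1)
        scores[c] = -0.5 * D * np.log(var) - sq_dist / (2.0 * var)
    return np.where(scores[+1] >= scores[-1], +1, -1)
```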

On a two-dimensional plane, plot the examples from both the classes (use red color for positives and blue color for negatives) and the learned decision boundary for this model. Note that we are not providing any separate test data. Your task is only to learn the decision boundary using the provided training data and visualize it.
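One generic way to draw such a boundary for two-dimensional data is to evaluate the learned classifier on a dense grid and plot the contour where its prediction flips; the sketch below assumes matplotlib and a prediction function like the one sketched above, so it is an illustration rather than required structure for your code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(X, y, predict_fn, title):
    """Scatter the data (red = positive, blue = negative) and overlay the decision boundary."""
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                         np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
    grid = np.c_[xx.ravel(), yy.ravel()]
    zz = predict_fn(grid).reshape(xx.shape)            # predictions in {-1, +1} on the grid
    plt.contour(xx, yy, zz, levels=[0], colors="k")    # boundary where the prediction flips
    plt.scatter(X[y == +1, 0], X[y == +1, 1], c="red", label="positive")
    plt.scatter(X[y == -1, 0], X[y == -1, 1], c="blue", label="negative")
    plt.title(title)
    plt.legend()
    plt.show()
```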

Next, repeat the same exercise but assuming the Gaussian class-conditional distributions of the positive and negative class examples to be $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_+, \sigma^2\mathbf{I}_2)$ and $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_-, \sigma^2\mathbf{I}_2)$, respectively.

Finally, try out an SVM classifier (with a linear kernel) on this data (we've also provided the data in the format libSVM requires) and show the learned decision boundary. For this part, you do not need to implement SVM; there are many good implementations available, such as the one in scikit-learn and the very popular libSVM toolkit. Assume the C hyperparameter of the SVM in these implementations to be 1.
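If you go the scikit-learn route, a minimal sketch might look like the following; the loadtxt call and its comma delimiter are assumptions about the file format, so adjust them to match binclass.txt.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed format: two features and a label per line, comma-separated
# (switch the delimiter if the file is whitespace-separated).
data = np.loadtxt("binclass.txt", delimiter=",")
X, y = data[:, :2], data[:, 2]

svm = SVC(kernel="linear", C=1.0)    # linear kernel, C = 1 as specified
svm.fit(X, y)
# svm.predict (or svm.decision_function) can then be passed to a grid-based
# boundary-plotting routine like the one sketched earlier.
```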

Part 2: Repeat the same experiments as you did for Part 1, but now using a different dataset, binclassv2.txt. Looking at the results of both parts, which of the two models (generative classification with Gaussian class-conditionals, and SVM) do you think works better for each of these datasets, and in general?

Deliverables: Include your plots (use a separate, appropriately labeled plot for each case) and experimental findings in the main write-up PDF. Submit your code in a separate zip file on the provided Dropbox link. Please comment the code so that it is easy to read, and also provide a README that briefly explains how to run it. For the SVM part, you do not have to submit any code, but do include the plots in the PDF (and mention the software used, scikit-learn or libSVM).
