[Solved] CS771 - Homework 1


Machine Learning

Instructions:

  • Only electronic submissions will be accepted. Your main PDF writeup must be typeset in LaTeX (please also refer to the Additional Instructions below).
  • The PDF writeup containing your solution has to be submitted via Gradescope (https://www.gradescope.com/).
  • We have created your Gradescope account (you should have received the notification). Please use your IITK CC ID (not any other email ID) to log in. Use the Forgot Password option to set your password.

Additional Instructions

  • We have provided a LaTeX template (.tex) file to help you typeset your PDF writeup. There is also a style file ml.sty that contains shortcuts for many useful LaTeX commands, such as boldfaced/calligraphic fonts for letters, various mathematical/Greek symbols, etc. Use of these shortcuts is recommended (but not necessary).
  • Your answer to every question should begin on a new page. The provided template is designed to do this. However, if it fails to do so, use the \clearpage command in LaTeX before starting the answer to a new question to enforce this.
  • While submitting your assignment on the Gradescope website, you will have to specify on which page(s) question 1 is answered, on which page(s) question 2 is answered, etc. To do this properly, first ensure that the answer to each question starts on a different page.
  • Be careful to flush all your floats (figures, tables) corresponding to question n before starting the answer to question n + 1; otherwise, while grading, we might miss important parts of your answers.
  • Your solutions must appear in proper order in the PDF file, i.e., the solution to question n must be complete in the PDF file (including all plots, tables, proofs, etc.) before you present the solution to question n + 1.

Problem 1 (15 marks)

(Absolute Loss Regression with Sparsity) The absolute loss regression problem with $\ell_1$ regularization is

$$\mathbf{w}_{\text{opt}} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} \left|y_n - \mathbf{w}^\top \mathbf{x}_n\right| + \lambda \|\mathbf{w}\|_1$$

where $\|\mathbf{w}\|_1 = \sum_{d=1}^{D} |w_d|$, $|\cdot|$ is the absolute value function, and $\lambda > 0$ is the regularization hyperparameter.

Is the above objective function convex? You don't need to prove this formally; a brief argument based on properties of other functions that are known to be convex/non-convex will suffice.

Derive the expression for the (sub)gradient vector for this model.
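If helpful, here is a small Python/NumPy sketch of this setup (the data, variable names, and hyperparameter value are all made up for illustration). It evaluates the $\ell_1$-regularized absolute-loss objective and one valid subgradient at a given $\mathbf{w}$, taking $\text{sign}(0) = 0$ at the non-differentiable points; you could use it to numerically sanity-check the expression you derive.

```python
import numpy as np

def objective(w, X, y, lam):
    """Absolute-loss objective with L1 regularization:
    sum_n |y_n - w^T x_n| + lam * ||w||_1."""
    residuals = y - X @ w                        # shape (N,)
    return np.sum(np.abs(residuals)) + lam * np.sum(np.abs(w))

def subgradient(w, X, y, lam):
    """One valid subgradient of the objective at w (sign(0) taken as 0)."""
    residuals = y - X @ w
    return -X.T @ np.sign(residuals) + lam * np.sign(w)

# Tiny synthetic example, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
w = rng.normal(size=5)
print(objective(w, X, y, lam=0.1))
print(subgradient(w, X, y, lam=0.1))
```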

Problem 2 (15 marks)

(Feature Masking as Regularization) Consider learning a linear regression model by minimizing the squared loss function $\sum_{n=1}^{N}(y_n - \mathbf{w}^\top\mathbf{x}_n)^2$. Suppose we decide to mask out or drop each feature $x_{nd}$ of each input $\mathbf{x}_n \in \mathbb{R}^D$, independently, with probability $1-p$ (equivalently, retaining the feature with probability $p$). Masking or dropping out basically means that we set the feature $x_{nd}$ to 0 with probability $1-p$. Essentially, this is equivalent to replacing each input $\mathbf{x}_n$ by $\tilde{\mathbf{x}}_n = \mathbf{x}_n \odot \mathbf{m}_n$, where $\odot$ denotes the elementwise product and $\mathbf{m}_n$ denotes the $D \times 1$ binary mask vector with $m_{nd} \sim \text{Bernoulli}(p)$ ($m_{nd} = 1$ means the feature $x_{nd}$ was retained; $m_{nd} = 0$ means the feature $x_{nd}$ was masked/zeroed).

Let us now define a new loss function using these masked inputs as follows: $\sum_{n=1}^{N}(y_n - \mathbf{w}^\top\tilde{\mathbf{x}}_n)^2$. Show that minimizing the expected value of this new loss function (the expectation is needed since the mask vectors $\mathbf{m}_n$ are random) is equivalent to minimizing a regularized loss function. Clearly write down the expression for this regularized loss function.
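As a sketch (NumPy, with made-up data and arbitrary variable names), the expectation above can also be estimated by Monte Carlo, which gives a quick way to check whatever closed-form regularized loss you derive: the estimate below should approach your closed-form expression as the number of mask samples grows.

```python
import numpy as np

def expected_masked_loss_mc(w, X, y, p, num_samples=2000, seed=0):
    """Monte-Carlo estimate of E_m[ sum_n (y_n - w^T (x_n * m_n))^2 ]
    with independent Bernoulli(p) masks on each feature of each input."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    total = 0.0
    for _ in range(num_samples):
        M = rng.binomial(1, p, size=(N, D))      # one mask vector per input
        residuals = y - (X * M) @ w
        total += np.sum(residuals ** 2)
    return total / num_samples

# Made-up data, only to exercise the estimator.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
w = rng.normal(size=4)
print(expected_masked_loss_mc(w, X, y, p=0.8))
```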

Problem 3 (40 marks)

(Multi-output Regression with Reduced Number of Parameters) Consider multi-output regression in which each output $\mathbf{y}_n \in \mathbb{R}^M$ is a real-valued vector, rather than a scalar. Assuming a linear model, we can model the outputs as $\mathbf{Y} \approx \mathbf{X}\mathbf{W}$, where $\mathbf{X}$ is the $N \times D$ feature matrix and $\mathbf{Y}$ is the $N \times M$ response matrix with row $n$ being $\mathbf{y}_n^\top$ (note that each column of $\mathbf{Y}$ denotes one of the $M$ responses), and $\mathbf{W}$ is the $D \times M$ weight matrix, with its $M$ columns containing the $M$ weight vectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M$. Let's define a squared error loss function $\sum_{n=1}^{N}\sum_{m=1}^{M}(y_{nm} - \mathbf{w}_m^\top\mathbf{x}_n)^2$, which is just the usual squared error but summed over all the $M$ outputs. Firstly, verify that this can also be written in the more compact notation $\text{TRACE}[(\mathbf{Y} - \mathbf{X}\mathbf{W})^\top(\mathbf{Y} - \mathbf{X}\mathbf{W})]$.
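A minimal numerical check of this identity (NumPy, with made-up shapes) looks like the following; it is only a sanity check, not a substitute for the verification asked above.

```python
import numpy as np

# Made-up shapes, purely to check the identity numerically.
rng = np.random.default_rng(2)
N, D, M = 30, 6, 4
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
Y = rng.normal(size=(N, M))

# Double sum over n and m of (y_nm - w_m^T x_n)^2 ...
double_sum = sum((Y[n, m] - W[:, m] @ X[n]) ** 2
                 for n in range(N) for m in range(M))

# ... versus the compact trace form.
E = Y - X @ W
trace_form = np.trace(E.T @ E)

print(np.isclose(double_sum, trace_form))   # expected: True
```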

Further, we will assume that the weight matrix $\mathbf{W}$ can be written as a product of two matrices, i.e., $\mathbf{W} = \mathbf{B}\mathbf{S}$ where $\mathbf{B}$ is $D \times K$ and $\mathbf{S}$ is $K \times M$ (assume $K < \min\{D, M\}$). Note that there is a benefit of modeling $\mathbf{W}$ this way, since now we need to learn only $K(D + M)$ parameters as opposed to $DM$ parameters and, if $K$ is small, this can significantly reduce the number of parameters (in fact, reducing the effective number of parameters to be learned is another way of regularizing a machine learning model). Note (you can verify) that in this formulation, each $\mathbf{w}_m$ can be written as a linear combination of the $K$ columns of $\mathbf{B}$.

With the proposed representation of $\mathbf{W}$, the new objective will be $\text{TRACE}[(\mathbf{Y} - \mathbf{X}\mathbf{B}\mathbf{S})^\top(\mathbf{Y} - \mathbf{X}\mathbf{B}\mathbf{S})]$ and you need to learn both $\mathbf{B}$ and $\mathbf{S}$ by solving the following problem:

$$\{\hat{\mathbf{B}}, \hat{\mathbf{S}}\} = \arg\min_{\mathbf{B},\mathbf{S}} \text{TRACE}[(\mathbf{Y} - \mathbf{X}\mathbf{B}\mathbf{S})^\top(\mathbf{Y} - \mathbf{X}\mathbf{B}\mathbf{S})]$$

We will ignore regularization on $\mathbf{B}$ and $\mathbf{S}$ for brevity/simplicity.

Derive an alternating optimization (ALT-OPT) algorithm to learn $\mathbf{B}$ and $\mathbf{S}$, clearly writing down the expressions for the updates of $\mathbf{B}$ and $\mathbf{S}$. Are both subproblems (solving for $\mathbf{B}$ and solving for $\mathbf{S}$) equally easy/difficult in this ALT-OPT algorithm? If yes, why? If no, why not?

Note: Since $\mathbf{B}$ and $\mathbf{S}$ are matrices, if you want, please feel free to use results for matrix derivatives (the results you will need can be found in Sec. 2.5 of the Matrix Cookbook). However, the problem can be solved even without matrix derivative results, with some rearrangement of terms and using vector derivatives.
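The skeleton below (NumPy; made-up low-rank data; the function name and variables are arbitrary) shows the generic structure of such an alternating scheme, solving each subproblem numerically with np.linalg.lstsq rather than with the closed-form updates you are asked to derive. It can serve as a reference to compare your derived updates against.

```python
import numpy as np

def altopt(X, Y, K, num_iters=50, seed=0):
    """Generic alternating least-squares for Y ~ X B S, solving each
    subproblem numerically (no closed-form updates assumed)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    M = Y.shape[1]
    B = rng.normal(size=(D, K))
    for _ in range(num_iters):
        # Fix B, solve for S: multi-output least squares with design matrix XB.
        S = np.linalg.lstsq(X @ B, Y, rcond=None)[0]              # K x M
        # Fix S, solve for B: vec(XBS) = (S^T kron X) vec(B), column-major vec.
        A = np.kron(S.T, X)                                       # (N*M) x (D*K)
        vecB = np.linalg.lstsq(A, Y.reshape(-1, order="F"), rcond=None)[0]
        B = vecB.reshape(D, K, order="F")
    return B, S

# Made-up low-rank data, just to exercise the loop.
rng = np.random.default_rng(3)
N, D, M, K = 80, 10, 6, 3
X = rng.normal(size=(N, D))
Y = X @ rng.normal(size=(D, K)) @ rng.normal(size=(K, M))
B, S = altopt(X, Y, K)
print(np.linalg.norm(Y - X @ B @ S))   # should shrink toward zero
```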

Problem 4 (10 marks)

(Ridge Regression using Newton's Method) Consider the ridge regression problem:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\left(y_n - \mathbf{w}^\top\mathbf{x}_n\right)^2 + \frac{\lambda}{2}\mathbf{w}^\top\mathbf{w} = \arg\min_{\mathbf{w}} \frac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}$$

where $\mathbf{X}$ is the $N \times D$ feature matrix and $\mathbf{y}$ is the $N \times 1$ vector of labels of the $N$ training examples. Note that the factor of $\frac{1}{2}$ has been used in the above expression just for convenience of the derivations required for this problem and does not change the solution to the problem.

Derive the Newton's method update equations for each iteration. For this model, how many iterations would Newton's method take to converge?
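For reference, here is a small NumPy sketch (made-up data, arbitrary names) that runs Newton's method on this quadratic objective, using its standard gradient $\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) + \lambda\mathbf{w}$ and Hessian $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$, and prints the distance of each iterate from the closed-form ridge solution. It may help you check the update equation and convergence behaviour you derive.

```python
import numpy as np

def newton_ridge(X, y, lam, num_iters=5):
    """Newton's method on 0.5*||y - Xw||^2 + 0.5*lam*||w||^2,
    using gradient X^T(Xw - y) + lam*w and Hessian X^T X + lam*I."""
    D = X.shape[1]
    w = np.zeros(D)
    H = X.T @ X + lam * np.eye(D)              # Hessian (constant in w)
    w_ridge = np.linalg.solve(H, X.T @ y)      # closed-form solution, for comparison
    for t in range(num_iters):
        g = X.T @ (X @ w - y) + lam * w        # gradient at the current iterate
        w = w - np.linalg.solve(H, g)          # Newton update
        print(t, np.linalg.norm(w - w_ridge))  # distance to the exact solution
    return w

# Made-up data, just to watch the iterates.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
y = rng.normal(size=60)
newton_ridge(X, y, lam=0.5)
```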

Problem 5 (20 marks)

(Dice Roll) You have a six-faced die which you roll $N$ times and record the number of times each of its six faces is observed. Suppose these counts are $N_1, N_2, \ldots, N_6$, respectively. Assume the probability of a random roll of the die showing the $k$-th face ($k = 1, 2, \ldots, 6$) to be equal to $\pi_k \in (0, 1)$.

Assuming an appropriate conjugate prior for the probability vector $\boldsymbol{\pi} = [\pi_1, \pi_2, \ldots, \pi_6]$, derive its MAP estimate. In which situation(s) would you expect the MAP solution to be better than the MLE solution?

Also derive the full posterior distribution over $\boldsymbol{\pi}$ using the same prior that you used for the MAP estimate. Given this posterior, can you get the MLE and MAP estimates without solving the MLE and MAP optimization problems? If yes, how? If no, why not?
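As a sketch of the setup (the counts and hyperparameters below are made up, and the variable names are arbitrary): assuming, as one natural choice, a Dirichlet prior (which is conjugate to this count-based likelihood), conjugacy gives a Dirichlet posterior whose parameters are the prior parameters shifted by the observed counts. The NumPy snippet draws samples from such a posterior and compares its mean with the empirical frequencies (the MLE); it does not carry out the MAP derivation asked above.

```python
import numpy as np

# Counts from N rolls of the die (made-up numbers for illustration).
counts = np.array([12, 8, 10, 7, 9, 14])
alpha = np.full(6, 2.0)     # assumed Dirichlet prior hyperparameters

# Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts) posterior.
posterior_params = alpha + counts

rng = np.random.default_rng(5)
samples = rng.dirichlet(posterior_params, size=5000)

mle = counts / counts.sum()                               # empirical frequencies
posterior_mean = posterior_params / posterior_params.sum()
print("MLE:              ", np.round(mle, 3))
print("Posterior mean:   ", np.round(posterior_mean, 3))
print("MC posterior mean:", np.round(samples.mean(axis=0), 3))
```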
