Problem 1
(Absolute Loss Regression with Sparsity) The absolute loss regression problem with $\ell_1$ regularization is
$$w_{opt} = \arg\min_{w} \sum_{n=1}^{N} |y_n - w^\top x_n| + \lambda \|w\|_1$$
where $|\cdot|$ denotes the absolute value function, and $\lambda > 0$ is the regularization hyperparameter.
Is the above objective function convex? You don't need to prove this formally; a brief argument based on properties of other functions that are known to be convex/non-convex is fine.
Derive the expression for the (sub)gradient vector for this model.
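If you want to sanity-check the (sub)gradient you derive, a minimal NumPy sketch like the one below can help; the made-up data, the function name `subgradient`, and the convention sign(0) = 0 are my own illustrative choices, not part of the problem.

```python
import numpy as np

def subgradient(w, X, y, lam):
    """One valid subgradient of sum_n |y_n - w^T x_n| + lam * ||w||_1.

    Uses the convention sign(0) = 0, which picks a particular element
    of the subdifferential at non-differentiable points.
    """
    residual = y - X @ w                  # N-dimensional residual vector
    g_loss = -X.T @ np.sign(residual)     # subgradient of the absolute loss term
    g_reg = lam * np.sign(w)              # subgradient of lam * ||w||_1
    return g_loss + g_reg

# Tiny made-up example
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = np.zeros(3)
print(subgradient(w, X, y, lam=0.1))
```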
Problem 2
(Feature Masking as Regularization) Consider learning a linear regression model by minimizing the squared loss function. Suppose we decide to mask out or drop each feature $x_{nd}$ of each input $x_n \in \mathbb{R}^D$, independently, with probability $1-p$ (equivalently, retaining the feature with probability $p$). Masking or dropping out basically means that we set the feature $x_{nd}$ to 0 with probability $1-p$. Essentially, this is equivalent to replacing each input $x_n$ by $\tilde{x}_n = x_n \odot m_n$, where $\odot$ denotes the elementwise product and $m_n$ denotes a $D \times 1$ binary mask vector with $m_{nd} \sim \text{Bernoulli}(p)$ ($m_{nd} = 1$ means the feature $x_{nd}$ was retained; $m_{nd} = 0$ means the feature $x_{nd}$ was masked/zeroed).
Let us now define a new loss function using these masked inputs as follows: $L(w) = \sum_{n=1}^{N} (y_n - w^\top \tilde{x}_n)^2$. Show that minimizing the expected value of this new loss function (where the expectation is needed since the mask vectors $m_n$ are random) is equivalent to minimizing a regularized loss function. Clearly write down the expression of this regularized loss function.
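One way to check whatever regularized loss you arrive at is to estimate the expectation empirically by sampling many masks. Below is a minimal Monte Carlo sketch (NumPy); the data, the value of $p$, and the helper name `masked_loss` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, p = 50, 4, 0.8
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

def masked_loss(w, X, y, mask):
    """Squared loss computed on inputs whose features are zeroed by the mask."""
    X_masked = X * mask                   # elementwise product x_n ⊙ m_n, row by row
    return np.sum((y - X_masked @ w) ** 2)

# Monte Carlo estimate of E_m[ L(w) ] over random Bernoulli(p) masks
num_samples = 20000
losses = []
for _ in range(num_samples):
    mask = rng.binomial(1, p, size=(N, D))
    losses.append(masked_loss(w, X, y, mask))
print("Monte Carlo estimate of expected masked loss:", np.mean(losses))
# Compare this number against the regularized loss expression you derive.
```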
Problem 3
(Multi-output Regression with Reduced Number of Parameters) Consider multi-output regression in which each output $y_n \in \mathbb{R}^M$ is a real-valued vector, rather than a scalar. Assuming a linear model, we can model the outputs as $Y \approx XW$, where $X$ is the $N \times D$ feature matrix and $Y$ is the $N \times M$ response matrix with row $n$ being $y_n^\top$ (note that each column of $Y$ denotes one of the $M$ responses), and $W$ is the $D \times M$ weight matrix, with its $M$ columns containing the $M$ weight vectors $w_1, w_2, \ldots, w_M$. Let's define a squared error loss function $\sum_{n=1}^{N} \sum_{m=1}^{M} (y_{nm} - w_m^\top x_n)^2$, which is just the usual squared error but summed over all the $M$ outputs.
Firstly, verify that this loss can also be written in the more compact notation $\text{TRACE}[(Y - XW)^\top (Y - XW)]$.
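A quick numerical check of this identity with random matrices (purely illustrative, not a substitute for the verification asked above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 20, 6, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
Y = rng.normal(size=(N, M))

# Squared error summed over all N examples and all M outputs
loss_sum = np.sum((Y - X @ W) ** 2)

# Compact trace form
E = Y - X @ W
loss_trace = np.trace(E.T @ E)

print(np.isclose(loss_sum, loss_trace))  # True
```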
Further, we will assume that the weight matrix $W$ can be written as a product of two matrices, i.e., $W = BS$, where $B$ is $D \times K$ and $S$ is $K \times M$ (assume $K < \min\{D, M\}$). Note that there is a benefit to modeling $W$ this way, since now we need to learn only $K(D + M)$ parameters as opposed to $DM$ parameters and, if $K$ is small, this can significantly reduce the number of parameters (for example, with $D = 100$, $M = 50$, and $K = 5$, that is $750$ parameters instead of $5000$); in fact, reducing the effective number of parameters to be learned is another way of regularizing a machine learning model. Note (you can verify) that in this formulation, each $w_m$ can be written as a linear combination of the $K$ columns of $B$.
With the proposed representation of $W$, the new objective will be $\text{TRACE}[(Y - XBS)^\top (Y - XBS)]$ and you need to learn both $B$ and $S$ by solving the following problem:
$$\{\hat{B}, \hat{S}\} = \arg\min_{B, S} \text{TRACE}[(Y - XBS)^\top (Y - XBS)]$$
We will ignore regularization on $B$ and $S$ for brevity/simplicity.
Derive an alternating optimization (ALT-OPT) algorithm to learn $B$ and $S$, clearly writing down the expressions for the updates of $B$ and $S$. Are both subproblems (solving for $B$ and solving for $S$) equally easy/difficult in this ALT-OPT algorithm? If yes, why? If no, why not?
Note: Since $B$ and $S$ are matrices, if you want, please feel free to use results for matrix derivatives (the results you will need can be found in Sec. 2.5 of the Matrix Cookbook). However, the problem can be solved even without matrix derivative results, with some rearrangement of terms and using vector derivatives.
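Once you have the two updates, the overall ALT-OPT loop can be sketched as below (NumPy). The closed forms used in each step are the normal-equation solutions of the corresponding subproblem under the assumption that $X^\top X$ and $SS^\top$ are invertible; the initialization, iteration count, and synthetic data are arbitrary choices of mine, so treat this as a sketch to compare against your own derivation, not the official solution.

```python
import numpy as np

def alt_opt(X, Y, K, num_iters=100):
    """Alternating minimization of TRACE[(Y - X B S)^T (Y - X B S)].

    Each step solves one subproblem in closed form while holding the
    other factor fixed, assuming X^T X and S S^T are invertible.
    """
    N, D = X.shape
    M = Y.shape[1]
    rng = np.random.default_rng(0)
    B = rng.normal(size=(D, K))            # arbitrary initialization

    for _ in range(num_iters):
        # S-step: with B fixed, this is ordinary multi-output least squares
        # with design matrix (X B).
        XB = X @ B
        S = np.linalg.solve(XB.T @ XB, XB.T @ Y)

        # B-step: with S fixed, setting the gradient w.r.t. B to zero gives
        # (X^T X) B (S S^T) = X^T Y S^T.
        B = np.linalg.solve(X.T @ X, X.T @ Y @ S.T) @ np.linalg.inv(S @ S.T)

    return B, S

# Tiny synthetic example
rng = np.random.default_rng(1)
N, D, M, K = 100, 8, 5, 2
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, K)) @ rng.normal(size=(K, M))
Y = X @ W_true + 0.01 * rng.normal(size=(N, M))
B, S = alt_opt(X, Y, K)
print("Final objective:", np.trace((Y - X @ B @ S).T @ (Y - X @ B @ S)))
```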
Problem 4
(Ridge Regression using Newton's Method) Consider the ridge regression problem:
$$\hat{w} = \arg\min_{w} \frac{1}{2}\left[(y - Xw)^\top (y - Xw) + \lambda\, w^\top w\right]$$
where $X$ is the $N \times D$ feature matrix and $y$ is the $N \times 1$ vector of labels of the $N$ training examples. Note that the factor of $\frac{1}{2}$ has been used in the above expression just for convenience of the derivations required for this problem and does not change the solution to the problem.
Derive the Newton's method update equation for each iteration. For this model, how many iterations will Newton's method take to converge?
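For intuition, here is a minimal sketch of Newton's method applied to this quadratic objective (NumPy); the data, the value of $\lambda$, and the zero initialization are made-up illustrative choices, and the gradient/Hessian in the comments follow from the $\frac{1}{2}$-scaled objective above.

```python
import numpy as np

def newton_ridge(X, y, lam, num_iters=5):
    """Newton's method on 0.5 * [(y - Xw)^T (y - Xw) + lam * w^T w].

    gradient:  -X^T (y - X w) + lam * w
    Hessian:    X^T X + lam * I   (constant, since the objective is quadratic)
    """
    D = X.shape[1]
    w = np.zeros(D)                         # arbitrary starting point
    H = X.T @ X + lam * np.eye(D)           # Hessian does not depend on w
    for t in range(num_iters):
        g = -X.T @ (y - X @ w) + lam * w
        w = w - np.linalg.solve(H, g)       # Newton step
        print(f"iteration {t + 1}: ||gradient|| = {np.linalg.norm(g):.3e}")
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
w = newton_ridge(X, y, lam=0.5)
# Compare with the closed-form ridge solution to see how quickly they agree.
print(np.allclose(w, np.linalg.solve(X.T @ X + 0.5 * np.eye(4), X.T @ y)))
```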
Problem 5
(Dice Roll) You have a six-faced die which you roll $N$ times, recording the number of times each of its six faces is observed. Suppose these counts are $N_1, N_2, \ldots, N_6$, respectively. Assume the probability of a random roll of the die showing the $k$-th face ($k = 1, 2, \ldots, 6$) to be equal to $\pi_k \in (0, 1)$.
Assuming an appropriate conjugate prior for the probability vector $\pi = [\pi_1, \pi_2, \ldots, \pi_6]$, derive its MAP estimate. In which situation(s) would you expect the MAP solution to be better than the MLE solution?
Also derive the full posterior distribution over $\pi$ using the same prior that you used for the MAP estimate. Given this posterior, can you get the MLE and MAP estimates without solving the MLE and MAP optimization problems? If yes, how? If no, why not?
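As a numerical companion to the derivation, the sketch below computes the MLE and a Dirichlet-prior MAP estimate from a vector of face counts; the counts, the hyperparameters $\alpha_k$, and the function name `dice_estimates` are assumptions for illustration, and the MAP formula used assumes every $N_k + \alpha_k > 1$.

```python
import numpy as np

def dice_estimates(counts, alpha):
    """MLE and MAP estimates of the face probabilities of a die.

    counts: observed counts N_1, ..., N_6
    alpha:  Dirichlet prior hyperparameters alpha_1, ..., alpha_6
    The MAP formula assumes each N_k + alpha_k > 1 so the posterior mode
    lies in the interior of the simplex.
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    mle = counts / counts.sum()
    map_est = (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(counts))
    return mle, map_est

# Made-up counts from N = 30 rolls and a symmetric Dirichlet(2, ..., 2) prior
counts = [8, 3, 5, 6, 4, 4]
alpha = [2, 2, 2, 2, 2, 2]
mle, map_est = dice_estimates(counts, alpha)
print("MLE:", mle)
print("MAP:", map_est)
# With few rolls, the MAP estimate is pulled toward the prior; that shrinkage
# is where it can be preferable to the MLE.
```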
