Problem 1
(Absolute Loss Regression with Sparsity) The absolute loss regression problem with $\ell_1$ regularization is
$$w_{opt} = \arg\min_{w} \sum_{n=1}^{N} |y_n - w^\top x_n| + \lambda \|w\|_1$$
where $|\cdot|$ denotes the absolute value function, and $\lambda > 0$ is the regularization hyperparameter.
Is the above objective function convex? You don't need to prove this formally; a brief argument based on properties of other functions that are known to be convex/non-convex is fine.
Derive the expression for the (sub)gradient vector for this model.
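If you want to sanity-check the (sub)gradient you derive, a minimal NumPy sketch like the one below can help; the made-up data, the function name `subgradient`, and the convention sign(0) = 0 are my own illustrative choices, not part of the problem.

```python
import numpy as np

def subgradient(w, X, y, lam):
    """One valid subgradient of sum_n |y_n - w^T x_n| + lam * ||w||_1.

    Uses the convention sign(0) = 0, which picks a particular element
    of the subdifferential at non-differentiable points.
    """
    residual = y - X @ w                  # N-dimensional residual vector
    g_loss = -X.T @ np.sign(residual)     # subgradient of the absolute loss term
    g_reg = lam * np.sign(w)              # subgradient of lam * ||w||_1
    return g_loss + g_reg

# Tiny made-up example
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = np.zeros(3)
print(subgradient(w, X, y, lam=0.1))
```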
Problem 2
(Feature Masking as Regularization) Consider learning a linear regression model by minimizing the squared loss function. Suppose we decide to mask out or drop each feature $x_{nd}$ of each input $x_n \in \mathbb{R}^D$, independently, with probability $1-p$ (equivalently, retaining the feature with probability $p$). Masking or dropping out basically means that we set the feature $x_{nd}$ to 0 with probability $1-p$. Essentially, this is equivalent to replacing each input $x_n$ by $\tilde{x}_n = x_n \odot m_n$, where $\odot$ denotes the elementwise product and $m_n$ denotes a $D \times 1$ binary mask vector with $m_{nd} \sim \text{Bernoulli}(p)$ ($m_{nd} = 1$ means the feature $x_{nd}$ was retained; $m_{nd} = 0$ means the feature $x_{nd}$ was masked/zeroed).
Let us now define a new loss function using these masked inputs as follows: $L(w) = \sum_{n=1}^{N} (y_n - w^\top \tilde{x}_n)^2$. Show that minimizing the expected value of this new loss function (where the expectation is needed since the mask vectors $m_n$ are random) is equivalent to minimizing a regularized loss function. Clearly write down the expression of this regularized loss function.
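One way to check whatever regularized loss you arrive at is to estimate the expectation empirically by sampling many masks. Below is a minimal Monte Carlo sketch (NumPy); the data, the value of $p$, and the helper name `masked_loss` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, p = 50, 4, 0.8
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

def masked_loss(w, X, y, mask):
    """Squared loss computed on inputs whose features are zeroed by the mask."""
    X_masked = X * mask                   # elementwise product x_n ⊙ m_n, row by row
    return np.sum((y - X_masked @ w) ** 2)

# Monte Carlo estimate of E_m[ L(w) ] over random Bernoulli(p) masks
num_samples = 20000
losses = []
for _ in range(num_samples):
    mask = rng.binomial(1, p, size=(N, D))
    losses.append(masked_loss(w, X, y, mask))
print("Monte Carlo estimate of expected masked loss:", np.mean(losses))
# Compare this number against the regularized loss expression you derive.
```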
Problem 3
(Multi-output Regression with Reduced Number of Parameters) Consider multi-output regression in which each output $y_n \in \mathbb{R}^M$ is a real-valued vector, rather than a scalar. Assuming a linear model, we can model the outputs as $Y \approx XW$, where $X$ is the $N \times D$ feature matrix and $Y$ is the $N \times M$ response matrix with row $n$ being $y_n^\top$ (note that each column of $Y$ denotes one of the $M$ responses), and $W$ is the $D \times M$ weight matrix, with its $M$ columns containing the $M$ weight vectors $w_1, w_2, \ldots, w_M$. Let's define a squared error loss function $\sum_{n=1}^{N} \sum_{m=1}^{M} (y_{nm} - w_m^\top x_n)^2$, which is just the usual squared error but summed over all the $M$ outputs.
Firstly, verify that this loss can also be written in the more compact notation $\text{TRACE}[(Y - XW)^\top (Y - XW)]$.
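A quick numerical check of this identity with random matrices (purely illustrative, not a substitute for the verification asked above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 20, 6, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
Y = rng.normal(size=(N, M))

# Squared error summed over all N examples and all M outputs
loss_sum = np.sum((Y - X @ W) ** 2)

# Compact trace form
E = Y - X @ W
loss_trace = np.trace(E.T @ E)

print(np.isclose(loss_sum, loss_trace))  # True
```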
Further, we will assume that the weight matrix $W$ can be written as a product of two matrices, i.e., $W = BS$, where $B$ is $D \times K$ and $S$ is $K \times M$ (assume $K < \min\{D, M\}$). Note that there is a benefit to modeling $W$ this way, since now we need to learn only $K(D + M)$ parameters as opposed to $DM$ parameters and, if $K$ is small, this can significantly reduce the number of parameters (for example, with $D = 100$, $M = 50$, and $K = 5$, that is $750$ parameters instead of $5000$); in fact, reducing the effective number of parameters to be learned is another way of regularizing a machine learning model. Note (you can verify) that in this formulation, each $w_m$ can be written as a linear combination of the $K$ columns of $B$.
With the proposed representation of $W$, the new objective will be $\text{TRACE}[(Y - XBS)^\top (Y - XBS)]$ and you need to learn both $B$ and $S$ by solving the following problem:
$$\{\hat{B}, \hat{S}\} = \arg\min_{B, S} \text{TRACE}[(Y - XBS)^\top (Y - XBS)]$$
We will ignore regularization on $B$ and $S$ for brevity/simplicity.
Derive an alternating optimization (ALT-OPT) algorithm to learn $B$ and $S$, clearly writing down the expressions for the updates of $B$ and $S$. Are both subproblems (solving for $B$ and solving for $S$) equally easy/difficult in this ALT-OPT algorithm? If yes, why? If no, why not?
Note: Since $B$ and $S$ are matrices, if you want, please feel free to use results for matrix derivatives (the results you will need can be found in Sec. 2.5 of the Matrix Cookbook). However, the problem can be solved even without matrix derivative results, with some rearrangement of terms and using vector derivatives.
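Once you have the two updates, the overall ALT-OPT loop can be sketched as below (NumPy). The closed forms used in each step are the normal-equation solutions of the corresponding subproblem under the assumption that $X^\top X$ and $SS^\top$ are invertible; the initialization, iteration count, and synthetic data are arbitrary choices of mine, so treat this as a sketch to compare against your own derivation, not the official solution.

```python
import numpy as np

def alt_opt(X, Y, K, num_iters=100):
    """Alternating minimization of TRACE[(Y - X B S)^T (Y - X B S)].

    Each step solves one subproblem in closed form while holding the
    other factor fixed, assuming X^T X and S S^T are invertible.
    """
    N, D = X.shape
    M = Y.shape[1]
    rng = np.random.default_rng(0)
    B = rng.normal(size=(D, K))            # arbitrary initialization

    for _ in range(num_iters):
        # S-step: with B fixed, this is ordinary multi-output least squares
        # with design matrix (X B).
        XB = X @ B
        S = np.linalg.solve(XB.T @ XB, XB.T @ Y)

        # B-step: with S fixed, setting the gradient w.r.t. B to zero gives
        # (X^T X) B (S S^T) = X^T Y S^T.
        B = np.linalg.solve(X.T @ X, X.T @ Y @ S.T) @ np.linalg.inv(S @ S.T)

    return B, S

# Tiny synthetic example
rng = np.random.default_rng(1)
N, D, M, K = 100, 8, 5, 2
X = rng.normal(size=(N, D))
W_true = rng.normal(size=(D, K)) @ rng.normal(size=(K, M))
Y = X @ W_true + 0.01 * rng.normal(size=(N, M))
B, S = alt_opt(X, Y, K)
print("Final objective:", np.trace((Y - X @ B @ S).T @ (Y - X @ B @ S)))
```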
Problem 4
(Ridge Regression using Newton's Method) Consider the ridge regression problem:
$$\hat{w} = \arg\min_{w} \frac{1}{2}\left[(y - Xw)^\top (y - Xw) + \lambda\, w^\top w\right]$$
where $X$ is the $N \times D$ feature matrix and $y$ is the $N \times 1$ vector of labels of the $N$ training examples. Note that the factor of $\frac{1}{2}$ has been used in the above expression just for convenience of the derivations required for this problem and does not change the solution to the problem.
Derive the Newton's method update equation for each iteration. For this model, how many iterations will Newton's method take to converge?
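For intuition, here is a minimal sketch of Newton's method applied to this quadratic objective (NumPy); the data, the value of $\lambda$, and the zero initialization are made-up illustrative choices, and the gradient/Hessian in the comments follow from the $\frac{1}{2}$-scaled objective above.

```python
import numpy as np

def newton_ridge(X, y, lam, num_iters=5):
    """Newton's method on 0.5 * [(y - Xw)^T (y - Xw) + lam * w^T w].

    gradient:  -X^T (y - X w) + lam * w
    Hessian:    X^T X + lam * I   (constant, since the objective is quadratic)
    """
    D = X.shape[1]
    w = np.zeros(D)                         # arbitrary starting point
    H = X.T @ X + lam * np.eye(D)           # Hessian does not depend on w
    for t in range(num_iters):
        g = -X.T @ (y - X @ w) + lam * w
        w = w - np.linalg.solve(H, g)       # Newton step
        print(f"iteration {t + 1}: ||gradient|| = {np.linalg.norm(g):.3e}")
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
w = newton_ridge(X, y, lam=0.5)
# Compare with the closed-form ridge solution to see how quickly they agree.
print(np.allclose(w, np.linalg.solve(X.T @ X + 0.5 * np.eye(4), X.T @ y)))
```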
Problem 5
(Dice Roll) You have a six-faced die which you roll $N$ times, recording the number of times each of its six faces is observed. Suppose these counts are $N_1, N_2, \ldots, N_6$, respectively. Assume the probability of a random roll of the die showing the $k$-th face ($k = 1, 2, \ldots, 6$) to be equal to $\pi_k \in (0, 1)$.
Assuming an appropriate conjugate prior for the probability vector $\pi = [\pi_1, \pi_2, \ldots, \pi_6]$, derive its MAP estimate. In which situation(s) would you expect the MAP solution to be better than the MLE solution?
Also derive the full posterior distribution over $\pi$ using the same prior that you used for the MAP estimate. Given this posterior, can you get the MLE and MAP estimates without solving the MLE and MAP optimization problems? If yes, how? If no, why not?
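As a numerical companion to the derivation, the sketch below computes the MLE and a Dirichlet-prior MAP estimate from a vector of face counts; the counts, the hyperparameters $\alpha_k$, and the function name `dice_estimates` are assumptions for illustration, and the MAP formula used assumes every $N_k + \alpha_k > 1$.

```python
import numpy as np

def dice_estimates(counts, alpha):
    """MLE and MAP estimates of the face probabilities of a die.

    counts: observed counts N_1, ..., N_6
    alpha:  Dirichlet prior hyperparameters alpha_1, ..., alpha_6
    The MAP formula assumes each N_k + alpha_k > 1 so the posterior mode
    lies in the interior of the simplex.
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    mle = counts / counts.sum()
    map_est = (counts + alpha - 1) / (counts.sum() + alpha.sum() - len(counts))
    return mle, map_est

# Made-up counts from N = 30 rolls and a symmetric Dirichlet(2, ..., 2) prior
counts = [8, 3, 5, 6, 4, 4]
alpha = [2, 2, 2, 2, 2, 2]
mle, map_est = dice_estimates(counts, alpha)
print("MLE:", mle)
print("MAP:", map_est)
# With few rolls, the MAP estimate is pulled toward the prior; that shrinkage
# is where it can be preferable to the MLE.
```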
