Randomized Gauss-Newton Algorithm for Training Neural Networks

In this problem, you will implement a randomized Gauss-Newton (GN) algorithm to train a neural network. We will consider a single-layer architecture and a squared-loss training objective. This special case is also referred to as a nonlinear least squares problem. Consider a set of training data $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ is the $i$th data sample and $y_i \in \{0,1\}$ is the corresponding label. Then, use the nonlinear function $\sigma(w^T x)$, where $\sigma(\cdot)$ denotes the scalar nonlinearity of the single-layer network, to fit the given data labels. In order to estimate $w$, we can minimize the sum of squares as follows:
$$w^* = \arg\min_w \sum_{i=1}^n \left(\sigma(w^T x_i) - y_i\right)^2 = \arg\min_w \|f(w) - y\|_2^2,$$
where $w^*$ denotes the optimal parameter vector, $y$ is the vector containing the labels $y_1, \ldots, y_n$, and the vector output function $f(w)$ is defined as
$$f(w) := \left[\sigma(w^T x_1), \ldots, \sigma(w^T x_n)\right]^T.$$
Unlike ordinary least squares and logistic regression, the above problem is non-convex.
Download the provided MNIST dataset (mnist_all.mat), where you will use only the samples belonging to digit 0 (train0) and digit 1 (train1). Next, we will apply the Gauss-Newton (GN) heuristic to approximately solve the non-convex optimization problem. In the GN algorithm, you first need to linearize the nonlinear vector output function $f(w)$. The first-order expansion of $f(w)$ around the current estimate $w_k$ is given by
$$f(w) \approx f(w_k) + J_k (w - w_k) \qquad (1)$$
where $J_k \in \mathbb{R}^{n \times d}$ is the Jacobian matrix at iteration $k$, whose $ij$th entry is defined as
$$(J_k)_{ij} = \frac{\partial f_i(w)}{\partial w_j}\bigg|_{w = w_k}.$$
Then, the Gauss-Newton algorithm performs the update
$$w_{k+1} = \arg\min_w \left\| f(w_k) + J_k (w - w_k) - y \right\|_2^2. \qquad (2)$$
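For concreteness, here is a minimal sketch of a single Gauss-Newton step in Python, assuming the single-layer model $f_i(w) = \sigma(w^T x_i)$ with a sigmoid nonlinearity (the nonlinearity actually used in the assignment may differ); the linearized sub-problem (2) is solved with `np.linalg.lstsq`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gauss_newton_step(w, X, y):
    """One GN step for the assumed model f_i(w) = sigmoid(x_i^T w)."""
    f = sigmoid(X @ w)                     # f(w_k), shape (n,)
    J = (f * (1.0 - f))[:, None] * X       # Jacobian: J_ij = sigmoid'(x_i^T w) * X_ij
    # Sub-problem (2): minimize ||f + J (w_new - w_k) - y||_2^2 over w_new,
    # i.e. solve the linear least-squares problem J * delta = y - f.
    delta, *_ = np.linalg.lstsq(J, y - f, rcond=None)
    return w + delta
```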
- Find the Gauss-Newton update (2) explicitly for this problem by deriving the Jacobian. Describe how the sub-problems can be solved. What is the computational complexity per iteration for an $n \times d$ data matrix?
One can incorporate a step size $\alpha \in (0,1]$ in the above update as follows:
$$w_{k+1} = (1-\alpha)\, w_k + \alpha\, d_k,$$
where
$$d_k := \arg\min_w \left\| f(w_k) + J_k (w - w_k) - y \right\|_2^2. \qquad (3)$$
This is referred to as the damped Gauss-Newton algorithm.
- Implement the damped Gauss-Newton algorithm on the MNIST dataset to classify digits 0 and 1. Find a reasonable step size for fast convergence by trial and error. Plot the training error, i.e., $\|f(w_k) - y\|_2^2$, as a function of the iteration index.
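A minimal sketch of the damped Gauss-Newton loop is shown below; it again assumes a sigmoid nonlinearity, takes the stacked digit-0/1 samples as `X` with labels `y`, and treats the step size `alpha` as the parameter to be tuned by trial and error:

```python
import numpy as np

def damped_gauss_newton(X, y, alpha=0.5, num_iters=50):
    """Damped GN for the nonlinear least-squares fit (sigmoid nonlinearity assumed)."""
    n, d = X.shape
    w = np.zeros(d)
    errors = []
    for _ in range(num_iters):
        f = 1.0 / (1.0 + np.exp(-(X @ w)))                   # f(w_k)
        errors.append(np.sum((f - y) ** 2))                  # training error ||f(w_k) - y||^2
        J = (f * (1.0 - f))[:, None] * X                     # Jacobian at w_k
        d_k = w + np.linalg.lstsq(J, y - f, rcond=None)[0]   # solves sub-problem (3)
        w = (1.0 - alpha) * w + alpha * d_k                  # damped update
    return w, errors
```

The returned `errors` list can be plotted directly against the iteration index, and `alpha` varied until the curve decreases quickly and stably.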
- You will now apply sampling to reduce the computational complexity of the GN algorithm. Apply uniform sampling to the rows of the least-squares sub-problem (3) at each iteration. Plot the training error, i.e., $\|f(w_k) - y\|_2^2$, as a function of the iteration index. Repeat this procedure for row norm score sampling.
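One possible sketch of the sampled sub-problem is given below; `J`, `f`, `y`, and `w` are assumed to come from the current GN iteration (as in the loop above), `s` is the sample size, and the same function covers both uniform and row norm score sampling:

```python
import numpy as np

def sampled_gn_direction(J, f, y, w, s, scheme="uniform", rng=None):
    """Solve a row-sampled version of the least-squares sub-problem (3)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = J.shape[0]
    if scheme == "uniform":
        p = np.full(n, 1.0 / n)
    else:                                     # row norm scores
        norms = np.sum(J ** 2, axis=1)
        p = norms / norms.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])         # importance-sampling rescaling
    J_s = scale[:, None] * J[idx]
    r_s = scale * (y - f)[idx]
    return w + np.linalg.lstsq(J_s, r_s, rcond=None)[0]
```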
Randomized Kaczmarz Algorithm
In this question, you will implement the Randomized Kaczmarz Algorithm (see Lecture 16 slides) on the YearPredictionMSD dataset, which can be downloaded from https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD. In this file, there are 463,715 training and 51,630 test samples (songs) with 90 audio features, and your goal is to predict the release year of each song using least squares regression. After downloading the file, you can import and partition the data using the following commands in Python:
```python
import numpy as np

file_name = "YearPredictionMSD.txt"
file = open(file_name, "r")
data = []
for line in file.readlines():
    fname = line.rstrip().split(",")
    data.append(fname)
data_arr = np.asarray(data).astype(float)
A = data_arr[:, 1:]   # audio features (the first column is the release year)
b = (data_arr[:, 0] - np.min(data_arr[:, 0])) / (np.max(data_arr[:, 0]) - np.min(data_arr[:, 0]))
```

You may also normalize the data matrix as follows:

```python
meanA = np.mean(A, axis=0)
stdA = np.std(A, axis=0)
A = (A - meanA) / stdA
```
- Plot the training and test cost vs. iteration curves for the Randomized Kaczmarz Algorithm (i.e., the one with the optimal sampling distribution) and the randomized algorithm with the uniform sampling distribution on the same figure. For the uniformly sampled version, you should use an explicit learning rate and tune this parameter by trial and error.
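A minimal sketch of both Kaczmarz variants is given below, assuming `A` and `b` have been loaded and normalized as above; the learning rate `lr` for the uniform variant is a placeholder to be tuned by trial and error:

```python
import numpy as np

def kaczmarz(A, b, num_iters=200000, sampling="row_norm", lr=1.0, rng=None):
    """Randomized Kaczmarz: project onto one randomly chosen row equation per step."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    if sampling == "row_norm":                 # optimal distribution: squared row norms
        p = np.sum(A ** 2, axis=1)
        p = p / p.sum()
    else:                                      # uniform sampling
        p = np.full(n, 1.0 / n)
    x = np.zeros(d)
    costs = []
    for t in range(num_iters):
        i = rng.choice(n, p=p)
        a_i = A[i]
        x = x + lr * (b[i] - a_i @ x) / (a_i @ a_i) * a_i
        if t % 1000 == 0:
            costs.append(np.mean((A @ x - b) ** 2))    # training cost
    return x, costs
```

The test cost curve is obtained by evaluating the same mean squared error on the test samples at the same checkpoints.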
- Note that the analysis of the Randomized Kaczmarz Algorithm assumes that the linear system $Ax = b$ is consistent, which may not be valid in our case. Show that an optimal solution of the least squares problem $\min_x \|Ax - b\|_2^2$ can be found by solving the augmented linear system
$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}.$$
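As a sketch of the reasoning (assuming the augmented system above is the intended one), eliminating the residual variable recovers the normal equations:
$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} \quad\Longleftrightarrow\quad r = b - Ax \ \text{ and } \ A^T (b - Ax) = 0,$$
and the second condition is exactly the normal equations of $\min_x \|Ax - b\|_2^2$, which always admit a solution, so the augmented system is consistent.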
- Since the augmented linear system in part (b) is consistent, we may apply the Randomized Kaczmarz Algorithm. Repeat part (a) using the Randomized Kaczmarz Algorithm on the augmented linear system.
- Repeat part (a) using SGD with diminishing step-sizes (see page 11 of Lecture 16 slides) by tuning the initial step size.
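A possible sketch of the SGD baseline is below; the $\eta_t = \eta_0 / (t+1)$ schedule is only an assumed stand-in for the diminishing step size on the referenced slide, and $\eta_0$ is the initial step size to be tuned:

```python
import numpy as np

def sgd_least_squares(A, b, eta0=0.1, num_iters=200000, rng=None):
    """SGD for min_x ||Ax - b||^2 with a diminishing step size eta0 / (t + 1) (assumed schedule)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    x = np.zeros(d)
    costs = []
    for t in range(num_iters):
        i = rng.integers(n)
        grad = 2.0 * (A[i] @ x - b[i]) * A[i]    # stochastic gradient from one sample
        x = x - eta0 / (t + 1) * grad
        if t % 1000 == 0:
            costs.append(np.mean((A @ x - b) ** 2))
    return x, costs
```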
Randomized Low-Rank Approximation and Randomized SVD
In this problem, you will implement a randomized approach in order to obtain a low-rank approximation of a given data matrix. Assume that you are given a data matrix $A \in \mathbb{R}^{n \times d}$ with $U \Sigma V^T$ as its SVD. The best rank-$k$ approximation of the data matrix is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, where $\sigma_i$, $u_i$, and $v_i$ are the $i$th singular value and the corresponding left and right singular vectors. However, since this is computationally expensive, you need to use another approach to reduce the complexity.
First, you will generate a $5000 \times 1000$ data matrix with decaying singular values. To construct such a matrix, you can first generate a random Gaussian matrix with zero mean and identity covariance. Then, compute the SVD of this matrix as $U \Sigma V^T$. Now, replace $\Sigma$ with a diagonal matrix with decaying entries, $\Sigma_{ii} = i^{-2}$. Then, you can construct the data matrix with decaying singular values as $A = U \Sigma V^T$.
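A minimal sketch of this construction (assuming $\Sigma_{ii} = i^{-2}$ as above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 1000
G = rng.standard_normal((n, d))                    # zero-mean, identity-covariance Gaussian
U, _, Vt = np.linalg.svd(G, full_matrices=False)   # G = U diag(s) Vt
sigma = 1.0 / np.arange(1, d + 1) ** 2             # decaying singular values, Sigma_ii = i^{-2}
A = U @ np.diag(sigma) @ Vt                        # data matrix with decaying spectrum
```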
- One approach to obtain a rank-$k$ approximation is as follows.
  - Obtain a random approximation of the data matrix $A$ as $C$ by uniformly sampling $k$ columns and scaling appropriately.
  - Verify the approximation by computing the relative error $\|AA^T - CC^T\|_F / \|AA^T\|_F$.
  - Compute the randomized rank-$k$ approximation defined as $\hat{A}_k = C C^{\dagger} A$.
Plot the approximation error, i.e., $\|A - \hat{A}_k\|_F$, as a function of the rank $k$. Verify the error bound given in the Lecture 19 slides and find the numerical value of the constant appearing in that bound. Repeat this procedure for uniform sampling, column norm score sampling, the Gaussian sketch, the $\pm 1$ i.i.d. sketch, and the randomized Hadamard (FJLT) sketch.
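Below is a minimal sketch of the uniform column-sampling variant together with the two error quantities; the other constructions (column norm scores, Gaussian, $\pm 1$ i.i.d., randomized Hadamard) would only change how `C` is formed:

```python
import numpy as np

def uniform_column_approx(A, k, rng=None):
    """Uniformly sample k columns of A, rescale, and form the rank-k approximation C C^+ A."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    idx = rng.choice(d, size=k, replace=False)
    C = A[:, idx] * np.sqrt(d / k)                       # rescaled column sample
    A_hat = C @ (np.linalg.pinv(C) @ A)                  # randomized rank-k approximation
    gram_err = (np.linalg.norm(A @ A.T - C @ C.T, "fro")
                / np.linalg.norm(A @ A.T, "fro"))        # relative Gram-matrix error
    approx_err = np.linalg.norm(A - A_hat, "fro")        # approximation error
    return A_hat, gram_err, approx_err
```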
- Now we use the low-rank approximation in part (a) to produce an approximate Singular Value Decomposition of $A$. The Randomized SVD algorithm is as follows.
  - Generate a sketching matrix $S$.
  - Compute $C = AS$.
  - Calculate the QR decomposition $C = AS = QR$.
  - Calculate the SVD of $Q^T A$, i.e., $Q^T A = \tilde{U} \tilde{\Sigma} \tilde{V}^T$.
  - The approximate SVD of $A$ is given by $A \approx (Q\tilde{U}) \tilde{\Sigma} \tilde{V}^T$.
  - Note that the randomized rank-$k$ approximation $\hat{A}_k := (Q\tilde{U}) \tilde{\Sigma} \tilde{V}^T = QQ^T A = C C^{\dagger} A$ is the same as in part (a).
Plot the approximation error, i.e., $\|A - \hat{A}_k\|_F$, as a function of the rank $k$. To compare with the exact SVD, plot $\|A - A_k\|_F$, where $A_k$ is the best rank-$k$ approximation of $A$ using the SVD of $A$. Repeat the whole procedure for uniform sampling, column norm score sampling, the Gaussian sketch, the $\pm 1$ i.i.d. sketch, and the randomized Hadamard (FJLT) sketch.
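A minimal sketch of the randomized SVD with a Gaussian sketching matrix is shown below; swapping out `S` gives the other sketches:

```python
import numpy as np

def randomized_svd(A, k, rng=None):
    """Randomized SVD with a Gaussian sketch S; other sketches only change how S is built."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    S = rng.standard_normal((d, k)) / np.sqrt(k)    # Gaussian sketching matrix
    C = A @ S                                       # sketched matrix
    Q, _ = np.linalg.qr(C)                          # C = QR
    U_t, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U_approx = Q @ U_t                              # approximate left singular vectors
    A_hat = U_approx @ np.diag(s) @ Vt              # = Q Q^T A, the rank-k approximation
    return U_approx, s, Vt, A_hat
```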
CUR decomposition
CUR decomposition is a dimensionality reduction and low-rank approximation method in a similar spirit to the Singular Value Decomposition (SVD). One particular drawback of the SVD is that the left and right singular vectors are hard to interpret, since they lack any direct meaning in terms of the original data. On the other hand, the CUR decomposition is interpretable, since it involves small subsets of the rows/columns of the original data.
Suppose that $A$ is an $m \times n$ matrix of approximate rank $k$, and that we have two column/row subsampled approximations
$$C := A(:, J_S), \qquad (4)$$
$$R := A(I_S, :), \qquad (5)$$
where $I_S, J_S$ are appropriate index subsets. Then we can approximate the matrix $A$ via
$$A \approx C C^{\dagger} A R^{\dagger} R = CUR, \qquad (6)$$
where $U := C^{\dagger} A R^{\dagger}$ and the superscript $\dagger$ denotes the pseudoinverse operation. We choose a set of columns $C$ and a set of rows $R$, which play the role of $U$ and $V$ in the SVD. We may pick any number of rows and columns. Therefore, this factorization provides an interpretable alternative to the Singular Value Decomposition. Furthermore, CUR has computational advantages, especially for sparse matrices. In particular, since CUR directly selects rows and columns of $A$, the matrices $C$ and $R$ are sparse whenever $A$ is sparse, whereas $U$ and $V$ in the SVD can still be dense matrices.
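A minimal sketch of forming the CUR approximation from uniformly sampled index sets is given below; the norm score and leverage score variants in the later parts would only change how `IS` and `JS` are drawn:

```python
import numpy as np

def cur_decomposition(A, m, rng=None):
    """CUR approximation with uniformly sampled row/column index sets of size m."""
    rng = np.random.default_rng(0) if rng is None else rng
    JS = rng.choice(A.shape[1], size=m, replace=False)    # column indices
    IS = rng.choice(A.shape[0], size=m, replace=False)    # row indices
    C = A[:, JS]                                          # C := A(:, JS)
    R = A[IS, :]                                          # R := A(IS, :)
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)         # U := C^+ A R^+
    A_hat = C @ U @ R
    fro_err = np.linalg.norm(A - A_hat, "fro")            # Frobenius-norm error
    spec_err = np.linalg.norm(A - A_hat, 2)               # spectral-norm error
    return A_hat, fro_err, spec_err
```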
You will be using the movie ratings dataset from the https://movielens.org/ website, provided at https://grouplens.org/datasets/movielens/100k/. The dataset contains 100,000 ratings from 943 users on 1682 movies. The user-by-movie matrix of dimension $943 \times 1682$ is very sparse.
- For $m = 1, \ldots, 10$, apply uniform row/column subsampling with sample size $m$ to obtain $C$ and $R$ matrices of rank $m$, and solve for $U$ to obtain the CUR decomposition. Plot the approximation error in the Frobenius norm and the spectral norm as a function of $m$.
- Repeat part (a) using $\ell_2$ row/column norm scores for row/column subsampling, respectively.
- Repeat part (a) using leverage scores of $A$ and $A^T$ for row/column subsampling, respectively.
