Randomized Gauss-Newton Algorithm for Training Neural Networks

In this problem, you will implement a randomized Gauss-Newton (GN) algorithm to train a neural network. We will consider a single-layer architecture and a squared-loss training objective. This special case is also referred to as a nonlinear least squares problem. Consider a set of training data $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ is the $i$th data sample and $y_i \in \{0,1\}$ is the corresponding label. Then, use the nonlinear function $\sigma(w^T x)$, where $\sigma(\cdot)$ denotes the scalar nonlinearity of the single-layer network, to fit the given data labels. In order to estimate $w$, we can minimize the sum of squares as follows:
$$w^* = \arg\min_w \sum_{i=1}^n \left(\sigma(w^T x_i) - y_i\right)^2 = \arg\min_w \|f(w) - y\|_2^2,$$
where $w^*$ denotes the optimal parameter vector, $y$ is the vector containing the labels $y_1, \ldots, y_n$, and the vector output function $f(w)$ is defined as
$$f(w) := \left[\sigma(w^T x_1), \ldots, \sigma(w^T x_n)\right]^T.$$
Unlike ordinary least squares and logistic regression, the above problem is non-convex.
Download the provided MNIST dataset (mnist_all.mat), where you will use only the samples belonging to digit 0 (train0) and digit 1 (train1). Next, we will apply the Gauss-Newton (GN) heuristic to approximately solve the non-convex optimization problem. In the GN algorithm, you first need to linearize the nonlinear vector output function $f(w)$. The first-order expansion of $f(w)$ around the current estimate $w_k$ is given by
$$f(w) \approx f(w_k) + J_k (w - w_k) \qquad (1)$$
where $J_k \in \mathbb{R}^{n \times d}$ is the Jacobian matrix at iteration $k$, whose $ij$th entry is defined as
$$(J_k)_{ij} = \frac{\partial f_i(w)}{\partial w_j}\bigg|_{w = w_k}.$$
Then, the Gauss-Newton algorithm performs the update
$$w_{k+1} = \arg\min_w \left\| f(w_k) + J_k (w - w_k) - y \right\|_2^2. \qquad (2)$$
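For concreteness, here is a minimal sketch of a single Gauss-Newton step in Python, assuming the single-layer model $f_i(w) = \sigma(w^T x_i)$ with a sigmoid nonlinearity (the nonlinearity actually used in the assignment may differ); the linearized sub-problem (2) is solved with `np.linalg.lstsq`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gauss_newton_step(w, X, y):
    """One GN step for the assumed model f_i(w) = sigmoid(x_i^T w)."""
    f = sigmoid(X @ w)                     # f(w_k), shape (n,)
    J = (f * (1.0 - f))[:, None] * X       # Jacobian: J_ij = sigmoid'(x_i^T w) * X_ij
    # Sub-problem (2): minimize ||f + J (w_new - w_k) - y||_2^2 over w_new,
    # i.e. solve the linear least-squares problem J * delta = y - f.
    delta, *_ = np.linalg.lstsq(J, y - f, rcond=None)
    return w + delta
```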
- Find the Gauss-Newton update (2) explicitly for this problem by deriving the Jacobian. Describe how the sub-problems can be solved. What is the computational complexity per iteration for an $n \times d$ data matrix?
One can incorporate a step size $\alpha \in (0,1]$ in the above update as follows:
$$w_{k+1} = (1-\alpha)\, w_k + \alpha\, d_k,$$
where
$$d_k := \arg\min_w \left\| f(w_k) + J_k (w - w_k) - y \right\|_2^2. \qquad (3)$$
This is referred to as the damped Gauss-Newton algorithm.
- Implement the damped Gauss-Newton algorithm on the MNIST dataset to classify digits 0 and 1. Find a reasonable step size for fast convergence by trial and error. Plot the training error, i.e., $\|f(w_k) - y\|_2^2$, as a function of the iteration index.
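A minimal sketch of the damped Gauss-Newton loop is shown below; it again assumes a sigmoid nonlinearity, takes the stacked digit-0/1 samples as `X` with labels `y`, and treats the step size `alpha` as the parameter to be tuned by trial and error:

```python
import numpy as np

def damped_gauss_newton(X, y, alpha=0.5, num_iters=50):
    """Damped GN for the nonlinear least-squares fit (sigmoid nonlinearity assumed)."""
    n, d = X.shape
    w = np.zeros(d)
    errors = []
    for _ in range(num_iters):
        f = 1.0 / (1.0 + np.exp(-(X @ w)))                   # f(w_k)
        errors.append(np.sum((f - y) ** 2))                  # training error ||f(w_k) - y||^2
        J = (f * (1.0 - f))[:, None] * X                     # Jacobian at w_k
        d_k = w + np.linalg.lstsq(J, y - f, rcond=None)[0]   # solves sub-problem (3)
        w = (1.0 - alpha) * w + alpha * d_k                  # damped update
    return w, errors
```

The returned `errors` list can be plotted directly against the iteration index, and `alpha` varied until the curve decreases quickly and stably.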
- You will now apply sampling to reduce the computational complexity of the GN algorithm. Apply uniform sampling to the rows of the least-squares sub-problem (3) at each iteration. Plot the training error, i.e., $\|f(w_k) - y\|_2^2$, as a function of the iteration index. Repeat this procedure for row norm score sampling.
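One possible sketch of the sampled sub-problem is given below; `J`, `f`, `y`, and `w` are assumed to come from the current GN iteration (as in the loop above), `s` is the sample size, and the same function covers both uniform and row norm score sampling:

```python
import numpy as np

def sampled_gn_direction(J, f, y, w, s, scheme="uniform", rng=None):
    """Solve a row-sampled version of the least-squares sub-problem (3)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = J.shape[0]
    if scheme == "uniform":
        p = np.full(n, 1.0 / n)
    else:                                     # row norm scores
        norms = np.sum(J ** 2, axis=1)
        p = norms / norms.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])         # importance-sampling rescaling
    J_s = scale[:, None] * J[idx]
    r_s = scale * (y - f)[idx]
    return w + np.linalg.lstsq(J_s, r_s, rcond=None)[0]
```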
Randomized Kaczmarz Algorithm
In this question, you will implement the Randomized Kaczmarz Algorithm (see Lecture 16 slides) on the YearPredictionMSD dataset, which can be downloaded from https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD. In this file, there are 463,715 training and 51,630 test samples (songs) with 90 audio features, and your goal is to predict the release year of each song using least squares regression. After downloading the file, you can import and partition the data using the following commands in Python:
```python
import numpy as np

file_name = "YearPredictionMSD.txt"
file = open(file_name, "r")
data = []
for line in file.readlines():
    fname = line.rstrip().split(",")
    data.append(fname)
data_arr = np.asarray(data).astype(float)
A = data_arr[:, 1:]   # audio features (the first column is the release year)
b = (data_arr[:, 0] - np.min(data_arr[:, 0])) / (np.max(data_arr[:, 0]) - np.min(data_arr[:, 0]))
```

You may also normalize the data matrix as follows:

```python
meanA = np.mean(A, axis=0)
stdA = np.std(A, axis=0)
A = (A - meanA) / stdA
```
- Plot the training and test cost vs. iteration curves for the Randomized Kaczmarz Algorithm (i.e., the one with the optimal sampling distribution) and the randomized algorithm with the uniform sampling distribution on the same figure. For the uniformly sampled version, you should use an explicit learning rate and tune this parameter by trial and error.
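A minimal sketch of both Kaczmarz variants is given below, assuming `A` and `b` have been loaded and normalized as above; the learning rate `lr` for the uniform variant is a placeholder to be tuned by trial and error:

```python
import numpy as np

def kaczmarz(A, b, num_iters=200000, sampling="row_norm", lr=1.0, rng=None):
    """Randomized Kaczmarz: project onto one randomly chosen row equation per step."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    if sampling == "row_norm":                 # optimal distribution: squared row norms
        p = np.sum(A ** 2, axis=1)
        p = p / p.sum()
    else:                                      # uniform sampling
        p = np.full(n, 1.0 / n)
    x = np.zeros(d)
    costs = []
    for t in range(num_iters):
        i = rng.choice(n, p=p)
        a_i = A[i]
        x = x + lr * (b[i] - a_i @ x) / (a_i @ a_i) * a_i
        if t % 1000 == 0:
            costs.append(np.mean((A @ x - b) ** 2))    # training cost
    return x, costs
```

The test cost curve is obtained by evaluating the same mean squared error on the test samples at the same checkpoints.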
- Note that the analysis of the Randomized Kaczmarz Algorithm assumes that the linear system $Ax = b$ is consistent, which may not be valid in our case. Show that an optimal solution of the least squares problem $\min_x \|Ax - b\|_2^2$ can be found by solving the augmented linear system
$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}.$$
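As a sketch of the reasoning (assuming the augmented system above is the intended one), eliminating the residual variable recovers the normal equations:
$$\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} \quad\Longleftrightarrow\quad r = b - Ax \ \text{ and } \ A^T (b - Ax) = 0,$$
and the second condition is exactly the normal equations of $\min_x \|Ax - b\|_2^2$, which always admit a solution, so the augmented system is consistent.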
- Since the augmented linear system in part (b) is consistent, we may apply the Randomized Kaczmarz Algorithm. Repeat part (a) using the Randomized Kaczmarz Algorithm on the augmented linear system.
- Repeat part (a) using SGD with diminishing step-sizes (see page 11 of Lecture 16 slides) by tuning the initial step size.
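A possible sketch of the SGD baseline is below; the $\eta_t = \eta_0 / (t+1)$ schedule is only an assumed stand-in for the diminishing step size on the referenced slide, and $\eta_0$ is the initial step size to be tuned:

```python
import numpy as np

def sgd_least_squares(A, b, eta0=0.1, num_iters=200000, rng=None):
    """SGD for min_x ||Ax - b||^2 with a diminishing step size eta0 / (t + 1) (assumed schedule)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    x = np.zeros(d)
    costs = []
    for t in range(num_iters):
        i = rng.integers(n)
        grad = 2.0 * (A[i] @ x - b[i]) * A[i]    # stochastic gradient from one sample
        x = x - eta0 / (t + 1) * grad
        if t % 1000 == 0:
            costs.append(np.mean((A @ x - b) ** 2))
    return x, costs
```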
Randomized Low-Rank Approximation and Randomized SVD
In this problem, you will implement a randomized approach in order to obtain a low-rank approximation of a given data matrix. Assume that you are given a data matrix $A \in \mathbb{R}^{n \times d}$ with $U \Sigma V^T$ as its SVD. The best rank-$k$ approximation of the data matrix is $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, where $\sigma_i$, $u_i$, and $v_i$ are the $i$th singular value and the corresponding left and right singular vectors. However, since this is computationally expensive, you need to use another approach to reduce the complexity.
First, you will generate a $5000 \times 1000$ data matrix with decaying singular values. To construct such a matrix, you can first generate a random Gaussian matrix with zero mean and identity covariance. Then, compute the SVD of this matrix as $U \Sigma V^T$. Now, replace $\Sigma$ with a diagonal matrix with decaying entries, $\Sigma_{ii} = i^{-2}$. Then, you can construct the data matrix with decaying singular values as $A = U \Sigma V^T$.
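A minimal sketch of this construction (assuming $\Sigma_{ii} = i^{-2}$ as above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 1000
G = rng.standard_normal((n, d))                    # zero-mean, identity-covariance Gaussian
U, _, Vt = np.linalg.svd(G, full_matrices=False)   # G = U diag(s) Vt
sigma = 1.0 / np.arange(1, d + 1) ** 2             # decaying singular values, Sigma_ii = i^{-2}
A = U @ np.diag(sigma) @ Vt                        # data matrix with decaying spectrum
```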
- One approach to obtain a rank-$k$ approximation is as follows.
  - Obtain a random approximation of the data matrix $A$ as $C$ by uniformly sampling $k$ columns and scaling appropriately.
  - Verify the approximation by computing the relative error $\|AA^T - CC^T\|_F / \|AA^T\|_F$.
  - Compute the randomized rank-$k$ approximation defined as $\hat{A}_k = C C^{\dagger} A$.
Plot the approximation error, i.e., $\|A - \hat{A}_k\|_F$, as a function of the rank $k$. Verify the error bound given in the Lecture 19 slides and find the numerical value of the constant appearing in that bound. Repeat this procedure for uniform sampling, column norm score sampling, the Gaussian sketch, the $\pm 1$ i.i.d. sketch, and the randomized Hadamard (FJLT) sketch.
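Below is a minimal sketch of the uniform column-sampling variant together with the two error quantities; the other constructions (column norm scores, Gaussian, $\pm 1$ i.i.d., randomized Hadamard) would only change how `C` is formed:

```python
import numpy as np

def uniform_column_approx(A, k, rng=None):
    """Uniformly sample k columns of A, rescale, and form the rank-k approximation C C^+ A."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    idx = rng.choice(d, size=k, replace=False)
    C = A[:, idx] * np.sqrt(d / k)                       # rescaled column sample
    A_hat = C @ (np.linalg.pinv(C) @ A)                  # randomized rank-k approximation
    gram_err = (np.linalg.norm(A @ A.T - C @ C.T, "fro")
                / np.linalg.norm(A @ A.T, "fro"))        # relative Gram-matrix error
    approx_err = np.linalg.norm(A - A_hat, "fro")        # approximation error
    return A_hat, gram_err, approx_err
```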
- Now we use the low-rank approximation in part (a) to produce an approximate Singular Value Decomposition of $A$. The Randomized SVD algorithm is as follows.
  - Generate a sketching matrix $S$.
  - Compute $C = AS$.
  - Calculate the QR decomposition $C = AS = QR$.
  - Calculate the SVD of $Q^T A$, i.e., $Q^T A = \tilde{U} \tilde{\Sigma} \tilde{V}^T$.
  - The approximate SVD of $A$ is given by $A \approx (Q\tilde{U}) \tilde{\Sigma} \tilde{V}^T$.
  - Note that the randomized rank-$k$ approximation $\hat{A}_k := (Q\tilde{U}) \tilde{\Sigma} \tilde{V}^T = QQ^T A = C C^{\dagger} A$ is the same as in part (a).
Plot the approximation error, i.e., $\|A - \hat{A}_k\|_F$, as a function of the rank $k$. To compare with the exact SVD, plot $\|A - A_k\|_F$, where $A_k$ is the best rank-$k$ approximation of $A$ using the SVD of $A$. Repeat the whole procedure for uniform sampling, column norm score sampling, the Gaussian sketch, the $\pm 1$ i.i.d. sketch, and the randomized Hadamard (FJLT) sketch.
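A minimal sketch of the randomized SVD with a Gaussian sketching matrix is shown below; swapping out `S` gives the other sketches:

```python
import numpy as np

def randomized_svd(A, k, rng=None):
    """Randomized SVD with a Gaussian sketch S; other sketches only change how S is built."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = A.shape
    S = rng.standard_normal((d, k)) / np.sqrt(k)    # Gaussian sketching matrix
    C = A @ S                                       # sketched matrix
    Q, _ = np.linalg.qr(C)                          # C = QR
    U_t, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U_approx = Q @ U_t                              # approximate left singular vectors
    A_hat = U_approx @ np.diag(s) @ Vt              # = Q Q^T A, the rank-k approximation
    return U_approx, s, Vt, A_hat
```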
CUR decomposition
CUR decomposition is a dimensionality reduction and low-rank approximation method in a similar spirit to the Singular Value Decomposition (SVD). One particular drawback of the SVD is that the left and right singular vectors are hard to interpret, since they lack any direct meaning in terms of the original data. On the other hand, the CUR decomposition is interpretable, since it involves small subsets of the rows/columns of the original data.
Suppose that $A$ is an $m \times n$ matrix of approximate rank $k$, and that we have two column/row subsampled approximations
$$C := A(:, J_S), \qquad (4)$$
$$R := A(I_S, :), \qquad (5)$$
where $I_S, J_S$ are appropriate index subsets. Then we can approximate the matrix $A$ via
$$A \approx C C^{\dagger} A R^{\dagger} R = CUR, \qquad (6)$$
where $U := C^{\dagger} A R^{\dagger}$ and the superscript $\dagger$ denotes the pseudoinverse operation. We choose a set of columns $C$ and a set of rows $R$, which play the role of $U$ and $V$ in the SVD. We may pick any number of rows and columns. Therefore, this factorization provides an interpretable alternative to the Singular Value Decomposition. Furthermore, CUR has computational advantages, especially for sparse matrices. In particular, since CUR directly selects rows and columns of $A$, the matrices $C$ and $R$ are sparse whenever $A$ is sparse, whereas $U$ and $V$ in the SVD can still be dense matrices.
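A minimal sketch of forming the CUR approximation from uniformly sampled index sets is given below; the norm score and leverage score variants in the later parts would only change how `IS` and `JS` are drawn:

```python
import numpy as np

def cur_decomposition(A, m, rng=None):
    """CUR approximation with uniformly sampled row/column index sets of size m."""
    rng = np.random.default_rng(0) if rng is None else rng
    JS = rng.choice(A.shape[1], size=m, replace=False)    # column indices
    IS = rng.choice(A.shape[0], size=m, replace=False)    # row indices
    C = A[:, JS]                                          # C := A(:, JS)
    R = A[IS, :]                                          # R := A(IS, :)
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)         # U := C^+ A R^+
    A_hat = C @ U @ R
    fro_err = np.linalg.norm(A - A_hat, "fro")            # Frobenius-norm error
    spec_err = np.linalg.norm(A - A_hat, 2)               # spectral-norm error
    return A_hat, fro_err, spec_err
```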
You will be using the movie ratings dataset from the https://movielens.org/ website, provided at https://grouplens.org/datasets/movielens/100k/. The dataset contains 100,000 ratings from 943 users on 1682 movies. The user-by-movie matrix of dimension $943 \times 1682$ is very sparse.
- For $m = 1, \ldots, 10$, apply uniform row/column subsampling with sample size $m$ to obtain $C$ and $R$ matrices of rank $m$, and solve for $U$ to obtain the CUR decomposition. Plot the approximation error in the Frobenius norm and the spectral norm as a function of $m$.
- Repeat part (a) using $\ell_2$ row/column norm scores for row/column subsampling, respectively.
- Repeat part (a) using leverage scores of $A$ and $A^T$ for row/column subsampling, respectively.
