Conceptual Questions
- These questions should be answerable without referring to external materials. Briefly justify your answers with a few words.
- [2 points] True or False: The maximum margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers.
- [2 points] Say you trained an SVM classifier with an RBF kernel $K(x, z) = \exp(-\gamma \|x - z\|^2)$. It seems to underfit the training set: should you increase or decrease $\gamma$?
- [2 points] True or False: Training deep neural networks requires minimizing a non-convex loss function, and therefore gradient descent might not reach the globally-optimal solution.
- [2 points] True or False: It is a good practice to initialize all weights to zero when training a deep neural network.
- [2 points] True or False: We use non-linear activation functions in a neural network's hidden layers so that the network learns non-linear decision boundaries.
- [2 points] True or False: Given a neural network, the time complexity of the backward pass in the backpropagation algorithm can be prohibitively large compared to the relatively low time complexity of the forward pass.
Kernels
- [5 points] Suppose that our inputs $x$ are one-dimensional and that our feature map is infinite-dimensional: $\phi(x)$ is a vector whose $i$th component is
$$\frac{1}{\sqrt{i!}}\, e^{-x^2/2}\, x^i$$
for all nonnegative integers $i$. (Thus, $\phi(x)$ is an infinite-dimensional vector.) Show that
$$K(x, x') = e^{-\frac{(x - x')^2}{2}}$$
is a kernel function for this feature map, i.e.,
$$\phi(x) \cdot \phi(x') = e^{-\frac{(x - x')^2}{2}}.$$
Hint: Use the Taylor expansion of $e^z$. (This is the one-dimensional version of the Gaussian (RBF) kernel.)
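As a quick numerical sanity check (not a substitute for the proof), one can truncate the infinite feature map at a large index and verify that the inner product approaches the claimed kernel value; the sketch below assumes the component formula stated above and uses an arbitrary truncation length.

```python
import numpy as np
from math import exp, factorial, sqrt

def phi(x, num_terms=30):
    # Truncated feature map: the i-th component is exp(-x^2 / 2) * x^i / sqrt(i!)
    return np.array([exp(-x**2 / 2) * x**i / sqrt(factorial(i))
                     for i in range(num_terms)])

x, xp = 0.3, -0.7
approx = float(phi(x) @ phi(xp))   # inner product of truncated feature maps
exact = exp(-(x - xp) ** 2 / 2)    # claimed closed form of the kernel
print(approx, exact)               # the two numbers should agree closely
```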
This problem will get you familiar with kernel ridge regression using the polynomial and RBF kernels. First, let's generate some data. Let $n = 30$ and $f_*(x) = 4 \sin(\pi x) \cos(6 \pi x^2)$. For $i = 1, \ldots, n$, let each $x_i$ be drawn uniformly at random from $[0, 1]$ and $y_i = f_*(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 1)$.
For any function $f$, the true error and the train error are respectively defined as
$$\mathcal{E}_{\text{true}}(f) = \mathbb{E}_{x, y}\left[(f(x) - y)^2\right], \qquad \widehat{\mathcal{E}}_{\text{train}}(f) = \frac{1}{n} \sum_{i=1}^{n} \left(f(x_i) - y_i\right)^2.$$
Using kernel ridge regression, construct a predictor
$$\widehat{\alpha} = \arg\min_{\alpha} \|K\alpha - y\|_2^2 + \lambda \alpha^\top K \alpha, \qquad \widehat{f}(x) = \sum_{i=1}^{n} \widehat{\alpha}_i \, k(x_i, x)$$
where $K_{i,j} = k(x_i, x_j)$ is a kernel evaluation and $\lambda$ is the regularization constant. Include any code you use for your experiments in your submission.
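As a starting point, the following is a minimal NumPy sketch of the data generation and the fit, assuming the closed-form minimizer $\widehat{\alpha} = (K + \lambda I)^{-1} y$ of the objective above (valid when $K$ is positive definite); the helper names and the $\gamma$, $\lambda$ values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # True function from the problem statement
    return 4 * np.sin(np.pi * x) * np.cos(6 * np.pi * x ** 2)

# Training data as described above: n = 30, x_i ~ Unif[0, 1], y_i = f_*(x_i) + eps_i
n = 30
x_train = rng.uniform(0, 1, size=n)
y_train = f_true(x_train) + rng.standard_normal(n)

def krr_fit(x, y, kernel, lam):
    # Minimize ||K a - y||^2 + lam * a^T K a; the minimizer is (K + lam I)^{-1} y
    K = kernel(x[:, None], x[None, :])
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def krr_predict(alpha, x_fit, x_new, kernel):
    # f_hat(x) = sum_i alpha_i * k(x_i, x)
    return kernel(x_new[:, None], x_fit[None, :]) @ alpha

# Example with the RBF kernel; gamma and lam are placeholders to be tuned
k_rbf = lambda a, b: np.exp(-5.0 * (a - b) ** 2)
alpha = krr_fit(x_train, y_train, k_rbf, lam=1e-3)
grid = np.linspace(0, 1, 200)
y_hat = krr_predict(alpha, x_train, grid, k_rbf)
```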
- [10 points] Using leave-one-out cross validation, find a good $\lambda$ and hyperparameter settings for the following kernels (see the search sketch after this list):
- $k_{\text{poly}}(x, z) = (1 + x^\top z)^d$ where $d \in \mathbb{N}$ is a hyperparameter,
- $k_{\text{rbf}}(x, z) = \exp(-\gamma \|x - z\|^2)$ where $\gamma > 0$ is a hyperparameter[1]. Report the values of $d$, $\gamma$, and $\lambda$ for both kernels.
- [10 points] Let $\widehat{f}_{\text{poly}}(x)$ and $\widehat{f}_{\text{rbf}}(x)$ be the functions learned using the hyperparameters you found in the previous part. For a single plot per function $\widehat{f}$, plot the original data $\{(x_i, y_i)\}_{i=1}^{n}$, the true $f_*(x)$, and $\widehat{f}(x)$ (i.e., define a fine grid on $[0, 1]$ to plot the functions).
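The leave-one-out search referenced above could look something like this sketch; it reuses the hypothetical krr_fit/krr_predict helpers and x_train, y_train from the earlier sketch, and the candidate grids are arbitrary.

```python
import numpy as np

def loo_error(x, y, kernel, lam):
    # Mean squared error when each training point is held out in turn
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        alpha = krr_fit(x[mask], y[mask], kernel, lam)
        pred = krr_predict(alpha, x[mask], x[i:i + 1], kernel)[0]
        errs.append((pred - y[i]) ** 2)
    return float(np.mean(errs))

# Illustrative grids for the RBF kernel; the assignment leaves the ranges up to you
best = None
for lam in 10.0 ** np.arange(-5, 1):
    for gamma in 10.0 ** np.arange(-1, 3):
        k = lambda a, b, g=gamma: np.exp(-g * (a - b) ** 2)
        err = loo_error(x_train, y_train, k, lam)
        if best is None or err < best[0]:
            best = (err, lam, gamma)
```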
Neural Networks for MNIST
- In Homework 1, we used ridge regression for training a classifier for the MNIST data set. In Homework 2, we used logistic regression to distinguish between the digits 2 and 7. In this problem, we will use PyTorch to build a simple neural network classifier for MNIST to further improve our accuracy.
We will implement two different architectures: a shallow but wide network, and a narrow but deeper network. For both architectures, we use $d$ to refer to the number of input features (in MNIST, $d = 28^2 = 784$), $h_i$ to refer to the dimension of the $i$th hidden layer, and $k$ for the number of target classes (in MNIST, $k = 10$). For the non-linear activation, use ReLU. Recall from lecture that
$$\operatorname{ReLU}(z) = \max(0, z).$$
Weight Initialization
Consider a weight matrix $W \in \mathbb{R}^{n \times m}$ and bias $b \in \mathbb{R}^{n}$. Note that here $m$ refers to the input dimension and $n$ to the output dimension of the transformation $Wx + b$. Define $\alpha = \frac{1}{\sqrt{m}}$. Initialize all your weight matrices and biases according to $\operatorname{Unif}(-\alpha, \alpha)$.
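A minimal sketch of this initialization, using plain tensors rather than torch.nn layers (consistent with the PyTorch restriction stated later), might look like the following; the helper name is illustrative.

```python
import torch

def init_layer(m, n):
    # W in R^{n x m}, b in R^n, entries drawn from Unif(-alpha, alpha) with alpha = 1/sqrt(m)
    alpha = 1.0 / m ** 0.5
    W = (2 * alpha) * torch.rand(n, m) - alpha
    b = (2 * alpha) * torch.rand(n) - alpha
    W.requires_grad_(True)
    b.requires_grad_(True)
    return W, b
```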
Training
For this assignment, use the Adam optimizer from torch.optim. Adam is a more advanced form of gradient descent that combines momentum and learning-rate scaling, and it often converges faster than plain gradient descent. You may use either full-batch gradient descent or any form of stochastic gradient descent: you are still using Adam, but you may compute each update from the full dataset, a single datapoint, or a mini-batch. Use cross entropy for the loss function and ReLU for the non-linearity.
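A minimal sketch of a training loop under these instructions follows; `params` (a list of tensors with requires_grad=True) and `forward` (a function returning logits) are placeholders for whichever network you implement below, and the batch size and learning rate are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def train(params, forward, X, y, lr=1e-3, epochs=50, batch_size=128):
    # Mini-batch training with Adam and cross-entropy loss
    optimizer = torch.optim.Adam(params, lr=lr)
    losses = []
    for epoch in range(epochs):
        perm = torch.randperm(X.shape[0])
        epoch_loss = 0.0
        for i in range(0, X.shape[0], batch_size):
            idx = perm[i:i + batch_size]
            logits = forward(params, X[idx])       # unnormalized class scores
            loss = F.cross_entropy(logits, y[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * len(idx)
        losses.append(epoch_loss / X.shape[0])     # per-epoch loss for the training plot
    return losses
```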
Implementing the Neural Networks
- [10 points] Let $W_0 \in \mathbb{R}^{h \times d}$, $b_0 \in \mathbb{R}^{h}$, $W_1 \in \mathbb{R}^{k \times h}$, $b_1 \in \mathbb{R}^{k}$, and $\sigma(z) : \mathbb{R} \to \mathbb{R}$ some non-linear activation function. Given some $x \in \mathbb{R}^{d}$, the forward pass of the wide, shallow network can be formulated as:
$$\mathcal{F}_1(x) = W_1 \, \sigma(W_0 x + b_0) + b_1$$
Use $h = 64$ for the number of hidden units and choose an appropriate learning rate. Train the network until it reaches 99% accuracy on the training data and provide a training plot (loss vs. epoch). Finally, evaluate the model on the test data and report both the accuracy and the loss.
- [10 points] Let $W_0 \in \mathbb{R}^{h_0 \times d}$, $b_0 \in \mathbb{R}^{h_0}$, $W_1 \in \mathbb{R}^{h_1 \times h_0}$, $b_1 \in \mathbb{R}^{h_1}$, $W_2 \in \mathbb{R}^{k \times h_1}$, $b_2 \in \mathbb{R}^{k}$, and $\sigma(z) : \mathbb{R} \to \mathbb{R}$ some non-linear activation function. Given some $x \in \mathbb{R}^{d}$, the forward pass of the network can be formulated as:
$$\mathcal{F}_2(x) = W_2 \, \sigma(W_1 \, \sigma(W_0 x + b_0) + b_1) + b_2$$
Use $h_0 = h_1 = 32$ and perform the same steps as in part a.
- [5 points] Compute the total number of parameters of each network and report them. Then compare the number of parameters as well as the test accuracies the networks achieved. Is one of the approaches (wide and shallow vs. narrow and deep) better than the other? Give an intuition for why or why not.
Using PyTorch: For your solution, you may not use any functionality from the torch.nn module except for torch.nn.functional.relu and torch.nn.functional.cross_entropy. You must implement the networks $\mathcal{F}_1$ and $\mathcal{F}_2$ from scratch. For starter code and a tutorial on PyTorch, refer to the section 6 and 7 materials.
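Under these constraints, a minimal sketch of $\mathcal{F}_1$ using only plain tensors and torch.nn.functional.relu could look like this ($\mathcal{F}_2$ follows the same pattern with one extra hidden layer); it assumes the hypothetical init_layer helper sketched in the Weight Initialization section.

```python
import torch
import torch.nn.functional as F

d, h, k = 784, 64, 10
W0, b0 = init_layer(d, h)   # W0: h x d, b0: h
W1, b1 = init_layer(h, k)   # W1: k x h, b1: k
params = [W0, b0, W1, b1]

def forward_f1(params, x):
    # F1(x) = W1 * relu(W0 x + b0) + b1, applied to a batch of flattened images
    W0, b0, W1, b1 = params
    hidden = F.relu(x @ W0.T + b0)   # batched version of W0 x + b0
    return hidden @ W1.T + b1        # logits, one row per example
```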
Using Pretrained Networks and Transfer Learning
- So far we have trained very small neural networks from scratch. As mentioned in the previous problem, modern neural networks are much larger and more difficult to train and validate. In practice, it is rare to train such large networks from scratch. This is because it is difficult to obtain both the massive datasets and the computational resources required to train such networks.
Instead of training a network from scratch, in this problem, we will use a network that has already been trained on a very large dataset (ImageNet) and adjust it for the task at hand. This process of adapting weights in a model trained for another task is known as transfer learning.
- Begin with the pretrained AlexNet model from torchvision.models for both tasks below. AlexNet achieved an early breakthrough performance on ImageNet and was instrumental in sparking the deep learning revolution in 2012.
- Do not modify any module within AlexNet that is not the final classifier layer.
- The output of AlexNet comes from the 6th layer of the classifier. Specifically, model.classifier[6] = nn.Linear(4096, 1000). To use AlexNet with CIFAR-10, we will reinitialize (replace) this layer with nn.Linear(4096, 10). This re-initializes the weights and changes the output shape to reflect the desired number of target classes in CIFAR-10.
We will explore two different ways to formulate transfer learning.
- [15 points] Use AlexNet as a fixed feature extractor: Add a new linear layer to replace the existing classification layer, and only adjust the weights of this new layer (keeping the weights of all other layers fixed). Provide plots for training loss and validation loss over the number of epochs. Report the highest validation accuracy achieved. Finally, evaluate the model on the test data and report both the accuracy and the loss.
When using AlexNet as a fixed feature extractor, make sure to freeze all of the parameters in the network before adding your new linear layer:
```python
import torch.nn as nn
import torchvision

model = torchvision.models.alexnet(pretrained=True)
# Freeze all pretrained weights; only the new final layer will be trained
for param in model.parameters():
    param.requires_grad = False
model.classifier[6] = nn.Linear(4096, 10)
```
- [15 points] Fine-Tuning: The second approach to transfer learning is to fine-tune the weights of the pretrained network, in addition to training the new classification layer. In this approach, all network weights are updated at every training iteration; we simply use the existing AlexNet weights as the initialization for our network (except for the weights in the new classification layer, which will be initialized using whichever method is specified in the constructor) prior to training on CIFAR-10. Following the same procedure, report all the same metrics and plots as in the previous question.
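For this fine-tuning variant, the setup differs from the feature-extractor case only in that no parameters are frozen; a minimal sketch (the learning rate is a placeholder) might be:

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.alexnet(pretrained=True)
# Replace the classification layer for CIFAR-10; all other weights keep their
# pretrained values and stay trainable (no requires_grad = False anywhere).
model.classifier[6] = nn.Linear(4096, 10)
# Every parameter is passed to the optimizer, so the whole network is updated.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```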
[1] Given a dataset $x_1, \ldots, x_n \in \mathbb{R}^{d}$, a heuristic for choosing a range of $\gamma$ in the right ballpark is the inverse of the median of all squared pairwise distances $\|x_i - x_j\|_2^2$.
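This heuristic could be computed along the following lines (a sketch for the one-dimensional data above; x_train is from the earlier sketch, and scipy is just one convenient way to get the pairwise distances).

```python
import numpy as np
from scipy.spatial.distance import pdist

# gamma on the order of 1 / median of all pairwise squared distances
gamma_init = 1.0 / np.median(pdist(x_train.reshape(-1, 1)) ** 2)
```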