
[SOLVED] ECE-GY 6143 Homework 1 to 6 Solutions

$25

File Name: Ece_gy_6143_homework_1_to_6_solutions.zip
File Size: 348.54 KB


1. (10 points) Let {x1, x2, . . . , xn} be a set of points in d-dimensional space. Suppose we wish
to produce a single point estimate µ ∈ R^d that minimizes the squared error:
\|x_1 - \mu\|_2^2 + \|x_2 - \mu\|_2^2 + \cdots + \|x_n - \mu\|_2^2
Find a closed form expression for µ and prove that your answer is correct.
2. (10 points) Not all norms behave the same; for instance, the ℓ1-norm of a vector can be
dramatically different from the ℓ2-norm, especially in high dimensions. Prove the following
norm inequalities for d-dimensional vectors, starting from the definitions provided in class and
lecture notes. (Use any algebraic technique/result you like, as long as you cite it.)
a. \|x\|_2 \le \|x\|_1 \le \sqrt{d}\,\|x\|_2
b. \|x\|_\infty \le \|x\|_2 \le \sqrt{d}\,\|x\|_\infty
c. \|x\|_\infty \le \|x\|_1 \le d\,\|x\|_\infty
3. (10 points) When we think of a Gaussian distribution (a bell-curve) in 1, 2, or 3 dimensions,
the picture that comes to mind is a “blob” with a lot of mass near the origin and exponential
decay away from the origin. However, the picture is very different in higher dimensions (and
illustrates the counter-intuitive nature of high-dimensional data analysis). In short, we will
show that Gaussian distributions are like soap bubbles: most of the mass is concentrated near
a shell of a given radius, and is empty everywhere else.
a. Fix d = 3 and generate 10,000 random samples from the standard multi-variate Gaussian
distribution defined in R^d.
b. Compute and plot the histogram of Euclidean norms of your samples. Also calculate the
average and standard deviation of the norms.
c. Increase d on a coarsely spaced log scale all the way up to d = 1000 (say d =
50, 100, 200, 500, 1000), and repeat parts (a) and (b). Plot the variation of the average
and the standard deviation of Euclidean norm of the samples with increasing d.
d. What can you conclude from your plot from part (c)?
e. Bonus, not for grade. Mathematically justify your conclusion using a formal proof. You
are free to use any familiar laws of probability, algebra, or geometry.
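A minimal Python sketch for parts (a)-(c), assuming NumPy and Matplotlib; the random seed,
histogram bin count, and the exact list of dimensions are arbitrary choices.

import numpy as np
import matplotlib.pyplot as plt

def sample_norms(d, n_samples=10000, seed=0):
    # Draw n_samples points from the standard Gaussian in R^d and
    # return the Euclidean norm of each point.
    rng = np.random.default_rng(seed)
    return np.linalg.norm(rng.standard_normal((n_samples, d)), axis=1)

# Parts (a)-(b): d = 3
norms = sample_norms(3)
plt.hist(norms, bins=50)
plt.xlabel('Euclidean norm'); plt.ylabel('count'); plt.show()
print('d=3: mean =', norms.mean(), ' std =', norms.std())

# Part (c): sweep d and track the mean and standard deviation of the norms
dims = [3, 50, 100, 200, 500, 1000]
stats = [(sample_norms(d).mean(), sample_norms(d).std()) for d in dims]
plt.plot(dims, [m for m, s in stats], marker='o', label='mean of norm')
plt.plot(dims, [s for m, s in stats], marker='o', label='std of norm')
plt.xscale('log'); plt.xlabel('dimension d'); plt.legend(); plt.show()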
4. (20 points) The goal of this problem is to implement a very simple text retrieval system. Given
(as input) a database of documents as well as a query document (all provided in an attached
.zip file), write a program, in a language of your choice, to find the document in the database
that is the best match to the query. Specifically:
a. Write a small parser to read each document and convert it into a vector of words.
b. Compute tf-idf values for each word in every document as well as the query.
c. Compute the cosine similarity between tf-idf vectors of each document and the query.
d. Report the document with the maximum similarity value.
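One possible pipeline for this problem is sketched below; the file layout (docs/*.txt and
query.txt) is a placeholder, and sklearn's TfidfVectorizer stands in for the hand-written parser
and tf-idf computation of parts (a)-(b).

import glob
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_paths = sorted(glob.glob('docs/*.txt'))                  # hypothetical locations
docs = [open(p, encoding='utf-8').read() for p in doc_paths]
query = open('query.txt', encoding='utf-8').read()

vectorizer = TfidfVectorizer()                               # parts (a)-(b): tf-idf vectors
doc_vecs = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform([query])

sims = cosine_similarity(query_vec, doc_vecs).ravel()        # part (c): cosine similarities
best = int(np.argmax(sims))                                  # part (d): best-matching document
print('best match:', doc_paths[best], 'similarity:', sims[best])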
5. (optional) How much time (in hours) did you spend working on this assignment?

1. (10 points) In class we derived the optimal linear predictor for scalar
data, and wrote down the optimal linear predictor for vector data (without
proof). Here, let us derive a proof for the optimal linear predictor in the
vector case. Suppose that {x1, x2, . . . , xn} denote training data samples,
where each x_i ∈ R^d. Suppose {y1, y2, . . . , yn} denote corresponding (scalar)
labels.
a. Show that the mean squared-error loss function for multivariate linear
regression can be written in the following form:
MSE(w) = \|y - Xw\|_2^2
where X is an n × (d + 1) matrix and where the first column of X is
all-ones. What is the dimension of w and y? What do the coordinates
of w represent?
b. Theoretically prove that the optimal linear regression weights are
given by:
\hat{w} = (X^T X)^{-1} X^T y
What algebraic assumptions on X did you make in order to derive
the closed form?
2. (10 points) In class, we argue that convexity together with smoothness is
a sufficient condition for minimizing a loss function using gradient descent.
Is convexity also a necessary condition for gradient descent to successfully
train a model? Argue why or why not. You can intuitively explain your
answer in words and/or draw a figure in support of your explanation. (Hint:
What about f(x) = x^2 + 3 sin^2 x?)
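As a quick numerical companion to the hint, the sketch below runs plain gradient descent on
f(x) = x^2 + 3 sin^2 x, which is not convex (f''(x) = 2 + 6 cos 2x changes sign) yet has a single
stationary point at x = 0; the starting points and step size are arbitrary choices.

import numpy as np

def grad_f(x):
    return 2 * x + 3 * np.sin(2 * x)     # derivative of x^2 + 3 sin^2(x)

for x0 in (-4.0, -1.0, 2.5, 5.0):
    x = x0
    for _ in range(2000):
        x -= 0.05 * grad_f(x)            # fixed-step gradient descent
    print(f'start {x0:5.1f} -> x = {x:.6f}')   # converges to the global minimum at 0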
3. (20 points) The goal of this problem is to implement multivariate linear
regression from scratch using gradient descent and validate it.
a. Implement a function for learning the parameters of a linear model for
given training data with a user-specified learning rate η and number
of epochs T. Note: you cannot use existing libraries such as sklearn;
you need to write it out yourself.
b. Validate your algorithm on the glucose dataset discussed in Lecture 2.
Confirm that the model obtained by running your code gets similar
R2 values as the one produced by sklearn.
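A from-scratch sketch of part (a), assuming NumPy only; the bias is handled by prepending a
column of ones, and eta/T are whatever the user passes in.

import numpy as np

def fit_linear_gd(X, y, eta=0.01, T=1000):
    # Full-batch gradient descent on the mean squared error of y ~ [1, X] w.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(T):
        w -= eta * (2.0 / len(y)) * Xb.T @ (Xb @ w - y)
    return w

def predict(X, w):
    return np.hstack([np.ones((X.shape[0], 1)), X]) @ w

# Part (b): fit on the glucose data and compare against sklearn, e.g. via
# sklearn.metrics.r2_score(y, predict(X, fit_linear_gd(X, y))).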
4. (10 points) In this lab, we will illustrate the use of multiple linear regression for calibrating robot control. The robot data for the lab is taken
from TU Dortmund’s Multiple Link Robot Arms Project. We will focus
on predicting the current drawn into one of the joints as a function of the
robot motion. Such models are essential in predicting the overall robot
power consumption.
a. Read in the data in the attached exp_train.csv file; check that the
data that you read actually corresponds to the data in the .csv file. In
Python, you can use the commands given at the end of this document.
b. Create the training data:
• Labels y: A vector of all the samples in the ‘I2’ column
• Data X: A matrix of the data with the columns: [‘q2’,‘dq2’,‘eps21’,
‘eps22’, ‘eps31’, ‘eps32’,‘ddq2’]
c. Fit a linear model between X and y (using sklearn, or any other
library of your choice). Report the MSE of your model.
d. Using the linear model that you trained above, report the MSE on
the test data contained in the attached exp_test.csv file.
5. (optional) How much time (in hours) did you spend working on this
assignment?
import pandas as pd

names = [
    't',                                  # Time (secs)
    'q1', 'q2', 'q3',                     # Joint angle
    'dq1', 'dq2', 'dq3',                  # Joint velocity
    'I1', 'I2', 'I3',                     # Motor current (A)
    'eps21', 'eps22', 'eps31', 'eps32',   # Strain measurements
    'ddq1', 'ddq2', 'ddq3'                # Joint accelerations
]
df = pd.read_csv('exp_train.csv', header=None, sep=',',
                 names=names, index_col=0)
df.head(6)
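Continuing from the loader above, a minimal sketch for parts (b)-(d) of problem 4, assuming
exp_test.csv uses the same column layout as exp_train.csv.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

features = ['q2', 'dq2', 'eps21', 'eps22', 'eps31', 'eps32', 'ddq2']
X_train, y_train = df[features], df['I2']               # part (b)

model = LinearRegression().fit(X_train, y_train)        # part (c)
print('train MSE:', mean_squared_error(y_train, model.predict(X_train)))

df_test = pd.read_csv('exp_test.csv', header=None, sep=',',
                      names=names, index_col=0)         # part (d)
print('test MSE:', mean_squared_error(df_test['I2'],
                                      model.predict(df_test[features])))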

1. (10 points) Suppose we wish to learn a regularized least squares model:
L(w) = \frac{1}{2} \sum_{i=1}^{n} (y_i - \langle w, x_i \rangle)^2 + \lambda R(w)
where R(w) is a regularization function to be determined. Suggest good choices for R(w) if
the following criteria need to be achieved (there are no unique correct answers) and justify
your choice in a sentence or two:
a. All parameters w are free to be determined.
b. w should be sparse (i.e., only a few coefficients of w are nonzero).
c. The coefficients of w should be small in magnitude on average.
d. For most indices j, w_j should be equal to w_{j-1}.
e. w should have no negative-valued coefficients.
2. (10 points) The Boston Housing Dataset has been collected by the US Census Service and
consists of 14 urban quality-of-life variables, with the last one being the median house price for
a given town. Code for loading the dataset is provided at the end of this assignment. Implement
a linear regression model with ridge regression that predicts median house prices from the
other variables. Use 10-fold cross validation on 80-20 train-test splits and report the final R2
values that you discovered. (You may want to preprocess your data to the range [0, 1] in order
to get meaningful results.)
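A minimal sketch, assuming the boston DataFrame produced by the loader at the end of this
assignment (note that load_boston was removed from scikit-learn 1.2, so an older version or a
cached copy of the data is needed); the ridge strength alpha is an arbitrary choice.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

X = boston.drop(columns='MEDV').to_numpy()
y = boston['MEDV'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)                      # 80-20 split

model = make_pipeline(MinMaxScaler(), Ridge(alpha=1.0))       # scale to [0, 1], then ridge
print('10-fold CV R^2:', cross_val_score(model, X_train, y_train,
                                         cv=10, scoring='r2').mean())
model.fit(X_train, y_train)
print('test R^2:', model.score(X_test, y_test))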
3. (10 points) In class, we discussed the lasso objective, where the regularizer was chosen to
be the ℓ1-norm. Here, we will derive an analytical closed form expression for the minimizer
of a slightly simpler problem. Suppose x is a d-dimensional input and w is a d-dimensional
variable. Show that the minimizer of the loss function:
L(w) = \frac{1}{2} \|x - w\|_2^2 + \lambda \|w\|_1
is given by:
w^*_i = x_i − λ if x_i > λ,
        x_i + λ if x_i < −λ,
        0 otherwise.
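As a quick sanity check of the claimed minimizer (not a proof), the snippet below compares the
soft-threshold solution against random perturbations of it on a random instance; the dimension,
λ, and perturbation scale are arbitrary.

import numpy as np

def loss(w, x, lam):
    return 0.5 * np.sum((x - w)**2) + lam * np.sum(np.abs(w))

def soft_threshold(x, lam):
    # Implements the claimed closed form coordinate-wise.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
x, lam = rng.normal(size=10), 0.5
w_star = soft_threshold(x, lam)
others = w_star + 0.1 * rng.normal(size=(1000, 10))
assert all(loss(w_star, x, lam) <= loss(w, x, lam) for w in others)
print('loss at closed form:', loss(w_star, x, lam))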
4. (20 points) In this problem, we will implement logistic regression trained with GD/SGD and
validate on synthetic training data.
a. Suppose that the data dimension d equals 2. Generate two clusters of data points with
100 points each (so that the total data size is n = 200), by sampling from Gaussian
distributions centered at (0.5, 0.5) and (−0.5, −0.5). Call the data points x_i, and label
them as y_i = ±1 depending on which cluster they originated from. Choose the variance
of the Gaussian to be small enough so that the data points are sufficiently well separated.
Plot the data points on the 2D plane to confirm that this is the case.
b. (Derive your own GD routines; do not use sklearn functions here.) Train a logistic
regression model that tries to minimize:
L(w) = -\sum_{i=1}^{n} \left[ y_i \log \frac{1}{1 + e^{-\langle w, x_i \rangle}} + (1 - y_i) \log \frac{e^{-\langle w, x_i \rangle}}{1 + e^{-\langle w, x_i \rangle}} \right]
using Gradient Descent (GD). Plot the decay of the training loss function as a function of
number of iterations.
c. Train the same logistic regression model, but this time using Stochastic Gradient Descent
(SGD). Demonstrate that SGD exhibits a slower rate of convergence than GD, but is faster
per-iteration, and does not suffer in terms of final quality. You may have to play around a
bit with the step-size parameters as well as mini-batch sizes to get reasonable answers.
d. Overlay the original plot of data points on the 2D data plane with the two (final) models
that you obtained above in parts b and c to visualize correctness of your implementation.
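A minimal sketch of parts (a)-(c), with labels recoded to 0/1 so they match the loss written
above; the cluster variance, step sizes, mini-batch size, and iteration counts are arbitrary
choices.

import numpy as np

rng = np.random.default_rng(0)
n_per = 100
X = np.vstack([rng.normal(( 0.5,  0.5), 0.2, size=(n_per, 2)),
               rng.normal((-0.5, -0.5), 0.2, size=(n_per, 2))])
y = np.concatenate([np.ones(n_per), np.zeros(n_per)])       # 0/1 labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, idx):
    return X[idx].T @ (sigmoid(X[idx] @ w) - y[idx])         # gradient on a batch of indices

w_gd, gd_curve = np.zeros(2), []                             # part (b): full-batch GD
for _ in range(200):
    w_gd -= 0.01 * grad(w_gd, np.arange(len(y)))
    gd_curve.append(loss(w_gd))

w_sgd, sgd_curve = np.zeros(2), []                           # part (c): mini-batch SGD
for _ in range(200):
    w_sgd -= 0.01 * grad(w_sgd, rng.choice(len(y), 10, replace=False))
    sgd_curve.append(loss(w_sgd))

# Plot gd_curve and sgd_curve versus iteration, and overlay the boundaries
# w . x = 0 on a scatter plot of X for part (d).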
5. (optional) How much time (in hours) did you spend working on this assignment?
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target
boston.head()

1. (10 points) In class, we discussed how to represent XOR-like functions using quadratic
features, since standard linear classifiers (such as perceptrons) are insufficient for this task.
However, here we show that XOR-like functions can indeed be simulated using multi-layer
networks of perceptrons. This example shows a glimpse of the expressive power of “deep
neural networks”: merely increasing the depth from 1 to 2 layers can help reproduce nonlinear
decision boundaries.
a. Consider a standard two-variable XOR function, where we have 2-dimensional inputs
x1, x2 = ±1, and output y = x1 (XOR) x2 = −1 if x1 = x2, and 1 otherwise.
Geometrically argue why a single perceptron cannot be used to simulate the above function.
b. Graphically depict, and write down the equation for, the optimal decision region for the
following logical functions:
(i) x1(AND)(NOT(x2))
(ii) (NOT(x1))(AND)x2
(iii) x1(OR)x2
Make note of the weights learned corresponding to the optimal decision
boundary for each function.
c. Using the above information, simulate a multi-layer perceptron network for the XOR
operation with the learned weights from Part (b).
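A minimal sketch of part (c), hard-coding one valid choice of weights (the two hidden units
implement x1 AND NOT x2 and NOT x1 AND x2, and the output unit ORs them); other weight choices
from part (b) work equally well.

import numpy as np

def perceptron(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else -1   # sign-activation unit

def xor_net(x1, x2):
    h1 = perceptron([x1, x2], [ 1, -1], -1)    # fires only for x1 = +1, x2 = -1
    h2 = perceptron([x1, x2], [-1,  1], -1)    # fires only for x1 = -1, x2 = +1
    return perceptron([h1, h2], [1, 1], 1)     # OR of the two hidden units

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, '->', xor_net(x1, x2))   # -1 when x1 == x2, +1 otherwise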
2. (10 points) In this problem, we will implement the Perceptron algorithm discussed in class on
synthetic training data.
a. Suppose that the data dimension d equals 2. Generate two clusters of data points with
100 points each, by sampling from Gaussian distributions centered at (0.5, 0.5) and
(−0.5, −0.5). Choose the variance of the Gaussian to be small enough so that the data
points are sufficiently well separated. Plot the data points on the 2D plane to confirm that
this is the case.
b. Implement the Perceptron algorithm as discussed in class. Choose the initial weights to
be zero and the maximum number of epochs as T = 100, and the learning rate α = 1.
How quickly does your implementation converge?
c. Now, repeat the above experiment with a second synthetic dataset; this time, increase the
variance (radius) of the two Gaussians such that the generated data points from different
classes now overlap. What happens to the behavior of the algorithm? Does it converge?
Show the classification regions obtained at the end of T epochs.
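A minimal sketch of parts (a)-(b); the cluster variance is an arbitrary small value, labels are
±1, and the bias is folded into the weight vector via a constant feature.

import numpy as np

rng = np.random.default_rng(0)
n_per = 100
X = np.vstack([rng.normal(( 0.5,  0.5), 0.15, size=(n_per, 2)),
               rng.normal((-0.5, -0.5), 0.15, size=(n_per, 2))])
y = np.concatenate([np.ones(n_per), -np.ones(n_per)])
Xb = np.hstack([X, np.ones((2 * n_per, 1))])          # append a constant feature for the bias

w, alpha, T = np.zeros(3), 1.0, 100
for epoch in range(T):
    mistakes = 0
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:                        # misclassified (or on the boundary)
            w += alpha * yi * xi                      # perceptron update
            mistakes += 1
    if mistakes == 0:
        print('converged after', epoch + 1, 'epochs')
        break
print('weights:', w)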
3. (10 points) In Lectures 2 and 5, we derived a closed form expression for solving linear and
ridge regression problems. This is great for finding linear behavior in data; however, if the data
is nonlinear, just as in the classification case, we have to resort to the kernel trick, i.e., replace
all dot products in the data space with kernel inner products. In this problem, we theoretically
derive kernel regression. Suppose we are given training data {(x1, y1), (x2, y2), . . . , (xn, yn)},
where each response y_i is a scalar, and each data point x_i is a vector in d dimensions.
a. Assuming that all data have been mapped into a higher dimensional space using the feature
mapping x ↦ φ(x), write down an expression for the squared error function using a
linear predictor w in the high-dimensional space.
b. Let Φ be the matrix with n rows, where row i consists of the feature mapping φ(xi).
Write down a closed form expression for the optimal linear predictor w as a function of
Φ and y.
c. For a new query data point z, the predicted value is given by f(z) = hw, φ(z)i. Plug in
the closed form expression for w from the previous sub-problem to get an expression for
f(z).
d. Suppose you are given black-box access to a kernel dot product function K where
K(x, x′) = ⟨φ(x), φ(x′)⟩. Mathematically show that all the calculations in (b) and (c)
can be performed by invoking the kernel dot product alone, without explicitly writing
down φ(x) ever. You may want to use the Sherman-Morrison-Woodbury identity for
matrices:
(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}.
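For orientation on parts (b)-(d): if a ridge term λ‖w‖^2 is included (an assumption; with λ = 0
the same steps go through whenever the relevant inverses exist), the identity above with
A = λ^{-1} I, B = Φ, C = I gives
w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y = \Phi^T (\Phi \Phi^T + \lambda I)^{-1} y,
\qquad
f(z) = \langle w, \phi(z) \rangle = k(z)^T (K + \lambda I)^{-1} y,
where K_{ij} = K(x_i, x_j) and k(z)_i = K(x_i, z), so every quantity can be computed from kernel
evaluations alone.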
4. (20 points) The Fashion MNIST dataset is a database of (low-resolution) clothing images that
is similar to MNIST but somewhat more challenging. You can load it in Colab using the Python
code below.
a. Load the dataset and display 10 representative images from each class.
b. Implement the following classification methods: k-NN, logistic regression, and support
vector machines (with linear and rbf kernels). You can use sklearn.
c. Report best possible test-error performances by tuning hyperparameters in each of your
methods.
d. Report train- and test-running time of each of your methods in the form of a table, and
comment on the relative tradeoffs across the different methods.
import tensorflow as tf
from tensorflow import keras
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
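A minimal sketch for parts (b)-(c), flattening the images and subsampling the training set so
the SVMs finish in reasonable time; the subsample size and hyperparameter values shown are
arbitrary starting points, not tuned answers.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_train = train_images.reshape(len(train_images), -1) / 255.0   # flatten to 784-dim, scale to [0, 1]
X_test = test_images.reshape(len(test_images), -1) / 255.0

rng = np.random.default_rng(0)
idx = rng.choice(len(X_train), 10000, replace=False)            # subsample for speed

models = {
    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'logistic regression': LogisticRegression(max_iter=200),
    'linear SVM': SVC(kernel='linear'),
    'RBF SVM': SVC(kernel='rbf'),
}
for name, clf in models.items():
    clf.fit(X_train[idx], train_labels[idx])
    print(name, 'test accuracy:', clf.score(X_test, test_labels))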

1. (10 points) Consider a one-hidden layer neural network (without biases for simplicity) with
sigmoid activations trained with squared-error loss. Draw the computational graph and derive
the forward and backward passes used in backpropagation for this network.
\hat{y} = W_2 \sigma(W_1 x), \qquad L = \|\hat{y} - y\|_2^2
Qualitatively compare the computational complexity of the forward and backward passes.
Which pass is more expensive and by roughly how much?
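For reference, one way to organize the two passes with the intermediate variables named
explicitly (treating x as a single input vector; the constant in the gradient follows from the
unscaled squared-error loss above):
Forward:  z_1 = W_1 x,  a_1 = \sigma(z_1),  \hat{y} = W_2 a_1,  L = \|\hat{y} - y\|_2^2
Backward: \partial L / \partial \hat{y} = 2(\hat{y} - y)
          \partial L / \partial W_2 = 2(\hat{y} - y) \, a_1^T
          \partial L / \partial a_1 = W_2^T \, 2(\hat{y} - y)
          \partial L / \partial z_1 = (\partial L / \partial a_1) \odot \sigma(z_1) \odot (1 - \sigma(z_1))
          \partial L / \partial W_1 = (\partial L / \partial z_1) \, x^T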
2. (10 points) Suppose that a convolutional layer of a neural network has an input tensor X[i, j, k]
and computes an output as follows:
Z[i, j, m] = \sum_{k_1} \sum_{k_2} \sum_{n} W[k_1, k_2, n, m] \, X[i + k_1, j + k_2, n] + b[m]
Y[i, j, m] = ReLU(Z[i, j, m])
for some kernel W and bias b. Suppose X and W have shapes (48, 64, 10) and (3, 3, 10, 20)
respectively.
a. What are the shapes of Z and Y ?
b. What are the number of input and output channels?
c. How many multiply- and add- operations are required to perform a forward pass through
this layer? Rough calculations are OK.
d. What are the total number of trainable parameters in this layer?
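A quick PyTorch cross-check of the shape and parameter-count answers, assuming the layer uses
no padding and stride 1 (which is what the index formula above implies); the input is arranged
in torch's channels-first layout.

import torch
import torch.nn as nn

x = torch.randn(1, 10, 48, 64)                       # X of shape (48, 64, 10), channels first, batch of 1
conv = nn.Conv2d(in_channels=10, out_channels=20, kernel_size=3)
y = torch.relu(conv(x))
print(y.shape)                                       # torch.Size([1, 20, 46, 62])
print(sum(p.numel() for p in conv.parameters()))     # 3*3*10*20 weights + 20 biases = 1820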
3. (10 points) The use of ℓ2 regularization for training multi-layer neural networks has a special
name: weight decay. Assume an arbitrary dataset {(x_i, y_i)}_{i=1}^{n} and a loss function L(w) where
w represents the trainable weights (and biases).
a. Write down the `2 regularized loss, using a weighting parameter λ for the regularizer.
b. Derive the gradient descent update rules for this loss.
c. Conclude that in each update, the weights are “shrunk” or “decayed” by a multiplicative
factor before applying the descent update.
d. What does increasing λ achieve algorithmically, and how should the learning rate be
chosen to make the updates stable?
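For orientation, with the regularizer written as (λ/2)‖w‖_2^2 (the 1/2 is a common but not
mandatory convention) and learning rate η, the update takes the form
w ← w − η (∇L(w) + λ w) = (1 − η λ) w − η ∇L(w),
i.e., the weights are first shrunk by the factor (1 − η λ) and then moved along the negative
gradient.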
4. (20 points) In this exercise, we will fine-tune a pre-trained deep network (ResNet34) for a
particular two-class dataset which can be downloaded from the attached zip file. Code for
pre-processing the dataset, and for loading ResNet34, can be found below. Since ResNet34
is for 1000 output classes, you will need to modify the last layer to output two classes. Train
for 20 epochs and plot train and validation loss curves. Report your final train and validation
accuracies.
import torchvision
from torchvision import datasets, models, transforms

transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
train_set = datasets.ImageFolder("data/train", transforms)
val_set = datasets.ImageFolder("data/val", transforms)
model = models.resnet34(pretrained=True)
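Continuing from the snippet above, a minimal sketch of the head replacement and training loop;
the batch size, optimizer, and learning rate are arbitrary choices.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model.fc = nn.Linear(model.fc.in_features, 2)       # swap the 1000-way head for a 2-way head
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(20):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        criterion(model(xb), yb).backward()
        optimizer.step()
    # Evaluate on val_loader here and record train/val losses and accuracies for the plots.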

1. (10 points) Assume that you have 4 samples each with dimension 3, described in the data
matrix X,
X = \begin{bmatrix} 3 & 2 & 1 \\ 2 & 4 & 5 \\ 1 & 2 & 3 \\ 0 & 2 & 5 \end{bmatrix}
For the problems below, you may do the calculations in python (or R or Matlab). Explain your
calculations in each step.
a. Find the sample mean.
b. Zero-center the samples, and find the eigenvalues and eigenvectors of the data covariance
matrix Q.
c. Find the PCA coefficients corresponding to each of the samples in X.
d. Reconstruct the original samples from the top two principal components, and report the
reconstruction error for each of the samples.
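A minimal NumPy sketch of the calculations (shown with the 1/n covariance convention; 1/(n−1)
is equally acceptable if used consistently):

import numpy as np

X = np.array([[3, 2, 1],
              [2, 4, 5],
              [1, 2, 3],
              [0, 2, 5]], dtype=float)

mu = X.mean(axis=0)                                   # (a) sample mean
Xc = X - mu                                           # (b) zero-centered samples
Q = Xc.T @ Xc / X.shape[0]                            #     data covariance matrix
evals, evecs = np.linalg.eigh(Q)                      #     eigenvalues/eigenvectors (ascending)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

coeffs = Xc @ evecs                                   # (c) PCA coefficients of each sample
X_rec = mu + coeffs[:, :2] @ evecs[:, :2].T           # (d) reconstruction from top-2 PCs
print(mu, evals, np.linalg.norm(X - X_rec, axis=1), sep='\n')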
2. (10 points) In class, we analyzed the per-iteration complexity of k-means. Here, we will prove
that the k-means algorithm will terminate in a finite number of iterations. Consider a data set
X = {x1, . . . , xn} ∈ R^d.
a. Show that the k-means loss function can be re-written in the form:
F(\eta, \mu) = \sum_{i=1}^{n} \sum_{j=1}^{k} \eta_{ij} \|x_i - \mu_j\|^2
where η = (η_ij) is a suitable binary matrix (with 0/1 entries). Provide a precise interpretation of η.
b. Show that each iteration of Lloyd’s algorithm can only decrease the value of F.
c. Conclude that the algorithm will terminate in no more than T iterations, where T is some
finite number. Give an upper bound on T in terms of the number of points n.
3. (10 points) Using the Senate Votes dataset demo’ed in Lecture 11, perform k-means clustering
with k = 2 and show that you can learn (most of) the Senators’ parties in a completely
unsupervised manner. Which Senators did your algorithm make a mistake on, and why?
4. (20 points) The Places Rated Almanac, written by Boyer and Savageau, rates the livability of
several US cities according to nine factors: climate, housing, healthcare, crime, transportation,
education, arts, recreation, and economic welfare. The ratings are available in tabular form,
available as a supplemental text file. Except for housing and crime, higher ratings indicate
better quality of life. Let us use PCA to interpret this data better.
a. Load the data and construct a table with 9 columns containing the numerical ratings.
(Ignore the last 5 columns – they consist of auxiliary information such as longitude/latitude,
state, etc.)
b. Replace each value in the matrix by its base-10 logarithm. (This pre-processing is done
for convenience since the numerical range of the ratings is large.) You should now have a
data matrix X whose rows are 9-dimensional vectors representing the different cities.
c. Perform PCA on the data. Remember to center the data points first by computing the
mean data vector µ and subtracting it from every point. With the centered data matrix, do
an SVD and compute the principal components.
d. Write down the first two principal components v1 and v2. Provide a qualitative interpretation of the components. Which among the nine factors do they appear to correlate the
most with?
e. Project the data points onto the first two principal components. (That is, compute the
highest 2 scores of each of the data points.) Plot the scores as a 2D scatter plot. Which
cities correspond to outliers in this scatter plot?
f. Repeat Steps 2-5, but with a slightly different data matrix – instead of computing the
base-10 logarithm, use the z-scores. (The z-score is calculated by computing the mean µ
and standard deviation σ for each feature, and normalizing each entry x by (x − µ)/σ.) How
do your answers change?
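A minimal sketch of parts (a)-(e), assuming the supplemental file is whitespace-separated with
the nine rating columns first (the file name and exact layout are assumptions to be adjusted):

import numpy as np
import pandas as pd

places = pd.read_csv('places.txt', sep=r'\s+', header=None)   # hypothetical file name
X = np.log10(places.iloc[:, :9].to_numpy(dtype=float))        # parts (a)-(b)

mu = X.mean(axis=0)
Xc = X - mu                                                   # part (c): center, then SVD
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
v1, v2 = Vt[0], Vt[1]                                         # part (d): first two principal components
scores = Xc @ Vt[:2].T                                        # part (e): 2D projections for the scatter plot
# Part (f): replace the log10 transform with per-column z-scores, (x - mean) / std.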
