[Solved] CS_DS541 Homework2 Deep Learning

  1. XOR problem [on paper]: Show (by deriving the gradient, setting it to 0, and solving mathematically, not in Python) that the values of $w = (w_1, w_2)$ and $b$ that minimize the function $J(w, b)$ in Equation 6.1 of the Deep Learning textbook are $w_1 = 0$, $w_2 = 0$, and $b = 0.5$.
  2. L2-regularized Linear Regression via Stochastic Gradient Descent [in Python]: Train a 2-layer neural network (i.e., linear regression) for age regression using the same data as in Homework 1. Your prediction model should be $\hat{y} = x^\top w + b$. You should regularize $w$ but not $b$. Note that, in contrast to Homework 1, this model includes a bias term.

Instead of optimizing the weights of the network with the closed formula, use stochastic gradient descent (SGD). There are several different hyperparameters that you will need to choose:

  • Mini-batch size n.
  • Learning rate $\epsilon$.
  • Number of epochs.
  • L2 regularization strength $\alpha$.
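The SGD procedure governed by these hyperparameters can be sketched as follows. This is a minimal illustration, not the reference solution: it assumes the cost is the half mean squared error plus half the regularization strength times the squared norm of w, and the names `sgd_linear_regression`, `lr`, `alpha`, and the synthetic data are hypothetical stand-ins for the real age-regression files.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size, lr, epochs, alpha, seed=0):
    """Fit y_hat = X @ w + b by mini-batch SGD, penalizing w but not b.

    Assumed cost: (1 / (2n)) * sum((y_hat - y)^2) + (alpha / 2) * w.T @ w
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            err = Xb @ w + b - yb                  # residuals on this mini-batch
            grad_w = Xb.T @ err / len(idx) + alpha * w   # L2 term acts on w only
            grad_b = err.mean()                          # bias is not regularized
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

def mse(X, y, w, b):
    """Unregularized mean squared error."""
    return float(np.mean((X @ w + b - y) ** 2))

# Synthetic stand-in data (hypothetical values, not the homework dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_demo = X_demo @ w_true + 4.0 + 0.01 * rng.normal(size=200)

w_fit, b_fit = sgd_linear_regression(
    X_demo, y_demo, batch_size=20, lr=0.05, epochs=200, alpha=0.0)
print("unregularized MSE:", mse(X_demo, y_demo, w_fit, b_fit))
```

Keeping the penalty out of `grad_b` is what "regularize w but not b" amounts to in the update rule.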

To avoid cheating (in the machine learning sense) and thus overestimating the performance of the network, it is crucial to optimize the hyperparameters only on a validation set. (The training set would also be acceptable but typically leads to worse performance.) To create a validation set, simply set aside a fraction (e.g., 20%) of age_regression_Xtr.npy and age_regression_ytr.npy; the remainder (80%) of these data files will constitute the actual training data. While there are fancier strategies for hyperparameter optimization (e.g., Bayesian optimization, another probabilistic method, by the way!), it's common to just use a grid search over a few values for each hyperparameter. In this problem, you are required to explore systematically (e.g., using nested for loops) at least 4 different values for each hyperparameter.
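The hold-out split and grid search described above might look like the sketch below. It is only an illustration under stated assumptions: the data are synthetic rather than loaded from the .npy files, the grid values are arbitrary, and `sgd_fit`, `grid_search`, and the dictionary keys are hypothetical names. `itertools.product` plays the role of the nested for loops.

```python
import itertools
import numpy as np

def sgd_fit(X, y, batch_size, lr, epochs, alpha, rng):
    """Compact mini-batch SGD for y_hat = X @ w + b with an L2 penalty on w only."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for s in range(0, len(y), batch_size):
            idx = order[s:s + batch_size]
            err = X[idx] @ w + b - y[idx]
            w -= lr * (X[idx].T @ err / len(idx) + alpha * w)
            b -= lr * err.mean()
    return w, b

def mse(X, y, w, b):
    """Unregularized mean squared error."""
    return float(np.mean((X @ w + b - y) ** 2))

def grid_search(X, y, grid, val_fraction=0.2, seed=0):
    """Hold out a validation set, then try every hyperparameter combination."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val_idx, tr_idx = order[:n_val], order[n_val:]
    best = {"val_mse": np.inf}
    for batch_size, lr, epochs, alpha in itertools.product(
            grid["batch_size"], grid["lr"], grid["epochs"], grid["alpha"]):
        w, b = sgd_fit(X[tr_idx], y[tr_idx], batch_size, lr, epochs, alpha, rng)
        score = mse(X[val_idx], y[val_idx], w, b)   # selection uses validation data only
        if score < best["val_mse"]:                 # a NaN score never wins this comparison
            best = {"val_mse": score, "w": w, "b": b,
                    "params": {"batch_size": batch_size, "lr": lr,
                               "epochs": epochs, "alpha": alpha}}
    return best

# Synthetic stand-in data and an arbitrary 4x4x4x4 grid (both hypothetical).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 3))
w_true = np.array([1.5, -0.5, 2.0])
y_demo = X_demo @ w_true + 3.0 + 0.1 * rng.normal(size=200)
grid = {"batch_size": [10, 20, 40, 80],
        "lr": [0.001, 0.01, 0.05, 0.1],
        "epochs": [5, 10, 20, 40],
        "alpha": [0.0, 0.001, 0.01, 0.1]}
best = grid_search(X_demo, y_demo, grid)
print(best["params"], best["val_mse"])
```

The test set never appears inside `grid_search`; it would be touched once, after the best configuration has been chosen.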

Performance evaluation: Once you have tuned the hyperparameters and optimized the weights so as to minimize the cost on the validation set, then: (1) stop training the network and (2) evaluate the network on the test set. Report the performance in terms of unregularized MSE.

  3. Regularization to encourage symmetry [10 points, on paper]: Faces (and some other kinds of data) tend to be left-right symmetric. How can you use L2 regularization to discourage the weights from becoming too asymmetric? For simplicity, consider the case of a tiny 1×2 image. Hint: instead of using $\frac{\alpha}{2} w^\top I w$ as the L2 penalty term (where $\alpha$ is the regularization strength), consider a different matrix in the middle. Your answer should consist of a 2×2 matrix $S$ as well as an explanation of why it works.
  4. Recursive state estimation in Hidden Markov Models [10 points, on paper]: Teachers try to monitor their students' knowledge of the subject matter, but teachers cannot directly peer inside students' brains. Hence, they must make inferences about what the student knows based on the student's observable behavior, i.e., how they perform on tests, their facial expressions during class, etc. Let random variable (RV) $X_t$ represent the student's state, and let RV $Y_t$ represent the student's observable behavior, at time $t$. We can model the student as a Hidden Markov Model (HMM):

  • $X_t$ depends only on the previous state $X_{t-1}$, not on any states prior to that (Markov property), i.e.:

$$P(x_t \mid x_1, \ldots, x_{t-1}) = P(x_t \mid x_{t-1})$$

  • The student's behavior $Y_t$ depends only on his/her current state $X_t$, i.e.:

$$P(y_t \mid x_t, y_1, \ldots, y_{t-1}) = P(y_t \mid x_t)$$

  • Xt cannot be observed directly (it is hidden).

A probabilistic graphical model for the HMM is shown below, where only the observed RVs are shaded (the latent ones are transparent):

Suppose that the teacher already knows:

  • $P(y_t \mid x_t)$ (observation likelihood), i.e., the probability distribution of the student's behaviors given the student's state.
  • $P(x_t \mid x_{t-1})$ (transition dynamics), i.e., the probability distribution of the student's current state given the student's previous state.

The goal of the teacher is to estimate the student's current state $X_t$ given the entire history of observations $Y_1, \ldots, Y_t$ he/she has made so far. Show that the teacher can, at each time $t$, update his/her belief recursively:

$$P(x_t \mid y_1, \ldots, y_t) \propto P(y_t \mid x_t) \sum_{x_{t-1}} P(x_t \mid x_{t-1})\, P(x_{t-1} \mid y_1, \ldots, y_{t-1})$$

where $P(x_{t-1} \mid y_1, \ldots, y_{t-1})$ is the teacher's belief of the student's state from time $t-1$, and the summation is over every possible value of the previous state $x_{t-1}$. Hint: You will need to use Bayes' rule, i.e., for any RVs $A$, $B$, and $C$:

$$P(a \mid b, c) = \frac{P(b \mid a, c)\, P(a \mid c)}{P(b \mid c)}$$

However, since the denominator in the right-hand side does not depend on $a$, this can also be rewritten as:

$$P(a \mid b, c) \propto P(b \mid a, c)\, P(a \mid c)$$
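The recursive belief update stated in this problem can be illustrated numerically. The sketch below is not a proof and not part of the assignment: the two-state student model, the `transition` and `emission` tables, the uniform `prior` over the first state, and the observation sequence are all hypothetical values chosen for illustration.

```python
import numpy as np

def filter_step(prev_belief, transition, emission, obs):
    """One recursive update, returning P(x_t | y_1..y_t).

    prev_belief[i]   = P(x_{t-1} = i | y_1..y_{t-1})
    transition[i, j] = P(x_t = j | x_{t-1} = i)
    emission[j, k]   = P(y_t = k | x_t = j)
    """
    predicted = transition.T @ prev_belief        # sum over the previous state
    unnormalized = emission[:, obs] * predicted   # weight by observation likelihood
    return unnormalized / unnormalized.sum()      # normalizing replaces the "proportional to"

# Hypothetical model: state 0 = "confused", state 1 = "understands";
# observation 0 = "fails quiz", observation 1 = "passes quiz".
transition = np.array([[0.7, 0.3],
                       [0.1, 0.9]])
emission = np.array([[0.8, 0.2],
                     [0.3, 0.7]])
prior = np.array([0.5, 0.5])          # assumed P(x_1)
observations = [1, 1, 0, 1]

# The first belief uses Bayes' rule with the prior; later ones use the recursion.
belief = emission[:, observations[0]] * prior
belief = belief / belief.sum()
for obs in observations[1:]:
    belief = filter_step(belief, transition, emission, obs)
print(belief)
```

Each step costs time quadratic in the number of states and needs only the previous belief, which is why the teacher does not have to revisit the full observation history.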

  5. Linear-Gaussian prediction model [15 points, on paper]:

Probabilistic prediction models enable us to estimate not just the most likely or expected value of the target $y$ (see figure above, right), but rather an entire probability distribution about which target values are more likely than others, given input $x$ (see figure above, left). In particular, a linear-Gaussian model is a Gaussian distribution whose expected value (mean $\mu$) is a linear function of the input features $x$, and whose variance is $\sigma^2$:

$$P(y \mid x) = \mathcal{N}(y;\ \mu = x^\top w,\ \sigma^2)$$

Note that, in general, $\sigma^2$ can also be a function of $x$ (heteroscedastic case). Moreover, non-linear Gaussian models are also completely possible, e.g., the mean (and possibly the variance) of the Gaussian distribution is output by a deep neural network. However, in this problem, we will assume that $\mu$ is linear in $x$, and that $\sigma^2$ is the same for all $x$ (homoscedastic case).

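MLE: The parameters of probabilistic models are commonly optimized by maximum likelihood estimation (MLE). (Another common approach is maximum a posteriori estimation, which allows the practitioner to incorporate a prior belief about the parameters' values.) Suppose the training dataset is $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$. Let the parameters/weights of the linear-Gaussian model be $w$, such that the mean $\mu = x^\top w$. Prove that the MLE of $w$ and $\sigma^2$ given $\mathcal{D}$ is:

$$w = \left(\sum_{i=1}^n x^{(i)} {x^{(i)}}^\top\right)^{-1} \left(\sum_{i=1}^n x^{(i)} y^{(i)}\right), \qquad \sigma^2 = \frac{1}{n} \sum_{i=1}^n \left({x^{(i)}}^\top w - y^{(i)}\right)^2$$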

Note that this solution derived based on maximizing probability is exactly the same as the optimal weights of a 2-layer neural network optimized to minimize MSE.

Hint: Follow the same strategy as the MLE derivation for a biased coin in Class2.pdf. For a linear-Gaussian model, the argmax of the likelihood equals the argmax of the log-likelihood. The log of the Gaussian likelihood simplifies beautifully.
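As a numerical cross-check of the MLE result (not a substitute for the written proof), the sketch below computes the closed-form estimates on synthetic data and evaluates the Gaussian log-likelihood. The function names `linear_gaussian_mle` and `log_likelihood`, the data-generating weights, and the noise level are hypothetical choices made for illustration.

```python
import numpy as np

def linear_gaussian_mle(X, y):
    """Closed-form MLE for a homoscedastic linear-Gaussian model (rows of X are x^(i))."""
    A = X.T @ X                      # sum_i x^(i) x^(i)^T
    c = X.T @ y                      # sum_i x^(i) y^(i)
    w = np.linalg.solve(A, c)        # solve A w = c rather than inverting A explicitly
    sigma2 = float(np.mean((X @ w - y) ** 2))   # mean squared residual
    return w, sigma2

def log_likelihood(X, y, w, sigma2):
    """Sum over examples of log N(y^(i); x^(i)^T w, sigma2)."""
    n = len(y)
    resid = X @ w - y
    return float(-0.5 * n * np.log(2 * np.pi * sigma2)
                 - resid @ resid / (2 * sigma2))

# Synthetic data with assumed true weights [2, -1] and noise variance 0.25.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=500)

w_hat, sigma2_hat = linear_gaussian_mle(X_demo, y_demo)
print(w_hat, sigma2_hat)
print(log_likelihood(X_demo, y_demo, w_hat, sigma2_hat))
```

Perturbing either `w_hat` or `sigma2_hat` and re-evaluating `log_likelihood` gives a lower value, which is a quick way to see that the closed-form estimates sit at the maximum.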

Put your code in a Python file called homework2_WPIUSERNAME1.py (or homework2_WPIUSERNAME1_WPIUSERNAME2.py for teams). For the proofs, please create a PDF called homework2_WPIUSERNAME1.pdf (or homework2_WPIUSERNAME1_WPIUSERNAME2.pdf for teams). Create a Zip file containing both your Python and PDF files, and then submit on Canvas.
