**Principal Component Analysis.**In this exercise, you are provided part of the MNIST digit data set, containing 3,600 images of handwritten digits (data is provided in P1/X_train.csv). Each example is a 28 × 28 grayscale image, leading to 784 pixels. These images correspond to 3 different digits (’3’, ’6’, and ’9’), although you do not have the labels for them. You will perform PCA in an unsupervised manner.- To get familiar with what the data looks like, generate the last example (the 3
*,*600row of X_train) as an image. Provide your code and the resulting image. (^{th }*Hint:*You can reshape the data back into a 28 × 28 image using reshape in NumPy, and you can use imshow with cmap=’gray’ to see the picture. Make sure you are able to get the picture to face the right way, which should look like a ”9”.) - Implement PCA using SVD and run it on the data. You can use the function linalg.svd in NumPy. (
*Reminder:*Make sure you understand the NumPy function you used, especially whether the function standardizes the data set before running SVD. Make sure that the data is mean-centered.) Just like in the previous part, generate the first principal component vector as an image. Repeat for the second and third. Do these images make sense? Explain. - Create a 2D scatter plot showing all the data points projected in the first 2 PC dimensions, clearly labeling your axes. Make sure the data points are mean-centered before projecting onto the principal components. Now create a similar plot showing the same data points projected in the 100
and 101^{th }PC dimensions. What do you observe? Can you explain why you might expect this?^{st } - Graph the (in-sample) fractional reconstruction accuracy as a function of the number of principal components that are included. Also give a table listing the number of principal components needed to achieve each of 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10% reconstruction accuracy (i.e., to explain X% of the variance).
- Using the numbers you found in the previous part, reconstruct the 1000
, 2000^{th}, and 3000^{th}examples using each of the different numbers of principal components. (^{th }*Hint:*We have provided a function plot_reconstruction for your convenience. Make sure you read the documentation within this function to understand its input arguments.) For instance, the last example looks like:

- To get familiar with what the data looks like, generate the last example (the 3

*Python Instructions:*

*Please follow the instructions in the comments on input/output specifications. Be aware that the dimension of your vector variables can be a source of bugs.**Include all of the codes you used for experiments and plotting in*PS4-P1.py*and follow submission instructions when submitting.*

**EM Practice: Red and Blue Coins.**Your friend has two coins: a red coin and a blue coin, with biases*p*and_{r }*p*, respectively (i.e. the red coin comes up heads with probability_{b}*p*, and the blue coin does so with probability_{r}*p*). She also has an inherent preference_{b}*π*for the red coin. She conducts a sequence of*m*coin tosses: for each toss, she first picks either the red coin with probability*π*or the blue coin with probability 1−*π*, and then tosses the corresponding coin; the process for each toss is carried out independently of all other tosses. You don’t know which coin was used on each toss; all you are told are the outcomes of the*m*tosses (heads or tails). In particular, for each toss*i*, define a random variable*X*as_{i }

(

1 if the *i*-th toss results in heads

*X _{i }*=

Then the data you see are the values *x*_{1}*,…,x _{m }*taken by these

*m*random variables. Based on this data, you want to estimate the parameters

*θ*= (

*π,p*). To help with this, for each toss

_{r},p_{b}*i*, define a latent (unobserved) random variable

*Z*as follows:

_{i }(

- if the
*i*-th toss used the red coin*Z*=_{i }

0 otherwise.

- Let
*X*be a random variable denoting the outcome of a coin toss according to the process described above, and let*Z*be the corresponding latent random variable indicating which coin was used, also as described above (both*X*and*Z*take values in {0*,*1} as above). Write an expression for the joint distribution of*X*and*Z*. Give your answer in the form

*p*(*x,z*; *θ*) = *.*

- Write an expression for the complete-data log-likelihood, ln
- Suppose you knew the values
*z*taken by the latent variables_{i }*Z*. What would be the maximumlikelihood parameter estimates_{i}*θ*_{b}? Give expressions for*π*_{b},*p*_{b}*r*, and*p*_{b}*b*(in terms of*x*and_{i }*z*). Show your calculations._{i} - In the absence of knowledge of
*z*, one possibility for estimating_{i}*θ*is to use the EM algorithm. Recall that the algorithm starts with some initial parameter estimates*θ*^{0}, and then on each iteration*t*, performs an E-step followed by an M-step. Let*θ*denote the parameter estimates at the start of iteration^{t }*t*. In the E-step, for each toss*i*, the algorithm requires computing the posterior distribution of the latent variable*Z*under the current parameters_{i }*θ*. Calculate the posterior probability^{t}**P**(*Z*= 1|_{i }*X*=_{i }*x*;_{i}*θ*).^{t} - For each toss
*i*, denote the posterior probability computed in part (d) above by*γ*(so that_{i}^{t }*γ*=_{i}^{t }**P**(*Z*= 1|_{i }*X*=_{i }*x*;_{i}*θ*)). Then the expected complete-data log-likelihood with respect to these posterior distributions is^{t}

* .*

The M-step of the EM algorithm requires finding parameters *θ ^{t}*

^{+1 }that maximize this expected complete-data log-likelihood. Determine the updated parameters

*θ*

^{t}^{+1}. Give expressions for

*π*

^{t}^{+1}, , and (in terms of

*x*and

_{i }*γ*). Show your calculations.

_{i}^{t}**Programming Exercise: Gaussian Mixture Models.**Write a piece of Python code to implement the EM algorithm for learning a Gaussian mixture model (GMM) given a data set**X**, where**X**is an*m*×*d*matrix (*m*instances, each of dimension*d*), and a target number of Gaussians*K*: your program should take the training data**X**and the target number of Gaussians*K*as input and should output estimated parameters of a mixture of*K*Gaussians (specifically, it should output mixing coefficients*π*b ∈ ∆, mean vectors_{K}^{µ}_{b}1*,…,*^{µ}_{b}*K*^{∈ }R, and covariance matrices^{d}**Σ**_{b1}*,…,***Σ**_{bK }^{∈ }R^{d}^{×d}). For this problem, you are provided a 2-dimensional data set generated from a mixture of (3) Gaussians. The data set is divided into training and test sets; the training set has 480 data points and the test set has 120 data points. The goal is to learn a GMM from the training data; the quality of the learned GMM will be measured via its log-likelihood on the test data.

**EM initialization and stopping criterion: **Initialize the mixing coefficients to a uniform distribution, and each of the covariance matrices to be the *d *× *d *identity matrix: *π*^{0 }= (1*/K,…,*1*/K*)^{>}, and

. Initializations for the means will be provided to you (if you like, you can write

your program to take the mean initialization as an additional input). For the stopping criterion, on each iteration *t*, keep track of the (incomplete-data) log-likelihood on the training data, ); stop when either the change in log-likelihood,), becomes smaller than 10^{−6}, or when the number of iterations reaches 1000 (whichever occurs earlier; here *θ ^{t }*denotes the parameter estimates on iteration

*t*).

**Known***K***.**For this part, assume you are given*K*= 3.**Learning curve.**Use your implementation of EM for GMMs to learn a mixture of 3 Gaussians from increasing fractions of the training data (10% of the training data, then 20% of the training data, then 30% and so on upto 100%). You must use the subsets provided in the folder P3/TrainSubsets; in each case, to initialize the means of the Gaussians, use the initializations provided in the folder P3/MeanInitialization. In each case, calculate the**normalized**(incomplete-data) log-likelihood of the learned model on the training data, as well as the normalized log-likelihood on the test data (here normalizing means dividing the log-likelihood by the number of data points). In a single plot, give curves showing the normalized train and test log-likelihoods (on the*y*-axis) as a function of the fraction of data used for training (on the*x*-axis). (Note: the training log-likelihood should be calculated only on the subset of examples used for training, not on all the training examples available in the given data set.)**Analysis of learned models.**For each of the learned models (corresponding to the different subsets of the training data), show a plot of the Gaussians in the learned GMM overlaid on the test data. What do you observe? For the final model (learned from 100% of the training data), also write down all the parameters of the learned GMM, together with the (normalized) train and test log-likelihoods.

*Hints:*

*You may use the function we provide in*plot multiple contour plots*to assist in plotting the density curves. Be sure to carefully read the function header, so you know how to structure your inputs.*

**Unknown***K***.**Now suppose you do not know the right number of Gaussians in the mixture model to be learned. Select the number of Gaussians*K*from the range {1*,…,*5} using 5-fold crossvalidation on the full training data (use the folds provided in the folder P3/CrossValidation). For each*K*, to initialize the means of the*K*Gaussians, use the initializations provided in the folder P3/MeanInitialization. In a single plot, draw 3 curves showing the (normalized) train, test, and cross-validation log-likelihoods (on the*y*-axis) as a function of number of Gaussians*K*(on the*x*-axis); Write down the chosen value of*K*(with the highest cross-validation log-likelihood).

*Python Instructions:*

* *

*Please follow the instructions in the comments on input/output specifications. Be aware that the dimension of your vector variables can be a source of bugs.**Implement the*E step*,*M step*, and*compute llh*methods in*py*. You may use the function*scipy.stats.multivariate normal*in your implementation.**Include all of the codes you used for experiments and plotting in*PS4-P3.py*and follow submission instructions when submitting.*

- (35 points)
**Hidden Markov Models.**On any given day, Alice is in one of two states: happy or sad. You do not know her internal state, but get to observe her activities in the evening. Each evening, she either sings, goes for a walk, or watches TV.

Alice’s state on any day is random. Her state *Z*_{1 }on day 1 is equally likely to be happy or sad:

*P*(*Z*_{1 }= happy) = 1*/*2*.*

Given her state *Z _{t }*on day

*t*, her state

*Z*

_{t}_{+1 }on the next day is governed by the following probabilities (and is conditionally independent of her previous states and activities):

*P*(*Z _{t}*

_{+1 }= happy |

*Z*= happy) = 4

_{t }*/*5

*P*(

*Z*

_{t}_{+1 }= happy |

*Z*= sad) = 1

_{t }*/*2

*.*

Alice’s activities are also random. Her activities vary based on her state; given her state *Z _{t }*on day

*t*, her activity

*X*on that day is governed by the following probabilities (and is conditionally independent of everything else):

_{t }*P*(*X _{t }*= sing |

*Z*= happy) = 5

_{t }*/*10

*P*(

*X*= sing |

_{t }*Z*= sad) = 1

_{t }*/*10

*P*(*X _{t }*= walk|

*Z*= happy) = 3

_{t }*/*10

*P*(

*X*= walk|

_{t }*Z*= sad) = 2

_{t }*/*10

*P*(

*X*= TV |

_{t }*Z*= happy) = 2

_{t }*/*10

*P*(

*X*= TV |

_{t }*Z*= sad) = 7

_{t }*/*10

*.*

- (15 points) Suppose you observe Alice singing on day 1 and watching TV on day 2, i.e. you observe
*x*_{1:2 }= (sing*,*TV). Find the joint probability of this observation sequence together with each possible hidden state sequence that could be associated with it, i.e. find the four probabilities below. Show your calculations.

*P X*_{1:2 }= (sing*,*TV)*,Z*_{1:2 }= (happy*,*happy) *P X*_{1:2 }= (sing*,*TV)*,Z*_{1:2 }= (happy*,*

) *P X*_{1:2 }= (sing*,*TV)*,Z*_{1:2 }= (sad*,*sad)

Based on these probabilities, what is the most likely hidden state sequence *z*_{1:2}? What is the individually most likely hidden state on day 2 ?

- Write a small piece of Python code to implement the Viterbi algorithm, and use this to calculate both the most likely hidden state sequence
*z*_{1:10 }and the corresponding maximal joint probability*p*(*x*_{1:10}*,z*_{1:10}) for each of the following observation sequences:

*x*_{1:10 }= (sing*,*walk*,*TV*,*sing*,*walk*,*TV*,*sing*,*walk*,*TV*,*sing) *x*_{1:10 }= (sing*,*sing*,*sing*,*TV*,*TV*,*TV*,*TV*,*TV*,*TV*,*TV) *x*_{1:10 }= (TV*,*walk*,*TV*,*sing*,*sing*,*sing*,*walk*,*TV*,*sing*,*sing)

In your code, rather than multiplying probabilities (which can lead to underflow), you may find it helpful to add their logarithms (and later exponentiate to obtain the final result). Include a snippet of your code in your L^{A}TEX submission.

*Python Instructions:*

*Please follow the instructions in the comments on input/output specifications. Be aware that the dimension of your vector variables can be a source of bugs.**Include your implementation of the Viterbi algorithm*viterbi*and all of the codes you used for experiments and plotting in*PS4-P4.py*and follow submission instructions when submitting.*

**Bayesian Networks.**Consider the Bayesian network over 8 random variables*X*_{1},*X*_{2},*X*_{3},*X*_{4},*X*_{5},*X*_{6},*X*_{7},*X*_{8 }shown below (assume for simplicity that each random variable takes 2 possible values):

- Write an expression for the joint probability mass function
*p*(*x*_{1}*,x*_{2}*,x*_{3}*,x*_{4}*,x*_{5}*,x*_{6}*,x*_{7}*,x*_{8}) that makes the same (conditional) independence assumptions as the Bayesian network above. - Consider a joint probability distribution satisfying the following factorization:

*p*(*x*_{1}*,x*_{2}*,x*_{3}*,x*_{4}*,x*_{5}*,x*_{6}*,x*_{7}*,x*_{8}) = *p*(*x*_{1})*p*(*x*_{2})*p*(*x*_{3})*p*(*x*_{4 }|*x*_{1})*p*(*x*_{5 }|*x*_{2}*,x*_{3})*p*(*x*_{6 }|*x*_{3}*,x*_{5})*p*(*x*_{7})*p*(*x*_{8 }|*x*_{5})*.*

Is this distribution included in the class of joint probability distributions that can be represented by the Bayesian network above? Briefly explain your answer.

- If the edge from
*X*_{5 }to*X*_{8 }is removed from the above network, will the class of joint probability distributions that can be represented by the resulting Bayesian network be smaller or larger than that associated with the original network? Briefly explain your answer. - Given the above figure, determine whether each of the following is true or false. Briefly justify your answer.

*X*_{4 }⊥*X*_{5 }ii.*X*_{2 }⊥*X*_{6}

iii. *X*_{3 }⊥ *X*_{4 }|*X*_{8 }iv. *X*_{2 }⊥ *X*_{8 }|*X*_{5}

## Reviews

There are no reviews yet.