5/5 - (1 vote)

1 Written Questions

Name: [Solved] 10418 HW4- LDA
Brand: Assignment Chef
SKU: [Solved] 10418 HW4- LDA
Price: 25 USD
Availability: InStock
Rating: 5 (1 reviews)

Answer the following questions in the template provided. Then upload your solutions to Gradescope. You may use L^AT_EX or print the template and hand-write your answers then scan it in. Failure to use the template may result in a penalty. There are 56 points and 10 questions.

1.1 Markov Chain Monte Carlo Methods

In class, we studied two Monte Carlo estimation methods: rejection sampling and importance sampling. Given a proposal distribution Q(x), answer the following questions:
- (1 point) If sampling from Q(x) is computationally expensive, which of the following methods is likely to be more efficient?

Rejection Sampling

Importance Sampling

Both are equally inefficient

(1 point) If Q(x) is high-dimensional, which of the following methods is more efficient?

Rejection Sampling

Importance Sampling

Both are equally inefficient

(1 point) For high-dimensional distributions, MCMC methods such as Metropolis Hastings are more efficient than rejection sampling and importance sampling.

True

False

(1 point) For low-dimensional distributions, MCMC methods such as Metropolis Hastings produce better sammples than rejection sampling and importance sampling.

True

False

(2 points) Suppose you are using MCMC methods to sample from a distribution with multiple modes.

Briefly explain what complications may arise while using MCMC.

1.2 Metropolis Hastings

(4 points) Recall that conjugate priors lead to the posterior belonging to the same probability distribution family as the prior. Show that the beta distribution is the conjugate prior for the binomial distribution.
Show that
- (5 points) The Metropolis Hastings algorithm satisfies detailed balance.
- (5 points) A variant of the Metropolis Hastings algorithm with the acceptance probability defined without the minimum, ie acceptance probability, wouldnt satisfy the detailed balance.
- (5 points) A sampler which resamples each variable conditioned only on the parents of a Bayesian Network wouldnt satisfy the detailed balance.

1.3 Gibbs Sampling and Topic Modeling

(3 points) Consider the following graphical model:

Figure 1.1: Bayesian Network Structure

Assume that we run Gibbs sampling on this graph, write the conditional probabilities which will be computed by the sampler for each variable. The variables are sampled in the order ABCDFE

Now lets run a collapsed Gibbs sampler for a topic model on a toy dataset. We will run sampling for a regular LDA (Latent Dirichlet Allocation) model. Suppose our dataset consisting of 4 documents is as follows:

X = {fizz fizz buzz fizz fizz buzz, fizz buzz fizz buzz, foo bar foo bar foo bar, fizz buzz foo bar}

where x_mnrefers to the m^thword in document n. The vocabulary of this dataset consists of 4 unique words. Assume that we run our LDA model with 2 topics, prior parameters set to = 0.1, = 0.1 and start with the following sample of topic assignments:

Z = {0 0 1 0 1 1,

1 0 0 1,

0 1 0 1 1 0, 0 1 0 1}

where z_mnrefers to the topic assigned to the m^thword in document n. Note that and are the prior parameters for the document-topic and word-topic distributions respectively.

(1 point) Compute the word-topic counts table with this assignment

	0	1
fizz
buzz
foo
bar

(1 point) Compute the document-topic counts table with this assignment

	0	1
0
1
2
3

(2 points) Suppose we want to sample a new topic assignment z₀₀for word x₀₀. Compute the full conditional probability distribution from which this assignment will be sampled under the collapsed Gibbs sampler.

	0	1

(2 points) Assume that the new topic assignment is z₀₀= 1. What is the updated word-topic distribution?

	0	1
fizz
buzz
foo
bar

1.4 Empirical Questions

The following questions should be completed after you work through the programming portion of this assignment (Section 2).

(10 points) After training your model, report the top 10 most probable words for 5 of your favorite discovered topics. Note that here most probable is defined as the word whose embedding maximizes the t distribution probability for a given class.

Rank	Topic 0	Topic 1	Topic 2	Topic 3	Topic 4
1
2
3
4
5
6
7
8
9
10

(10 points) Plot the log-likelihood of the data defined by equation 2 over training for 50 iterations. Record this probability every 10 iterations. Note: Your plot must be machine generated.
(1 point) Multiple Choice: Did you correctly submit your code to Autolab?

Yes

(1 point) Numerical answer: How many hours did you spend on this assignment?.

1.6 Collaboration Policy

After you have completed all other components of this assignment, report your answers to the collaboration policy questions detailed in the Academic Integrity Policies for this course.

Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details including names of people who helped you and the exact nature of help you received.
Did you give any help whatsoever to anyone in solving this assignment? If so, include full details including names of people you helped and the exact nature of help you offered.
Did you find or come across code that implements any part of this assignment? If so, include full details including the source of the code and how you used it in the assignment.

2 Programming [ pts]

2.1 Task Background

Topic modeling is the task of finding latent semantic categories that group words together called topics. The typical approach to this task is to build a model by telling a generative story: For each document we assign a probability of picking a set of topics and for each topic I have some probability of picking any given word. The decision of what these probabilities are and how our generative story plays out determines not just how successful our model is, but also how computationally feasible inference is.

Because weve seen the power of vector representations of words in previous assignments, lets assume that we want to model words as vectors in R^M. A reasonable assumption is that words that belong to similar topics are close to each other in embedding space. One way to encode this intuition is to say that given a topic k, a word embedding v is drawn from the distribution N(_k,I). This implies that if we can discover these _ks we can discover a set of topics that explain our observed words. In order to make this practical, well need to flesh out our generative story.

Suppose were given a corpus of D documents and a fixed number of topics K. Our model will look like this:

For topic k = 1 to K:

(a) Draw topic mean

For each document d in corpus D:
- Draw topic distribution _d Dir()
- For each word index n from 1 to N_d

Draw a topic z_n Categorical(_d) ii. Draw v_d,n N(_z_n,I)

In our case, we will fix K = 5, |D| = 1208, and our embeddings live in R⁵⁰.

We will see in section 2.3 how this can be translated into a practical sampling algorithm.

Data

For this task weve provided you with a selection of news articles in the numpy file news_corpus.npy. This array is |D| |V | where V is our vocabulary. Each i,j entry corresponds to the number of instances of word j in document i.

In order to save you the trouble of training word embeddings before you can even start the real task, we have also provided 50 dimensional GloVe embeddings in the numpy file named news_embeddings.npy along with a dictionary to map between words to indices in the pickle file named mews_word2index.pkl. The embedding array is a |V | 50 array.

Gibbs Sampler

Given the generative model described earlier wed like to know the posterior distribution over our parameters. As is the case for almost all interesting models, we dont have an analytical form for this. Well make do by approximating the posterior with a collapsed Gibbs sampler. Itll be a Gibbs sampler because we will iteratively sample word-topic assignments conditioned on all over assignments and parameters. Its said to be collapsed because we will have integrated over nuisance parameters to provide a simpler distribution to sample from.

(2.1)

Equation 2.1 shows the full conditional on which we can build a Gibbs sampler. This equation has a lot to unpack.

Prior Parameters

In the above equation, our prior parameters were summarized as the tuple = (,,,) along with the vector of parameters . These are the same parameters as those in the introductory section. We will treat the hyperparameter as _i= 10 for all classes i. We will also set = 0.01, = I_M, and = M.

Data Statistics

N_k= #{words assigned to topic k across all documents}

Updated Parameters

_k= + N_k_k= + N_k

Important Distributions

The most important distribution for you to be familiar with for this assignment is the multivariate t distribution given parameters (_k,_k,⁰). The final parameter is the degrees of freedom for the distribution and is denoted in the subscript of the distribution t0(). We have provided a function in the file utils.py called multivariate_t_distribution that will return log-probabilities according to this model.

2.4 Implementation Details

Now that we have our notation sorted out, we can ensure that we understand the steps to implement our sampler.

Implementation Outline

Load in the provided embeddings and the text of our corpus.
Randomly initialize a dictionary that places each word in each document into a given topic.
Calculate statistics and update our posterior parameters and calculate the full conditional in equation 1 for each word and each possible topic assignment. Note: This will require adjusting the statistics to not include the current word.
Normalize the above distributions and sample from a multinomial over the k topics with the posterior distribution.
Reassign words to topics based on the above samples.
Go back to step 4 and repeat until convergence.

Numerical Issues

For numerical stability, you must calculate log-probabilities and only convert to regular probability distribution before the normalization step.

One way to increase stability is to keep track of the maximum log-probability encountered while iterating over topics in step 5. Then before normalization, subtract of this maximum log-probability from each logprobability.

2.5 Evaluation

We will track the joint probability of z_k,v_d,i,_k,_d, for each word i, document d, and currently assigned category k. This will be calculate using the following equation.

p(z_k,v_d,i,_k,_d) = p(v_d,i|z_k,_k)p(z_k|_k)p(_k)p(_d) (2.2)

Please refer to the generative model for Gaussian-LDA to define each of these probabilities. Also note that because were using a collapsed Gibbs sampler, we will not have estimate one of these parameters before.

2.6 Autolab Submission [35 pts]

You must submit a .tar file named gaussianlda.tar containing gaussianlda.py, which contains all of your code.

You can create that file by running:

tar -cvf gaussianlda.tar gaussianlda.py

from the directory containing your code.

Some additional tips: DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf. DO NOT put the above files in a folder and then tar the folder. Autolab is case sensitive, so observe that all your files should be named in lowercase. You must submit this file to the corresponding homework link on Autolab.

Your code will not be autograded on Autolab. Instead, we will grade your code by hand; that is, we will read through your code in order to grade it. As such, please carefully identify major sections of the code via comments.

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Whatsapp Us

[Solved] 10418 HW4- LDA

1 Written Questions

2 Programming [ pts]

Prior Parameters

Data Statistics

Updated Parameters

Important Distributions

Implementation Outline

Numerical Issues

Reviews

Related products

[Solved] 10418 HW3-ImageSeg

[Solved] 10418 HW1-Acoustic Speech Recog

[Solved] 10418 HW2-Belief Prop

[Solved] 10418 HW5- Variational Inference