
Assignment 6
MET CS 777 Big Data Analytics GMM Topic Models Gibbs Sampling (20 points)
GitHub Classroom Invitation Link
https://classroom.github.com/a/_PNcma__
1 Description
In this assignment, you will be deriving, implementing, and then applying an MCMC algorithm for a text clustering model.
There are two sub-assignments. First, (1) do the math necessary to derive your algorithm and then write it up (as a PDF file or scanned pencil-and-paper work). Then, (2) implement the algorithm and use it to cluster a corpus of text documents. I'd suggest that you use Spark for your implementation (especially if you are comfortable with Spark at this point), at least to do the initial processing of the input data set, though this is certainly not required. It is also not required that you run your learning algorithm on the Amazon or Google cloud. If your implementation is fast enough to run on your laptop, that is fine. The data set we are using this time around is significantly smaller than the data sets we've been using for the last few assignments, so EC2 might not be necessary.
NOTE: it is fine to work in teams of two or three on this assignment if you would like, though this is optional.
2 Data
You will be dealing with the widely-used 20 newsgroups data set
http://qwone.com/~jason/20Newsgroups/
A newsgroup post is like an old-school blog post, and this data set has 19,997 such posts from 20 different categories, according to where the post was made. Examples of the 20 categories include: alt.atheism, comp.graphics, talk.politics.mideast, and talk.religion.misc.
The format of the text data set is exactly the same as the text format used for the documents in A5 and A4; the category name can be extracted from the name of the document.
For example, the document with identifier 20_newsgroups/comp.graphics/37261 is from the comp.graphics category. The document with the identifier 20_newsgroups/sci.med/59082 is from the sci.med category. We have prepared the data in a single-line format so that it can be read line by line. Each line is a document.
Direct Download link from AWS
https://s3.amazonaws.com/metcs777/20-news-same-line.txt

Table 1: Data set on Amazon AWS
Amazon AWS: 20 news data, same-line format (35.2 MB of text)
s3://metcs777/20-news-same-line.txt

Table 2: Data set on Google Cloud Storage
Google Cloud Storage: 20 news data, same-line format (35.2 MB of text)
gs://metcs777/20-news-same-line.txt
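
If you do use Spark for the initial processing, reading the corpus might look roughly like the sketch below. This is only a minimal PySpark sketch under assumptions: the exact single-line format is the one from A4/A5, and here I pretend the document identifier is the first whitespace-separated token on each line, so adjust parse_line to the real format.

from pyspark import SparkContext

# Minimal PySpark loading sketch. Assumption: each line holds one document and the
# document identifier (e.g., 20_newsgroups/comp.graphics/37261) can be pulled off
# the front of the line; adapt parse_line to the actual A4/A5 single-line format.
sc = SparkContext(appName="A6-GibbsTopicModel")

lines = sc.textFile("s3://metcs777/20-news-same-line.txt")

def parse_line(line):
    doc_id, _, text = line.partition(" ")          # hypothetical split
    category = doc_id.split("/")[1] if "/" in doc_id else "unknown"
    return (doc_id, category, text)

docs = lines.map(parse_line).cache()
print(docs.take(1))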
3 Assignment Tasks
There are four separate tasks that you need to complete to finish the assignment.
3.1 Task 1 (5 points)
First, your task is to derive and then write up an MCMC algorithm that will allow you to cluster the text documents; if everything works properly, the 20 clusters that you obtain by clustering the documents will roughly correspond to the 20 categories present in the data.
You'll want to use the example MCMC derivation for the mixture of Gaussians from the Learning Mixture Models lecture slides as the basis for your own derivation, though our model will be just a bit different. Also look at the example of how we derived the MCMC algorithm for learning LDA; we did that one in quite a bit of detail, though again, our model here is just a bit different. In your answer, you should give the algorithm in enough detail (including exact formulas) that someone could go off and implement it.
Of course, when you derive an MCMC algorithm for Bayesian machine learning, you always need a generative process that you are trying to reverse. Our generative process is going to look a lot like the generative process for the Gaussian mixture model, except that the data our generative process creates is a list of term count vectors, rather than multi-dimensional points as in a GMM. Thus, we are going to replace our mixture of Gaussians with a mixture of Multinomials.
Given k document clusters or categories, our generative process for n documents (where the number of words in document i is l_i) is as follows:
π ∼ Dirichlet(α)
for j = 1 to k do
    μ_j ∼ Dirichlet(β)
end for
for i = 1 to n do
    c_i ∼ Categorical(π)
    x_i ∼ Multinomial(μ_{c_i}, l_i)
end for
The result of this generative process is a list of n vectors, where x_i gives us the number of occurrences of each word in the ith document (that is, x_{i,j} is the number of occurrences of word j in document i).
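
As a sanity check on your derivation, the generative process above can be simulated directly with NumPy. The sketch below uses made-up values for k, n, the document lengths, and the Dirichlet hyperparameters α and β:

import numpy as np

# Simulate the generative process with made-up sizes and hyperparameters.
vocab_size, k, n = 20000, 20, 100
alpha = np.full(k, 0.1)            # Dirichlet prior on the mixing proportions pi
beta = np.full(vocab_size, 0.1)    # Dirichlet prior on each mu_j

pi = np.random.dirichlet(alpha)                  # pi ~ Dirichlet(alpha)
mu = np.random.dirichlet(beta, size=k)           # mu_j ~ Dirichlet(beta), one row per cluster
lengths = np.random.randint(50, 500, size=n)     # made-up document lengths l_i

c = np.random.choice(k, size=n, p=pi)            # c_i ~ Categorical(pi)
x = np.array([np.random.multinomial(lengths[i], mu[c[i]])   # x_i ~ Multinomial(mu_{c_i}, l_i)
              for i in range(n)])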
Given this, what I am looking for out of this first task is a precise description of the Gibbs sampler that is going to recover the vector π that gives us the prevalence of each of the categories, the k different μ_j probability vectors that tell us how prevalent each word is in each category, and (finally) the n different c_i values that tell us the category of each document.
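
For reference, here is the general shape the full conditionals take for this Dirichlet-Multinomial mixture. This is only a sketch under the conjugacy of the priors above; your Task 1 write-up should derive these (or your own variant), including the intermediate steps.

% Sketch of the full conditionals for an uncollapsed Gibbs sampler on the
% Dirichlet--Multinomial mixture; notation follows the generative process above.
\begin{align*}
\pi \mid c_{1:n} &\sim \mathrm{Dirichlet}\big(\alpha + m_1,\ \dots,\ \alpha + m_k\big),
  \qquad m_j = |\{i : c_i = j\}| \\[4pt]
\mu_j \mid c_{1:n}, x_{1:n} &\sim \mathrm{Dirichlet}\Big(\beta + \sum_{i:\, c_i = j} x_i\Big) \\[4pt]
\Pr[c_i = j \mid x_i, \pi, \mu] &\propto \pi_j \prod_{v} \mu_{j,v}^{\,x_{i,v}}
\end{align*}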

3.2 Task 2 (5 Points)
Next (this is very similar to A4 and A5), you will process the input corpus, build a dictionary, and transform each of the input documents into a count vector that has 20,000 entries. Note that last time we used TF vectors, but since here we have a Multinomial model, we'll be using count vectors. For example, for a particular document, the 177th entry in this vector is an integer that tells us the number of times the 177th most common word in the dictionary appeared in that document.
To get credit for this task, give the number of times that each of the 100 most common dictionary words appears in document 20_newsgroups/comp.graphics/37261.
Note-1: It is a good idea to map all words to lowercase so that the same word with different capitalization is matched and counted as the same word.
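
A rough PySpark sketch of this preprocessing step is shown below. It assumes the (doc_id, category, text) RDD docs and the SparkContext sc from the loading sketch in Section 2, and a simple letters-only tokenizer; your regex and cleanup choices may differ.

import re
import numpy as np

# Build the top-20,000 dictionary and one count vector per document.
# Assumes `docs` (doc_id, category, text) and `sc` from the earlier loading sketch.
regex = re.compile(r"[^a-zA-Z]")
words = docs.flatMap(lambda t: regex.sub(" ", t[2]).lower().split())

# Top 20,000 words by frequency; index 0 is the most common word in the corpus.
top_words = words.map(lambda w: (w, 1)) \
                 .reduceByKey(lambda a, b: a + b) \
                 .top(20000, key=lambda wc: wc[1])
dictionary = {w: i for i, (w, _) in enumerate(top_words)}
dict_bc = sc.broadcast(dictionary)

def to_count_vector(text):
    vec = np.zeros(20000)
    for w in regex.sub(" ", text).lower().split():
        idx = dict_bc.value.get(w)
        if idx is not None:
            vec[idx] += 1
    return vec

count_vectors = docs.map(lambda t: (t[0], to_count_vector(t[2]))).cache()

The Task 2 answer would then be the first 100 entries of the count vector for 20_newsgroups/comp.graphics/37261.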
3.3 Task 3 (5 Points)
Now you will actually implement the MCMC algorithm that you derived in Task 1 and run it for 200 iterations on the 20 newsgroups data, with the goal of learning the π, μ_j, and c_i values. Learn 20 different mixture components.
To get credit for this task, I am going to ask you to print out the 50 most important (highest probability) words in each of the 20 mixture components that you learned. Hopefully, these are somewhat meaningful!
Note-2: Every iteration has to do five things (a rough sketch of these steps follows the list):
1) A map, where you produce (c_i, x_i) pairs from each x_i.
2) A reduce, where you sum up all of the x_i's that have the same c_i value.
3) A reduce, where you count the number of x_i's assigned to each cluster.
4) Update the various μ_j's using the result of (2).
5) Update π using the result of (3).
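
One way those five steps could look in PySpark is sketched below. It assumes count_vectors and sc from the Task 2 sketch, and it substitutes a simple log-space stand-in where the supplied getProbs function would be called; treat it as a sketch of the structure, not the required implementation.

import numpy as np

# Rough sketch of one Gibbs iteration, organized as the five steps above.
# Assumes count_vectors (an RDD of (doc_id, x_i) NumPy count vectors) and sc.
k, vocab_size = 20, 20000
alpha, beta = 0.1, 0.1

# Initialize pi and mu once, before the first iteration.
pi = np.random.dirichlet(np.full(k, alpha))
mu = np.random.dirichlet(np.full(vocab_size, beta), size=k)

def sample_category(x, pi, mu):
    # Stand-in for the supplied getProbs function: posterior over categories in
    # log space (the multinomial coefficient is constant across categories).
    log_p = np.log(pi) + x @ np.log(mu).T
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return int(np.nonzero(np.random.multinomial(1, p))[0][0])

pi_bc, mu_bc = sc.broadcast(pi), sc.broadcast(mu)

# 1) Map: produce (c_i, x_i) pairs.
assigned = count_vectors.map(
    lambda t: (sample_category(t[1], pi_bc.value, mu_bc.value), t[1])).cache()

# 2) Reduce: sum all x_i's that share the same c_i.
word_sums = dict(assigned.reduceByKey(lambda a, b: a + b).collect())

# 3) Reduce: count the number of x_i's assigned to each cluster.
doc_counts = dict(assigned.map(lambda t: (t[0], 1)).reduceByKey(lambda a, b: a + b).collect())

# 4) Update each mu_j using the summed counts from (2).
mu = np.array([np.random.dirichlet(beta + word_sums.get(j, np.zeros(vocab_size)))
               for j in range(k)])

# 5) Update pi using the per-cluster document counts from (3).
pi = np.random.dirichlet([alpha + doc_counts.get(j, 0) for j in range(k)])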
Note-3: And of course, make sure to use NumPy arrays, and don't loop through the arrays; use bulk operations!
Note-4: It is a good idea to save your results every once in a while (e.g., every 10 iterations) so you don't lose anything.
Note-5: To help you out, I'm going to supply you with an implementation of a function that, given a document with word count vector x, will use NumPy to compute the vector of probabilities whose jth entry is Pr[document x came from category j | x, π, and all the μ's].
The code is quite efficient, in that it does not loop through all of the entries in x or in any particular μ_j. Instead, it uses efficient, bulk NumPy operations.
The Python code (the getProbs.py file) is a function that, given a particular data point x_i, as well as the set of μ_j vectors and π, computes and returns the p vector. So, if you use the getProbs Python function, all you need to do is take a sample from a Categorical distribution using the vector that getProbs() gives you. Using SciPy/NumPy, you would do this by sampling from a Multinomial with one trial; this will give you a NumPy array like [0, 0, 0, 0, 1, 0, 0], where the 1 indicates which category was returned. You can use numpy.nonzero to find the non-zero element efficiently.
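
That sampling step is just a couple of lines of NumPy. In the sketch below, the commented-out call is only a placeholder for however the supplied getProbs function is actually invoked, and the probability vector shown is made up.

import numpy as np

# p = getProbs(...)                      # placeholder: check getProbs.py for the real call
p = np.array([0.10, 0.05, 0.60, 0.25])   # stand-in probability vector over categories
draw = np.random.multinomial(1, p)       # one trial, e.g. array([0, 0, 1, 0])
category = int(np.nonzero(draw)[0][0])   # index of the single 1 = sampled category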
Note-6: Finally, let me make one last comment. One good reason that you might not be able to run this on your laptop is if you don't have enough RAM. Storing all 20,000 document vectors as NumPy arrays is going to take 3.2 GB of RAM (each document has 20,000 entries because the dictionary size is 20,000 words; 400M entries at 8 bytes each is 3.2 GB). If this exceeds your machine's RAM (or if you just can't get it to run fast enough), you can reduce the dictionary size to 10,000 words.

3.4 Task 4 (5 Points)
Finally, let's see how accurate your learning algorithm is. For each of the 20 mixture components that you learn, look at all of the documents that are assigned to it (the c_i values tell you this assignment). First, print out the number of documents assigned to the cluster. Then print out the top three real categories among the documents assigned to that cluster, along with the percentage of the cluster's documents that fall in each of those categories. For example, you might have a cluster that has 3,123 documents assigned to it, where the top three real categories are sci.space (34%), sci.med (12%), and sci.electronics (5%). Print out such information for each of the components that you learn. Ideally, each component will have a high prevalence of a particular category (or at least, a high prevalence of several related categories).
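
A small sketch of how this per-cluster summary could be produced, assuming you have collected (true_category, c_i) pairs to the driver after the last iteration (the pairs shown here are made up):

from collections import Counter

# Made-up (true_category, assigned_cluster) pairs; in practice, collect these
# from your RDD after the final Gibbs iteration.
pairs = [("sci.space", 4), ("sci.med", 4), ("sci.space", 4), ("comp.graphics", 0)]

by_cluster = {}
for true_cat, cluster in pairs:
    by_cluster.setdefault(cluster, []).append(true_cat)

for cluster, cats in sorted(by_cluster.items()):
    total = len(cats)
    print(f"Cluster {cluster}: {total} documents")
    for cat, cnt in Counter(cats).most_common(3):
        print(f"  {cat}: {100.0 * cnt / total:.1f}%")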
Finally, write a paragraph telling us if you think that your clustering algorithm worked!
3.5 Turnin
Create a single document that has the results for all of the tasks. For each task, copy and paste the result that your last Spark job wrote to Amazon S3. Also, for each Spark job you ran, include a screenshot of the Spark History.
Please zip up all of your code and your document (use .zip only, please!), or else attach each piece of code as well as your document to your submission individually.
Please have the latest version of your code on GitHub. Zip the files from GitHub and submit the zip as the latest version of your assignment work to Blackboard. We will consider the latest version on Blackboard, but it should exactly match your code on GitHub.
