2. (35 points) Featured activity: Analysis of Latin documents for word co-occurrence (Days 3-4: 3-4 hours each day)
Write a word co-occurrence program in Spark and test it for n = 2 (bigrams) on just one or two documents. Then scale it up to as many documents as possible.
I have provided two classical texts and a lemmatization file that converts words from their inflected forms to a standard (normal) form. You will make several passes through the documents; the documents needed for this process are attached.
Pass 1: Lemmatization using the lemmas.csv file
Pass 2: Identify the co-occurring words (n-grams) in the texts and the locations where they occur.
Pass 3: Repeat this for multiple documents.
Here is a rough algorithm (non-MapReduce version):
for each word in the text
    normalize the word spelling by replacing j with i and v with u throughout
    check the lemmatizer for the normalized spelling of the word
    if the word appears in the lemmatizer
        obtain the list of lemmas for this word
        for each lemma, create a key/value pair from the lemma and the location where the word was found
    else
        create a key/value pair from the normalized spelling and the location where the word was found
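As one possible realization of this pass, here is a minimal PySpark sketch. The file names (lemmas.csv, document1.txt), the assumption that lemmas.csv holds one "word_form,lemma" pair per line, and the simplified location encoding are illustrative; adjust them to the actual files provided.

    from pyspark import SparkContext

    sc = SparkContext(appName="LatinCooccurrence")

    # Assumed lemmas.csv format: one "word_form,lemma" pair per line; a form that
    # maps to several lemmas simply appears on several lines.
    lemma_map = (sc.textFile("lemmas.csv")
                   .map(lambda line: line.strip().lower().split(","))
                   .filter(lambda parts: len(parts) >= 2)
                   .map(lambda parts: (parts[0], parts[1]))
                   .groupByKey()
                   .mapValues(list)
                   .collectAsMap())
    lemma_bc = sc.broadcast(lemma_map)   # ship the lemmatizer to every worker

    def normalize(word):
        # Keep letters only, lower-case, then apply the j -> i and v -> u spelling rule.
        w = "".join(ch for ch in word.lower() if ch.isalpha())
        return w.replace("j", "i").replace("v", "u")

    def lemmatize(word_loc):
        # word_loc = (word, location); emit one (lemma, location) pair per lemma,
        # falling back to the normalized spelling when the lemmatizer has no entry.
        word, location = word_loc
        norm = normalize(word)
        return [(lemma, location) for lemma in lemma_bc.value.get(norm, [norm])]

    # Read one document and tag every word with a simplified location
    # (document.line#.position#, standing in for Document.chap#.line#.position#).
    doc = "document1.txt"
    words_with_loc = (sc.textFile(doc)
                        .zipWithIndex()    # (line_text, line_number)
                        .flatMap(lambda li: [(w, "%s.%d.%d" % (doc, li[1], p))
                                             for p, w in enumerate(li[0].split())]))

    lemma_locations = words_with_loc.flatMap(lemmatize)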
Starting from word co-occurrence with 2-grams (two words co-occurring), increase the co-occurrence to n = 3 (three words co-occurring). Create a table of n-grams and their locations, as in the examples below. Discuss the results and plot the performance and scalability.
n-gram (n = 2)             Location
{wordx, wordy}             Document.chap#.line#.position#
etc.

n-gram (n = 3)             Location
{wordx, wordy, wordz}      Document.chap#.line#.position#
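As a sketch of how rows like those above might be produced: assuming an RDD lemmatized_lines of (line identifier, ordered list of lemmas) pairs built from the pass above, consecutive lemmas can be grouped into n-grams. The helper name make_ngrams and the variable lemmatized_lines are illustrative, not part of any provided code.

    def make_ngrams(line_record, n=2):
        # line_record = (line_id, [lemma0, lemma1, ...]); emit one
        # (n-gram tuple, line_id.position#) pair for each window of n consecutive lemmas.
        line_id, lemmas = line_record
        return [(tuple(lemmas[p:p + n]), "%s.%d" % (line_id, p))
                for p in range(len(lemmas) - n + 1)]

    # lemmatized_lines is assumed to hold (document.line#, [lemmas in order]) records.
    bigram_table  = lemmatized_lines.flatMap(lambda rec: make_ngrams(rec, n=2))
    trigram_table = lemmatized_lines.flatMap(lambda rec: make_ngrams(rec, n=3))

    # Collect every location of each n-gram, mirroring the tables above.
    bigram_locations = bigram_table.groupByKey().mapValues(list)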
In this activity you are required to scale up the word co-occurrence by increasing the number of documents processed from 2 to n. Record the performance of the Apache Spark infrastructure and plot it; a table like the ones above will then have thousands of entries. Add the documents incrementally, not all at once.
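One possible way to record Spark's performance as documents are added incrementally is to re-run the job on a growing subset of the files and time each run. The file names below are placeholders, and the line count inside the loop stands in for the full lemmatization and n-gram pipeline sketched above.

    import time

    document_files = ["document1.txt", "document2.txt"]   # extend as more texts are added

    timings = []
    for k in range(1, len(document_files) + 1):
        paths = ",".join(document_files[:k])   # sc.textFile accepts a comma-separated path list
        start = time.time()
        # Replace this line count with the full lemmatization + n-gram pipeline;
        # calling count() forces the job to execute so the elapsed time is meaningful.
        records = sc.textFile(paths).count()
        timings.append((k, time.time() - start))

    for k, seconds in timings:
        print("documents = %d, runtime = %.2f s" % (k, seconds))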