2. (35 points) Featured activity: Analysis of Latin documents for word co-occurrence (Days 3 and 4: 3-4 hours each day)
Write a word co-occurrence program in Spark and test it for n=2 (2-grams) on just 1 or 2 documents. Then scale it up to as many documents as possible.
I have provided two classical texts and a lemmatization file that maps each word form to its standard (normal) form. You will make several passes through the documents. The documents needed for this process are attached.
Pass 1: Lemmatization using the lemmas.csv file.
Pass 2: Identify the words in the texts for two documents.
Pass 3: Repeat this for multiple documents.
Here is a rough algorithm (non-MapReduce version):
for each word in the text
    normalize the word spelling by replacing j with i and v with u throughout
    check lemmatizer for the normalized spelling of the word
    if the word appears in the lemmatizer
        obtain the list of lemmas for this word
        for each lemma, create a key/value pair from the lemma and the location where the word was found
    else
        create a key/value pair from the normalized spelling and the location where the word was found
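The rough algorithm above can be sketched in plain Python before porting it to Spark. This is a minimal sketch under several assumptions that the assignment does not state: each row of lemmas.csv is taken to have the layout `form,lemma1,lemma2,...`, words are lowercased before normalization, the final step is treated as the fallback for words missing from the lemmatizer, and the location string `doc1:1` is a hypothetical placeholder. The function names are illustrative only.

```python
import csv
import io
import re


def normalize(word):
    """Lowercase (an assumption), then replace j with i and v with u throughout."""
    return word.lower().replace("j", "i").replace("v", "u")


def load_lemmas(csv_file):
    """Build {normalized form: [lemmas]}. Assumed row layout: form,lemma1,lemma2,..."""
    lemmas = {}
    for row in csv.reader(csv_file):
        if not row:
            continue
        form = normalize(row[0].strip())
        found = [cell.strip() for cell in row[1:] if cell.strip()]
        if found:
            lemmas.setdefault(form, []).extend(found)
    return lemmas


def emit_pairs(text, location, lemmas):
    """Return (key, location) pairs for every word in one piece of text."""
    pairs = []
    for raw in re.findall(r"[A-Za-z]+", text):
        word = normalize(raw)
        if word in lemmas:
            # One pair per lemma listed for this word.
            for lemma in lemmas[word]:
                pairs.append((lemma, location))
        else:
            # Fallback: the normalized spelling itself becomes the key.
            pairs.append((word, location))
    return pairs


# Tiny in-memory stand-in for lemmas.csv (hypothetical contents).
sample_csv = io.StringIO("uirum,uir\narma,arma,armo\n")
lemmas = load_lemmas(sample_csv)
pairs = emit_pairs("Arma virumque cano", "doc1:1", lemmas)
print(pairs)
```

In a Spark job, a function shaped like `emit_pairs` could be applied to each line with `flatMap`, with the lemma table shared across workers; that wiring is left out here to keep the sketch self-contained.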
Start with word co-occurrence for 2-grams (two words co-occurring), then extend it to n=3 (three words co-occurring). Create a table of n-grams and their locations. Discuss the results and plot the performance and scalability.
n-gram (n=2): {wordx, wordy}
n-gram (n=3): {wordx, wordy, wordz}
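One way to build such a table of n-grams and locations is sketched below in plain Python. It assumes that "co-occurring" means appearing at the same location (for example, the same line), that word order inside an n-gram does not matter, and that repeated words at one location count once. The assignment does not fix any of these choices, and the location strings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrences(lines, n):
    """Map each n-gram (a sorted tuple of n distinct words) to its locations.

    `lines` is an iterable of (location, list_of_words) pairs.
    """
    table = defaultdict(list)
    for location, words in lines:
        unique = sorted(set(words))  # order-independent, duplicates dropped
        for ngram in combinations(unique, n):
            table[ngram].append(location)
    return dict(table)


# Hypothetical lemmatized input: two lines of one document.
lines = [
    ("doc1:1", ["arma", "uir", "cano"]),
    ("doc1:2", ["arma", "cano"]),
]
pairs = cooccurrences(lines, 2)
triples = cooccurrences(lines, 3)

for ngram, locations in sorted(pairs.items()):
    print(ngram, locations)
```

The number of combinations per location grows quickly with n and with line length, which is one reason the n=3 run is expected to be noticeably heavier than the n=2 run.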

In this activity you are required to “scale up” the word co-occurrence by increasing the number of documents processed from 2 to n. Add the documents incrementally rather than all at once, record the performance of the Apache Spark infrastructure as the input grows, and plot it. A table similar to the above will have thousands of entries.
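A small timing harness can collect the numbers for the performance plot. The sketch below adds documents one at a time and times a processing function on the growing set. `measure_scaling` and `count_words` are hypothetical names, and `count_words` is only a stand-in for the real co-occurrence job, which in practice would be the Spark pipeline.

```python
import time


def measure_scaling(documents, process):
    """Add documents one at a time and time `process` on the growing set.

    Returns a list of (number_of_documents, elapsed_seconds) tuples.
    """
    timings = []
    loaded = []
    for doc in documents:
        loaded.append(doc)
        start = time.perf_counter()
        process(loaded)
        elapsed = time.perf_counter() - start
        timings.append((len(loaded), elapsed))
    return timings


def count_words(docs):
    """Hypothetical stand-in for the real co-occurrence job."""
    return sum(len(d.split()) for d in docs)


docs = ["arma uirumque cano", "bella per emathios", "gallia est omnis diuisa"]
timings = measure_scaling(docs, count_words)
for count, seconds in timings:
    print(f"{count} document(s): {seconds:.6f} s")
```

The resulting (document count, seconds) pairs can be written to a file and plotted with any charting tool to show how run time grows with input size.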


