2. (35 points) Featured activity: Analysis of Latin documents for word co-occurrence (Days 3-4: 3-4 hours each day)
Write a word co-occurrence program in Spark and test it for n = 2 (bigrams) on just one or two documents. Then scale it up to as many documents as possible.
I have provided two classical texts and a lemmatization file that converts words from their inflected forms to a standard (normal) form. You will make several passes through the documents; the documents needed for this process are attached.
Pass 1: Lemmatization using the lemmas.csv file
Pass 2: Identify the co-occurring words (n-grams) in the texts and the locations where they occur.
Pass 3: Repeat this for multiple documents.
Here is a rough algorithm (non-MapReduce version):
for each word in the text
    normalize the word spelling by replacing j with i and v with u throughout
    check the lemmatizer for the normalized spelling of the word
    if the word appears in the lemmatizer
        obtain the list of lemmas for this word
        for each lemma, create a key/value pair from the lemma and the location where the word was found
    else
        create a key/value pair from the normalized spelling and the location where the word was found
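As one possible realization of this pass, here is a minimal PySpark sketch. The file names (lemmas.csv, document1.txt), the assumption that lemmas.csv holds one "word_form,lemma" pair per line, and the simplified location encoding are illustrative; adjust them to the actual files provided.

    from pyspark import SparkContext

    sc = SparkContext(appName="LatinCooccurrence")

    # Assumed lemmas.csv format: one "word_form,lemma" pair per line; a form that
    # maps to several lemmas simply appears on several lines.
    lemma_map = (sc.textFile("lemmas.csv")
                   .map(lambda line: line.strip().lower().split(","))
                   .filter(lambda parts: len(parts) >= 2)
                   .map(lambda parts: (parts[0], parts[1]))
                   .groupByKey()
                   .mapValues(list)
                   .collectAsMap())
    lemma_bc = sc.broadcast(lemma_map)   # ship the lemmatizer to every worker

    def normalize(word):
        # Keep letters only, lower-case, then apply the j -> i and v -> u spelling rule.
        w = "".join(ch for ch in word.lower() if ch.isalpha())
        return w.replace("j", "i").replace("v", "u")

    def lemmatize(word_loc):
        # word_loc = (word, location); emit one (lemma, location) pair per lemma,
        # falling back to the normalized spelling when the lemmatizer has no entry.
        word, location = word_loc
        norm = normalize(word)
        return [(lemma, location) for lemma in lemma_bc.value.get(norm, [norm])]

    # Read one document and tag every word with a simplified location
    # (document.line#.position#, standing in for Document.chap#.line#.position#).
    doc = "document1.txt"
    words_with_loc = (sc.textFile(doc)
                        .zipWithIndex()    # (line_text, line_number)
                        .flatMap(lambda li: [(w, "%s.%d.%d" % (doc, li[1], p))
                                             for p, w in enumerate(li[0].split())]))

    lemma_locations = words_with_loc.flatMap(lemmatize)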
Starting from word co-occurrence with 2-grams (two words co-occurring), increase the co-occurrence to n = 3 (three words co-occurring). Create a table of n-grams and their locations, as in the examples below. Discuss the results and plot the performance and scalability.
n-gram (n = 2)             Location
{wordx, wordy}             Document.chap#.line#.position#
etc.

n-gram (n = 3)             Location
{wordx, wordy, wordz}      Document.chap#.line#.position#
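As a sketch of how rows like those above might be produced: assuming an RDD lemmatized_lines of (line identifier, ordered list of lemmas) pairs built from the pass above, consecutive lemmas can be grouped into n-grams. The helper name make_ngrams and the variable lemmatized_lines are illustrative, not part of any provided code.

    def make_ngrams(line_record, n=2):
        # line_record = (line_id, [lemma0, lemma1, ...]); emit one
        # (n-gram tuple, line_id.position#) pair for each window of n consecutive lemmas.
        line_id, lemmas = line_record
        return [(tuple(lemmas[p:p + n]), "%s.%d" % (line_id, p))
                for p in range(len(lemmas) - n + 1)]

    # lemmatized_lines is assumed to hold (document.line#, [lemmas in order]) records.
    bigram_table  = lemmatized_lines.flatMap(lambda rec: make_ngrams(rec, n=2))
    trigram_table = lemmatized_lines.flatMap(lambda rec: make_ngrams(rec, n=3))

    # Collect every location of each n-gram, mirroring the tables above.
    bigram_locations = bigram_table.groupByKey().mapValues(list)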
In this activity you are required to scale up the word co-occurrence by increasing the number of documents processed from 2 to n. Record the performance of the Apache Spark infrastructure and plot it; a table like the ones above will then have thousands of entries. Add the documents incrementally, not all at once.
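One possible way to record Spark's performance as documents are added incrementally is to re-run the job on a growing subset of the files and time each run. The file names below are placeholders, and the line count inside the loop stands in for the full lemmatization and n-gram pipeline sketched above.

    import time

    document_files = ["document1.txt", "document2.txt"]   # extend as more texts are added

    timings = []
    for k in range(1, len(document_files) + 1):
        paths = ",".join(document_files[:k])   # sc.textFile accepts a comma-separated path list
        start = time.time()
        # Replace this line count with the full lemmatization + n-gram pipeline;
        # calling count() forces the job to execute so the elapsed time is meaningful.
        records = sc.textFile(paths).count()
        timings.append((k, time.time() - start))

    for k, seconds in timings:
        print("documents = %d, runtime = %.2f s" % (k, seconds))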