COMP9313 2017s2 Project 3
Problem 1 (8 pts): Find the top-k terms that appear in the most lines
Given a large text file, each term is contained in several lines. Your task is to find the top-k terms that appear in the most lines.
Ignore the letter case, i.e., consider all words as lower case.
Ignore terms starting with non-alphabetical characters, i.e., only consider terms starting with a letter from a to z.
Use the following split function to split the documents into terms:
split("[\\s*$&#/\"'\\,.:;?!\\[\\](){}<>~\\-_]+")
You can use the text file pg100.txt (available at: http://www.gutenberg.org/cache/epub/100/pg100.txt) as the sample input.
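The three tokenization rules above (lower-casing, splitting on delimiters, keeping only terms that start with a letter) can be sketched in plain Scala, without Spark. The object name TermSplitSketch and the function termsOf are hypothetical, and the delimiter pattern is an assumed reconstruction of the split expression in this spec, so check it against the official handout before relying on it.

```scala
object TermSplitSketch {
  // Assumption: reconstructed delimiter pattern; verify against the official spec.
  private val Delimiters = "[\\s*$&#/\"'\\,.:;?!\\[\\](){}<>~\\-_]+"

  /** Lower-cases a line, splits it into terms, and keeps terms starting with a-z. */
  def termsOf(line: String): Array[String] =
    line.toLowerCase
      .split(Delimiters)
      // split can yield an empty leading string when the line starts with a delimiter
      .filter(t => t.nonEmpty && t.head >= 'a' && t.head <= 'z')
}
```

For example, TermSplitSketch.termsOf("Hello, World! 2nd-hand (test)") yields the terms hello, world, hand and test; "2nd" is dropped because it starts with a digit.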
Output format:
Your Spark program should generate a list of k key-value pairs, where the keys are the terms, the values are the number of lines containing the term, and keys and values are separated by "\t". The key-value pairs are sorted in descending order according to the values.
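As a rough illustration of the counting and output rules, the sketch below uses ordinary Scala collections rather than Spark RDDs; the same steps map onto flatMap, reduceByKey and sortBy in a Spark program. The names TopKLinesSketch, topK and the tokenize parameter are hypothetical. The alphabetical tie-break between terms with equal counts is an assumption, since the spec does not say how ties are ordered.

```scala
object TopKLinesSketch {
  /** Returns the k terms contained in the most lines, each formatted as "term\tcount". */
  def topK(lines: Seq[String], k: Int, tokenize: String => Seq[String]): Seq[String] =
    lines
      .flatMap(line => tokenize(line).distinct) // a term counts at most once per line
      .groupBy(identity)
      .map { case (term, occurrences) => (term, occurrences.size) }
      .toSeq
      // Descending by line count; alphabetical tie-break is an assumption.
      .sortBy { case (term, count) => (-count, term) }
      .take(k)
      .map { case (term, count) => s"$term\t$count" }
}
```

With the lines "a b a", "b c" and "b a", a whitespace tokenizer and k = 2, the result is "b\t3" followed by "a\t2": the repeated "a" in the first line is counted only once.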
Code format:
Name your scala file as Problem1.scala, the object as Problem1, and put it in package comp9313.ass3. Store the final result in a text file on disk. Your program should take three parameters: the input text file, the output folder, and the value of k.
Problem 2 (12 pts): Reverse graph edge direction
Given a directed graph, reverse the direction of all edges.
Input files:
In the input file, each line contains a pair of node ids: FromNodeId\tToNodeId. In the above example, the input contains four lines: "1\t2", "1\t3", "3\t1", "3\t2".
The sample file tiny-web-Stanford.txt can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/17s2/resources/12579
Output format:
The output is the adjacency list of the reversed graph, and the nodes are sorted in ascending order in each list. Format each line as: NodeId\tNeighbor1,Neighbor2,...,Neighborm, using only one comma to separate the node IDs in the list.
Given the above example, the output file contains three lines: "1\t3", "2\t1,3", "3\t1".
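The four-edge example can be checked with a small plain-Scala sketch of the reversal logic (no Spark). The names ReverseGraphSketch and reverse are hypothetical. Two assumptions go beyond the spec: node IDs are parsed as integers so that sorting is numeric, and output lines are ordered by node ID as in the sample. Duplicate edges, if present in the input, are not removed here.

```scala
object ReverseGraphSketch {
  /** Reverses every "from\tto" edge and renders the sorted adjacency lists. */
  def reverse(lines: Seq[String]): Seq[String] =
    lines
      .map(_.split("\t"))
      // Swap direction; lines that are not a two-field pair are skipped.
      .collect { case Array(from, to) => (to.toInt, from.toInt) }
      .groupBy(_._1)
      .toSeq
      .sortBy(_._1) // assumption: order output lines by node ID
      .map { case (node, edges) =>
        val neighbours = edges.map(_._2).sorted.mkString(",")
        s"$node\t$neighbours"
      }
}
```

Calling ReverseGraphSketch.reverse on the edges 1→2, 1→3, 3→1 and 3→2 gives the three lines "1\t3", "2\t1,3" and "3\t1", matching the expected output.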
Code format:
Name your scala file as Problem2.scala, the object as Problem2, and put it in package comp9313.ass3. Store the final result in a text file on disk. Your program should take two parameters: the input graph file and the output folder.
Documentation and code readability
Your source code will be inspected and marked based on readability and ease of understanding. The documentation (comments in the code) in your source code is also important. Below is an indicative marking scheme:
Result correctness: 90%
Code structure, Readability, and Documentation: 10%
Submission:
Deadline: Sun 1st Oct 21:59:59
Log in to any CSE server (williams or wagner), and use the give command below to submit your solutions:
$ give cs9313 assignment3 Problem1.scala Problem2.scala
Or you can submit through:
https://cgi.cse.unsw.edu.au/~give/Student/give.php
If you submit your assignment more than once, the last submission will replace the previous one. To prove successful submission, please take a screenshot as the assignment submission instructions show, and keep it for your records.
Late submission penalty:
10% reduction of your marks for the 1st day, 30% reduction/day for the following days.
Plagiarism:
The work you submit must be your own work. Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined manually. Relevant scholarship authorities will be informed if students holding scholarships are involved in an incident of plagiarism or other misconduct. Do not provide or show your assignment work to any other person apart from the teaching staff of this subject. If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent.