COMP9313 2017s2 Project 4
Set Similarity Join Using MapReduce on AWS
Problem Definition:
Given two collections of records R and S, a similarity function sim(., .), and a threshold τ, the set similarity join between R and S is to find all record pairs r (from R) and s (from S) such that sim(r, s) >= τ.
In this project, you are required to use the Jaccard similarity function to compute sim(r, s). Given the following example, and setting τ = 0.5,
the result pairs are (r1, s1) (similarity 0.75), (r2, s2) (similarity 0.5), (r3, s1) (similarity 0.5), (r3, s2) (similarity 0.5).
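As a minimal sketch of the similarity function, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The class name `JaccardSketch` and the treatment of two empty sets are assumptions made for illustration only; they are not part of the assignment.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative helper (hypothetical name), not the required SetSimJoin class. */
final class JaccardSketch {

    /** Returns |r ∩ s| / |r ∪ s| for two element sets. */
    static double jaccard(Set<Integer> r, Set<Integer> s) {
        if (r.isEmpty() && s.isEmpty()) {
            return 0.0; // assumption: two empty records are treated as dissimilar
        }
        // Count the overlap by probing the larger set with the smaller one.
        Set<Integer> small = r.size() <= s.size() ? r : s;
        Set<Integer> large = (small == r) ? s : r;
        int overlap = 0;
        for (Integer e : small) {
            if (large.contains(e)) {
                overlap++;
            }
        }
        // |r ∪ s| = |r| + |s| - |r ∩ s|
        int union = r.size() + s.size() - overlap;
        return (double) overlap / union;
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 4, 5, 6));
        Set<Integer> b = new HashSet<>(Arrays.asList(4, 5, 6));
        System.out.println(jaccard(a, b)); // prints 0.75 (3 shared elements, 4 in the union)
    }
}
```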
Input files:
You are required to perform a self-join: a single input file is given, in which each line has the format:
RecordId list<ElementId>,
and this file serves as both R and S.
An example input file is shown below (integers are separated by spaces):
0 1 4 5 6
1 2 3 6
2 4 5 6
3 1 4 6
4 2 5 6
5 3 5
This sample file tiny-data.txt can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/17s2/resources/12771
Another sample input file flickr_small.txt can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/17s2/resources/12772
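The input line format can be parsed as sketched below: the first integer is taken as the RecordId and the remaining integers as the element set. The class `RecordParserSketch` and its field names are hypothetical, and the sketch assumes every line is non-empty and contains only whitespace-separated integers.

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative record holder (hypothetical name) for one input line. */
final class RecordParserSketch {
    final int recordId;
    final Set<Integer> elements;

    private RecordParserSketch(int recordId, Set<Integer> elements) {
        this.recordId = recordId;
        this.elements = elements;
    }

    /** Parses "RecordId e1 e2 ..." where tokens are separated by whitespace. */
    static RecordParserSketch parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        int id = Integer.parseInt(tokens[0]);
        // A TreeSet removes duplicate elements and keeps them in ascending order.
        Set<Integer> elems = new TreeSet<>();
        for (int i = 1; i < tokens.length; i++) {
            elems.add(Integer.parseInt(tokens[i]));
        }
        return new RecordParserSketch(id, elems);
    }

    public static void main(String[] args) {
        RecordParserSketch rec = parse("0 1 4 5 6");
        System.out.println(rec.recordId + " -> " + rec.elements); // prints 0 -> [1, 4, 5, 6]
    }
}
```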
Output:
The output file contains the similar pairs together with their similarities with regard to the given threshold. Each line is in the format (RecordId1,RecordId2)\tSimilarity, where RecordId1 < RecordId2, which ensures there are no duplicate pairs in the result. The similarities are of double precision. The pairs are sorted in ascending order (by the first record and then by the second).
Given the example input data, the output file looks like:
(0,2)\t0.75
(0,3)\t0.75
(1,4)\t0.5
(2,3)\t0.5
(2,4)\t0.5
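For local testing, a brute-force reference join can be useful to check the output format and ordering on small inputs. The sketch below is a plain single-machine comparison of all pairs, not a MapReduce solution and not an efficient one; the class name `BruteForceJoinSketch` is hypothetical, and it assumes the separator between the pair and the similarity is a tab character and that a threshold of 0.5 is used for the sample file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Quadratic reference join (hypothetical name), intended only for tiny test files. */
final class BruteForceJoinSketch {

    static List<String> join(List<String> lines, double threshold) {
        // A TreeMap keeps the records ordered by RecordId.
        TreeMap<Integer, Set<Integer>> records = new TreeMap<>();
        for (String line : lines) {
            String[] t = line.trim().split("\\s+");
            Set<Integer> elems = new HashSet<>();
            for (int i = 1; i < t.length; i++) {
                elems.add(Integer.parseInt(t[i]));
            }
            records.put(Integer.parseInt(t[0]), elems);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Set<Integer>> r : records.entrySet()) {
            // tailMap(key, false) yields only records with a strictly larger id,
            // so each pair appears once with RecordId1 < RecordId2.
            for (Map.Entry<Integer, Set<Integer>> s
                    : records.tailMap(r.getKey(), false).entrySet()) {
                Set<Integer> common = new HashSet<>(r.getValue());
                common.retainAll(s.getValue());
                int union = r.getValue().size() + s.getValue().size() - common.size();
                double sim = union == 0 ? 0.0 : (double) common.size() / union;
                if (sim >= threshold) {
                    out.add("(" + r.getKey() + "," + s.getKey() + ")\t" + sim);
                }
            }
        }
        return out; // already sorted by first id, then second id
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "0 1 4 5 6", "1 2 3 6", "2 4 5 6", "3 1 4 6", "4 2 5 6", "5 3 5");
        join(sample, 0.5).forEach(System.out::println);
    }
}
```

Run on the six-line example input with threshold 0.5, this produces the five pairs listed above, which makes it a convenient baseline for comparing against the output of a MapReduce job on small files.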
Code format:
Name your Java file SetSimJoin.java, and put it in the package comp9313.ass4. Your program should take four parameters: the input file, the output folder, the similarity threshold (double precision), and the number of reducers.
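The four parameters could be read and validated along the following lines. This is a sketch only: the class `JoinArgsSketch`, its field names, the usage message, and the assumption that the threshold lies in [0, 1] are all illustrative and not prescribed by the assignment.

```java
/** Illustrative holder (hypothetical name) for the four command-line parameters. */
final class JoinArgsSketch {
    final String inputPath;
    final String outputPath;
    final double threshold;
    final int numReducers;

    private JoinArgsSketch(String in, String out, double threshold, int numReducers) {
        this.inputPath = in;
        this.outputPath = out;
        this.threshold = threshold;
        this.numReducers = numReducers;
    }

    /** Expects: input file, output folder, similarity threshold, number of reducers. */
    static JoinArgsSketch parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                    "Usage: SetSimJoin <input> <output> <threshold> <numReducers>");
        }
        double threshold = Double.parseDouble(args[2]);
        int numReducers = Integer.parseInt(args[3]);
        // Assumption: a Jaccard threshold is only meaningful within [0, 1].
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0, 1]");
        }
        if (numReducers < 1) {
            throw new IllegalArgumentException("numReducers must be at least 1");
        }
        return new JoinArgsSketch(args[0], args[1], threshold, numReducers);
    }

    public static void main(String[] args) {
        JoinArgsSketch a = parse(new String[] {"input.txt", "output", "0.85", "3"});
        System.out.println(a.threshold + " " + a.numReducers); // prints 0.85 3
    }
}
```

In a Hadoop driver, values like these would typically be handed to `FileInputFormat.addInputPath`, `FileOutputFormat.setOutputPath`, and `Job.setNumReduceTasks`, with the threshold passed to mappers and reducers through the job `Configuration` (for example via `setDouble` under a key name of your choosing).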
Cluster configuration:
Create an S3 bucket named comp9313.<YOUR_STUDENTID> in AWS. Create a folder project4 in this bucket for holding the input files.
This project aims to let you see the power of distributed computation. Your code should scale well with the number of nodes used in a cluster. You are required to create three clusters in AWS to run the same job:
Cluster1: 2 nodes of instance type m3.xlarge (one reduce task);
Cluster2: 4 nodes of instance type m3.xlarge (three reduce tasks);
Cluster3: 6 nodes of instance type m3.xlarge (five reduce tasks).
Select release EMR 5.9.0 when creating each cluster. Upload the following data set to your S3 bucket, and set the similarity threshold to 0.85 to run your program:
https://webcms3.cse.unsw.edu.au/COMP9313/17s2/resources/12773
Record the runtime on each cluster, and draw a figure where the x-axis is the number of nodes you used and the y-axis is the time taken to get the result. Store this figure in a file Runtime.jpg. Please also take a screenshot of your program running on AWS in each cluster as proof of the runtime. Compress the three screenshots into a zip file Screenshots.zip. Briefly describe your optimization techniques in a file Optimization.pdf.
Notes
Create a project locally in Eclipse, test everything on your local computer, and finally run it on AWS EMR.
Documentation and code readability
Your source code will be inspected and marked based on readability and ease of understanding. The efficiency and scalability of this project are very important and will be evaluated as well. Below is an indicative marking scheme:
Result correctness: 70%
Efficiency and Scalability: 20%
Code structure, Readability, and Documentation: 10%
Submission:
Deadline: Sunday 29th Oct, 09:59:59 PM
Log in to any CSE server (williams or wagner), and use the give command below to submit your solutions:
$ give cs9313 assignment4 SetSimJoin.java Runtime.jpg Screenshots.zip Optimization.pdf
Or you can submit through:
https://cgi.cse.unsw.edu.au/~give/Student/give.php
If you submit your assignment more than once, the last submission will replace the previous one. To prove successful submission, please take a screenshot as shown in the assignment submission instructions and keep it for your records.
Late submission penalty
10% reduction of your marks for the 1st day, 30% reduction/day for the following days.
Plagiarism:
The work you submit must be your own work. Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined manually. Relevant scholarship authorities will be informed if students holding scholarships are involved in an incident of plagiarism or other misconduct. Do not provide or show your assignment work to any other person apart from the teaching staff of this subject. If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent.