INF553 Spring 2018
Assignment 4 Community Detection
Deadline: 04/09/2018 11:59 PM PST
Assignment Overview
In this assignment you are asked to implement the Girvan-Newman algorithm using the Spark
framework in order to detect communities in a graph. You will use only the video_small_num.csv
dataset to find users who have similar product tastes. The goal of this assignment is to
help you understand how to use the Girvan-Newman algorithm to detect communities
efficiently by programming it within a distributed environment.
Environment Requirements
Python: 2.7 Scala: 2.11 Spark: 2.2.1
IMPORTANT: We will use these versions to compile and test your code. If you use other versions,
there will be a 20% penalty since we will not be able to grade it automatically.
You can only use Spark RDD.
Write your own code!
For this assignment to be an effective learning experience, you must write your own code! I
emphasize this point because you will be able to find Python implementations of most or
perhaps even all of the required functions on the web. Please do not look for or at any such
code! Do not share code with other students in the class!!
Submission Details
For this assignment you will need to turn in a Python, Java, or Scala program depending on your
language of preference.
Your submission must be a .zip file with name:
The structure of your submission should be identical to that shown below. The
Firstname_Lastname_Description.pdf file contains helpful instructions on how to run your code
along with other necessary information as described in the following sections. The OutputFiles
directory contains the deliverable output files for each problem and the Solution directory
contains your source code.
Datasets
We are continuing to use the Amazon Review data. This time we use a subset of the Amazon Instant
Video category. We have already converted the string ids of users and products to integers for
your convenience. You should download one file from Blackboard:
1. video_small_num.csv
Construct Graph
Each node represents a user. Each edge is generated in the following way:
In video_small_num.csv, count the number of times that two users rated the same product. If
the count is greater than or equal to 7, there is an edge between the two users.
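The edge rule above can be sketched in plain Python as follows. This is only an illustration of the counting logic, not a Spark RDD solution: the function name build_edges, the (user_id, product_id) input shape, and the choice to count each shared product once per user pair are assumptions, not part of the handout.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(ratings, threshold=7):
    """Return the set of undirected edges (u, v) with u < v.

    ratings: iterable of (user_id, product_id) pairs (assumed input shape).
    threshold: minimum number of co-rated products needed for an edge.
    """
    # Collect the distinct products each user has rated.
    products_by_user = defaultdict(set)
    for user, product in ratings:
        products_by_user[user].add(product)

    edges = set()
    # Compare every pair of users once; sorting gives u < v in each pair.
    for u, v in combinations(sorted(products_by_user), 2):
        shared = products_by_user[u] & products_by_user[v]
        if len(shared) >= threshold:
            edges.add((u, v))
    return edges
```

With the assignment's rule the threshold stays at its default of 7; a smaller value is convenient for checking the logic on a toy input.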
Task 1: Betweenness (50%)
You are required to implement the Girvan-Newman algorithm to find the betweenness of each
edge in the graph. The betweenness function should be calculated only once from the
original graph.
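As a rough illustration of the betweenness step, the sketch below runs a breadth-first search from every node, counts shortest paths, and pushes credit back up from the leaves. It is plain Python rather than Spark RDD code, and the function name edge_betweenness and the adjacency-dict input are assumptions. Dividing the totals by 2 (because every shortest path is seen from both of its endpoints) follows the common textbook convention and is also an assumption about what the graders expect.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """adj: dict mapping each node to the set of its neighbours (undirected)."""
    betweenness = defaultdict(float)
    for root in adj:
        # Step 1: BFS from root, recording levels and shortest-path counts.
        dist = {root: 0}
        num_paths = defaultdict(float)
        num_paths[root] = 1.0
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    num_paths[nb] += num_paths[node]
                    parents[nb].append(node)
        # Step 2: walk back from the deepest nodes, splitting each node's
        # credit among its parents in proportion to their path counts.
        credit = dict((n, 1.0) for n in order)
        for node in reversed(order):
            for p in parents[node]:
                share = credit[node] * num_paths[p] / num_paths[node]
                betweenness[(min(p, node), max(p, node))] += share
                credit[p] += share
    # Each shortest path was counted from both endpoints, so halve the sums.
    return dict((edge, value / 2.0) for edge, value in betweenness.items())
```

On a three-node path 1-2-3, both edges come out with betweenness 2.0 under this convention.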
Execution Example
The first argument passed to your program (in the execution below) is the path of the
video_small_num.csv file (e.g. spark-2.2.1-bin-hadoop2.7/HW4/video_small_num.csv). The
second input is the output path (the output path is the directory of your output file, not
including the file name, e.g. spark-2.2.1-bin-hadoop2.7/HW4/). Below we present examples of how
you can run your program with spark-submit, both when your application is a Java/Scala program
and when it is a Python script.
A. Example of running a Java/Scala application with spark-submit:
Notice that the class argument of spark-submit specifies the main class of your
application, and it is followed by the jar file of the application.
You should use Betweenness as your class name for this task.
B. Example of running a Python application with spark-submit:
Result format:
Each line is a tuple in the format (userId1, userId2, betweenness value). The file is ordered by
the first element in ascending order and, if the first elements are equal, by the second
element. The example is as follows: (the example just shows the format, is NOT a solution)
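The ordering rule described above could be produced along these lines. This is a sketch only: the function name format_betweenness is hypothetical, the input is assumed to be a dict keyed by (userId1, userId2) with userId1 < userId2, and the exact spacing and number formatting of each line are assumptions that should be checked against the official sample output.

```python
def format_betweenness(betweenness):
    """Return output lines sorted by first user id, then by second user id.

    betweenness: dict mapping (userId1, userId2) -> betweenness value,
    with userId1 < userId2 in every key (assumed).
    """
    lines = []
    # Tuples sort by first element, then second, which matches the rule.
    for (u, v), value in sorted(betweenness.items()):
        lines.append("(%d,%d,%s)" % (u, v, value))
    return lines
```

Writing all lines to a single file keeps the submission within the one-output-file rule in the grading criteria.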
Runtime Requirement:
<60 sec
Task 2: Detect Community (50%)
You are required to implement betweenness and modularity in this task. You also need to divide
the graph into suitable communities, choosing the division that reaches the highest modularity.
When you use the following formula to calculate the modularity of partition S of G, you should be
aware that Aij should remain the same as in the original graph (i.e. Aij does not change while
you delete any edge).
Execution Example
The first argument passed to your program (in the execution below) is the path of the
video_small_num.csv file (e.g. spark-2.2.1-bin-hadoop2.7/HW4/video_small_num.csv). The
second input is the output path (the output path is the directory of your output file, not
including the file name, e.g. spark-2.2.1-bin-hadoop2.7/HW4/). Below we present examples of how
you can run your program with spark-submit, both when your application is a Java/Scala program
and when it is a Python script.
A. Example of running a Java/Scala application with spark-submit:
Notice that the class argument of spark-submit specifies the main class of your
application, and it is followed by the jar file of the application.
You should use Community as your class name for the task.
B. Example of running a Python application with spark-submit:
Result format:
Each list is a community containing userIds. In each list, the userIds should be in ascending
order, and all lists should be ordered by the first userId in each list in ascending order. An
example is as follows: (the example just shows the format, is NOT a solution)
Runtime Requirement:
<60 sec
Description File
Please include the following content in your description file:
1. Mention the Spark version and Python version
2. Describe how to run your program for both tasks
Submission Details
Your submission must be a .zip file with name:
Please include all the files in the right directory as follows:
1. A description file:
2. All Scala scripts:
3. A jar package for all Scala files:
If you use Scala for all tasks, please make all *.scala files into ONLY ONE jar,
and DO NOT include any data or unrelated libraries in your jar.
4. If you use Python, then all Python scripts:
5. Required result files for tasks 1 & 2:
Grading Criteria:
1. If your programs cannot run with the commands you provide, your submission will be graded
based on the result files you submit, and there will be an 80% penalty.
2. If the files generated are not sorted based on the specifications, there will be a 20% penalty.
3. If your program generates more than one file, there will be a 20% penalty.
4. If the runtime of your program exceeds the runtime requirement, there will be a 20% penalty.
5. If you do not provide the source code, especially the Scala scripts, there will be a 20% penalty.
6. You can use your free 5-day extension.
7. There will be a 10% bonus if you use Scala for the entire assignment.