Assignment 2
Cluster and Cloud Computing Assignment 2
Australian City Analytics
Background
In the development and delivery of non-trivial software systems, working as part of a team is generally
(typically!) the norm. This assignment is very much a group project. Students will be put into
software teams to work on the implementation of the system described below. These will be teams
of up to 5 students. In this assignment, students need to organize their team and their collective
involvement throughout. There is no team leader as such, but teams may decide to set up processes
for agreeing on the work and who does what. Understanding the dependencies between individual
efforts and their successful integration is key to the success of the work and for software
engineering projects more generally.
Assignment Description
The software engineering activity builds on the lecture materials describing Cloud systems and
especially the NeCTAR Research Cloud and its use of OpenStack; on the Twitter APIs, and CouchDB
and the kinds of data analytics (e.g. MapReduce) that CouchDB supports, as well as data from the
Australian Urban Research Infrastructure Network (AURIN, https://portal.aurin.org.au). The focus
of this assignment is to harvest as many tweets as possible from across the cities of Australia on the
NeCTAR Research Cloud and undertake a variety of social media data analytics scenarios that tell
interesting stories of life in your cities and, importantly, how the Twitter data can be used
alongside/compared with/augment the data available within the AURIN platform to improve our
knowledge of life in the cities of Australia. Teams are expected to download data from the AURIN
platform and include this into their CouchDB database for analysis with Twitter data.
The teams should develop a Cloud-based solution that exploits a multitude of virtual machines
(VMs) across the NeCTAR Research Cloud for harvesting tweets through the Twitter APIs (using
both the Streaming and the Search API interfaces). The teams should produce a solution that can be
run (in principle) across any node of the NeCTAR Research Cloud to harvest and store tweets. Teams
have been allocated four medium sized VMs with 8 cores (32Gb memory total) and up to 250Gb of
volume storage and 100Gb of object storage. All students have access to the NeCTAR Research Cloud
as individual users and can test/develop their applications using their own (small) VM instances.
(Remembering that there is no persistence in these small, free and dynamically allocated VMs).
The solution should include a Twitter harvesting application for any/all of the cities of Australia. The
teams are expected to have multiple instances of this application running on the NeCTAR Research
Cloud together with an associated CouchDB database containing the amalgamated collection of
Tweets from the harvester applications. The CouchDB setup may be a single node or a replicated
setup. A key aspect of this is in removing duplicate tweets, i.e. the system should be designed such
that duplicate tweets will not arise.
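One way to make duplicates impossible by construction is to use the tweet's own id as the CouchDB document `_id`, so that a second write of the same tweet is rejected rather than stored twice. The sketch below illustrates the idea with an in-memory stand-in for the database. `InMemoryStore`, `to_document` and `store_tweet` are hypothetical names, and the `id_str` and `text` fields are assumed to be present in the harvested tweet. A real harvester would issue the equivalent write against CouchDB and treat an HTTP 409 (conflict) response as "already stored".

```python
class DuplicateDocument(Exception):
    """Raised when a document with the same _id already exists."""


class InMemoryStore:
    """Hypothetical stand-in for a CouchDB database, keyed by document _id."""

    def __init__(self):
        self.docs = {}

    def put(self, doc):
        # CouchDB rejects a new document whose _id is already taken (HTTP 409);
        # this stand-in raises an exception instead.
        if doc["_id"] in self.docs:
            raise DuplicateDocument(doc["_id"])
        self.docs[doc["_id"]] = doc


def to_document(tweet):
    """Build a document whose _id is the tweet id, so each tweet maps to one key."""
    return {"_id": str(tweet["id_str"]), "text": tweet["text"]}


def store_tweet(store, tweet):
    """Store a tweet; return True if it was new, False if it was a duplicate."""
    try:
        store.put(to_document(tweet))
        return True
    except DuplicateDocument:
        return False
```

Because the uniqueness check happens in the database rather than in each harvester, several harvester instances can write to the same database without coordinating with one another.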
Teams are also expected to develop a range of analytic scenarios, e.g. using the MapReduce
capabilities offered by CouchDB for their allocated city. All teams must support sentiment analysis
of their city, e.g. searching for tweets containing positive sentiments (happy, ecstatic, ...), negative
sentiments (unhappy, terrible, ...) or emoticons like ;o), :o), :), or :o(, >:o( etc. and establishing
whether people are happier in the morning or in the night time or if there are parts of their cities
that are happier than others. Correlating such data with AURIN data should be supported. Teams
should actively explore more advanced solutions for sentiment analysis rather than simple term
searching, e.g. "not happy" is a negative sentiment. In addition to this sentiment analysis scenario,
teams should explore other scenarios based on their cities using the AURIN data. Teams are
encouraged to be creative here. A prize will be awarded for the most interesting scenarios identified!
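As a minimal sketch of going one step beyond simple term searching, the function below flips the polarity of a sentiment word when it directly follows a negator, so that "not happy" counts as negative. The word lists are tiny illustrative placeholders, and `sentiment_score` is a hypothetical helper, not a recommended library. Teams would more likely build on an existing sentiment analysis package.

```python
import re

# Tiny illustrative word lists; a real lexicon would be far larger.
POSITIVE = {"happy", "ecstatic"}
NEGATIVE = {"unhappy", "terrible"}
NEGATORS = {"not", "never", "no"}


def sentiment_score(text):
    """Return a score: > 0 positive, < 0 negative, 0 neutral."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            polarity = 1
        elif token in NEGATIVE:
            polarity = -1
        else:
            continue
        # Flip the polarity when the previous word is a negator ("not happy").
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score
```

This only looks one word back, so it still misses sarcasm, contractions such as "isn't", and negators that sit further from the sentiment word.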
For example, teams may look at scenarios such as:
• Which suburb has the most tweeters and does this correlate with what we might expect from
  the population demographic of the suburb from AURIN, e.g. more young people live in a
  given area so we might expect a proportionate increase in the number of tweets (assuming
  young people tweet more)?
• Do the different languages used when tweeting correlate with the cultures we would
  expect to find in those areas, e.g. more Chinese live in Box Hill in Melbourne hence we would
  expect to see more tweets tagged as Chinese from those suburbs, or Italians in Carlton etc.?
• Are people happier (express more positive sentiment) in areas with more wealth?
• Is there a correlation between crime related tweets and official crime statistics across the
  suburbs of Melbourne?
• Is there a correlation between alcohol related tweets and locations of places to buy alcohol
  (bottle shops)?
• Does language use, e.g. vulgar words used in Twitter, happen more or less in wealthy or poor
  areas?
The above are examples; students may decide to create their own analytics based on the data they
obtain. Students are not expected to build advanced general purpose data analytic services that
can support any scenario, but show how tools like CouchDB with targeted data analysis capabilities
like MapReduce, when provided with suitable inputs, can be used to capture the essence of life in a
city. Teams are encouraged to combine Twitter data with AURIN data and potentially other data of
relevance to the city, e.g. information on weather, sport events, TV shows, visiting celebrities, stock
market rise/falls etc.
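Several of the scenarios above ask whether a tweet-derived measure correlates with an AURIN measure across suburbs. A minimal sketch of that comparison, assuming both measures have already been aggregated into `{suburb: value}` mappings, is a Pearson correlation over the suburbs the two data sets share. `pearson` and `correlate_by_suburb` are hypothetical helper names, and Pearson is only one possible choice of statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlate_by_suburb(tweet_metric, aurin_metric):
    """Correlate two {suburb: value} mappings over the suburbs they share."""
    shared = sorted(set(tweet_metric) & set(aurin_metric))
    return pearson([tweet_metric[s] for s in shared],
                   [aurin_metric[s] for s in shared])
```

Suburbs that appear in only one data set are dropped, so it is worth reporting how many suburbs actually contributed to any coefficient quoted in the report.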
The result of the assignment will be a fully populated instance (singular) of a CouchDB with a range
of data analytics stories associated with the selected cities. A front-end web application is required
for visualising these data sets/scenarios.
For the implementation, teams are recommended to use a commonly understood language across
team members, most likely Java or Python. Information on building and using Twitter harvesters
can be found on the web, e.g. see https://dev.twitter.com/ and related links to resources such as
Tweepy and Twitter4J. Teams are free to use any pre-existing software systems that they deem
appropriate for the analysis. This can include sentiment analysis libraries, gender identification
libraries, and machine learning systems as well as front-end JavaScript libraries and visualisation
capabilities, e.g. Google Maps.
Error Handling
Issues and challenges in using the NeCTAR Research Cloud for this assignment should be
documented. You should describe in detail the limitations of mining Twitter content and language
processing (e.g. sarcasm). You should outline any solutions developed to tackle such scenarios.
Removing duplicates of tweets should be handled. The database may however contain re-tweets.
You should demonstrate how you tackled working within the quota imposed by the Twitter APIs
through the use of the Cloud. You should describe how your system was designed to be robust and
provide any degrees of fault tolerance.
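One common ingredient of working within an API quota is to wait and retry when a request is refused, rather than letting the harvester crash. The sketch below shows exponential backoff around an arbitrary request function. `RateLimited` and `call_with_backoff` are hypothetical names, and how a real client library signals that the quota is exhausted is an assumption that must be adapted to the library in use.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal that the API refused a request due to quota."""


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponentially growing waits when rate limited."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting. Backoff only handles a single client's quota; spreading harvesters across VMs and credentials is a separate design decision.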
Final packaging and delivery
You should collectively write a team report on the application developed and include the
architecture, the system design and the discussions that led into the design. You should describe
the role of the team members in the delivery of the system and where the team worked well and
where issues arose and how they were addressed. The team should illustrate the functionality of the
system through a range of scenarios and explain why you chose the specific examples. Teams are
encouraged to write this report in the style of a paper that can ultimately be submitted to a
conference/journal.
Each team member is also expected to complete a confidential report on their role in the project and
the experiences in working with their individual team members. This will be handed in separately to
the final team report. (This is not to be used to blame people, but to ensure that all team members
are able to provide feedback and to ensure that no team has any member that does nothing!!!).
The length of the team report is not fixed. Given the level of complexity of the assignment and total
value of the assignment, a suitable estimate is a report in the range of 20-25 pages. A typical report
will comprise:
• A description of the system functionalities, the scenarios supported and why, together with
  graphical results, e.g. pie-charts/graphs of Tweet analysis and snapshots of the web
  apps/maps displaying certain Tweet scenarios;
• A simple user guide for testing (including system deployment and end user invocation/usage
  of the systems);
• System design and architecture and how/why this was chosen;
• A discussion on the pros and cons of the NeCTAR Research Cloud and tools and processes for
  image creation and deployment;
• Teams should also produce a video of their system that is uploaded to YouTube (these videos
  can last longer than the NeCTAR deployments unfortunately!)
It is important to put your collective team details (team, city, names, surnames, student ids) in:
• the head page of the report;
• as a header in each of the files of the software project.
Individual reports describing your role and your team's contributions should be handed in
separately.
Implementation Requirements
Teams are expected to use:
• a version-control system such as GitHub for sharing source code.
• MapReduce based implementations for analytics where appropriate, using CouchDB's built-in
  MapReduce capabilities. You may also consider using Hadoop/Spark for this task if
  desired.
• The entire system should have scripted deployment capabilities. This means that your team
  will provide a script, which, when executed, will create and deploy the virtual machines and
  orchestrate the setup of all necessary software on said machines (e.g. CouchDB, the Twitter
  harvesters, web servers etc.) to create a ready-to-run system. Note that this setup need not
  populate the database, but demonstrate your ability to orchestrate the necessary software
  environment on the NeCTAR Research Cloud. Teams should use Ansible
  (http://www.ansible.com/home) for this task.
• The server side of your analytics web application may expose its data to the client through a
  ReSTful design. Authentication or authorization is NOT required for the web front end.
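To make the MapReduce requirement concrete, the sketch below emulates in plain Python the shape of a view that counts tweets per city and time of day: a map step emits a key and a value for each document, and a reduce step sums the values for each key. CouchDB views themselves are normally written as JavaScript map functions, often paired with a built-in reduce such as `_sum`. The `city` and `hour` fields, the noon cut-off between "morning" and "night", and all function names here are assumptions for illustration only.

```python
from collections import defaultdict


def map_tweet(doc):
    """Map step: emit ((city, period), 1) for each tweet document."""
    # Assumed split: hours before 12 count as morning, the rest as night.
    period = "morning" if doc["hour"] < 12 else "night"
    yield (doc["city"], period), 1


def reduce_counts(values):
    """Reduce step: sum the values emitted for one key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}
```

Emitting a sentiment score instead of the constant 1 turns the same structure into a per-city, per-period sentiment total.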
Teams are also encouraged to describe:
• How fault-tolerant is your software setup? Is there a single point-of-failure?
• Can your application and infrastructure dynamically scale out to meet demand?
Deadline
One copy of the team assignment is to be submitted through the LMS. The zip file must be named
with your team, i.e.
Individual reports describing your role and your team's contributions should be submitted via
PRAZE on the LMS. These individual reports will be completed through web-based forms, not as
Word/PDF documents etc.
The deadline for submitting the team assignment is: Thursday 11th May (by 1pm!).
Marking
The marking process will be structured by evaluating whether the assignment (application + report)
is compliant with the specification given. This implies the following:
• A working demonstration of the Cloud-based solution with dynamic deployment: 25% marks
• A working demonstration of tweet harvesting and CouchDB utilization for specific analytics
  scenarios: 25% marks
• Detailed documentation on the system architecture and design: 20% marks
• Report and write-up discussion including pros and cons of the NeCTAR Research Cloud and
  supporting Twitter data analytics: 20% marks
• Proper handling of the errors and removal of duplicate tweets: 10% marks
The (confidential) assessment by your peers in your team on PRAZE will be used to weight your
individual scores accordingly.
Timeliness in submitting the assignment in the proper format is important. A 10% deduction per
day will be made for late submissions.
Demonstration Schedule and Venue
The student teams are required to give a presentation (with a few slides) and a demonstration of the
working application. This should include the key Twitter analytics scenarios supported as well as the
design and implementation choices made. Each team has up to 15 minutes to present their work.
This will take place on Thursday 11th May (12 teams present) and 18th May (12 teams
present). Note that given the numbers of teams this year, not all teams will be able to present;
however, all teams should be prepared to present on 11th May!!! I will randomly identify a
team on the day (using a random number generator for fairness!!!)
As a team, you are free to develop your system(s) wherever you are most comfortable (at home,
on your PC/laptop, in the labs) but obviously the demonstration should work on the NeCTAR
Research Cloud.