COMP3308 Introduction to Artificial Intelligence, Semester 1, 2018
Assignment 2: Classification

Deadlines
Submission: 5pm, Friday 18th May, 2018 (week 10)
This assignment is worth 20% of your final mark.
Task description
In this assignment you will implement the K-Nearest Neighbour and Naïve Bayes algorithms and evaluate them on a real dataset using the stratified cross-validation method. You will also evaluate the performance of other classifiers on the same dataset using Weka. Finally, you will investigate the effect of feature selection, in particular the Correlation-based Feature Selection (CFS) method from Weka.
Late submissions policy
No late submissions are allowed.
Programming languages
Your implementation can be written in Python, Java, C, C++ or MATLAB. The assignment will be tested on the University machines, so your code must be compatible with the language version installed on those machines. You are not allowed to use any of the built-in classification libraries for the purposes of this assignment.
Submission and pair work
Your assignment can be completed individually or in pairs. See the submission details section for more information about how to submit.
This assignment will be submitted using the submission system PASTA (https://comp3308.it.usyd.edu.au/PASTA/). In order to connect to the website, you'll need to be connected to the university VPN. You can read this page to find out how to connect to the VPN. PASTA will allow you to make as many submissions as you wish, and each submission will provide you with feedback on each of the components of the assignment. Your last submission before the assignment deadline will be marked, and the mark displayed on PASTA will be the final mark for your code (12 marks).
1. Data
The dataset for this assignment is the Pima Indian Diabetes dataset. It contains 768 instances described by 8 numeric attributes. There are two classes: yes and no. Each entry in the dataset corresponds to a patient's record; the attributes are personal characteristics and test measurements; the class shows whether the person shows signs of diabetes or not. The patients are of Pima Indian heritage, hence the name of the dataset.
A copy of the dataset can be downloaded from Canvas. There are 2 files associated with the dataset. The first file, *.names, describes the data, including the number and the type of the attributes and classes, as well as their meaning. The second file, *.data, contains the data itself. Your task is to predict the class, where the class can be yes or no.
Note: The original dataset can be sourced from the UCI Machine Learning Repository. However, you need to use the dataset available on Canvas as it has been modified for consistency.
2. Data preprocessing
Read the pima-indians-diabetes.names file and learn more about the meaning of the attributes and the classes. Use Weka's in-built normalisation filter to normalise the values of each attribute to make sure they are in the range [0, 1]. The normalisation should be done along each column (attribute), not each row (entry). The class attribute is not normalised; it should remain unchanged. Save the preprocessed file as pima.csv.
Warning: In order to ensure that Weka can process the data, you will need to add headers to the data file and save it as a .csv file. You can do this in any text editor. The headers should be removed after preprocessing.
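For reference, the per-column min-max scaling performed by Weka's Normalize filter can be sketched as follows. This is a Python illustration of what the filter computes, not a replacement for the Weka step; the four-decimal output format is an assumption for display only.

```python
def min_max_normalise(rows):
    """Min-max normalise each attribute column to [0, 1], leaving the
    class label (last column) unchanged. Illustrative sketch of what
    Weka's unsupervised Normalize filter does per column."""
    n_attrs = len(rows[0]) - 1  # last column is the class
    cols = [[float(r[i]) for r in rows] for i in range(n_attrs)]
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    out = []
    for r in rows:
        scaled = [
            (float(r[i]) - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
            for i in range(n_attrs)
        ]
        out.append([f"{v:.4f}" for v in scaled] + [r[-1]])
    return out
```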
3. Classification algorithms
K-Nearest Neighbour
The K-Nearest Neighbour algorithm should be implemented for any K value and should use Euclidean distance as the distance measure. If there is ever a tie between the two classes, choose class yes.
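In outline, the classification step can be sketched as below. This is a minimal Python sketch assuming class labels are the strings yes and no, not a complete submission; the tie-break in favour of yes matches the rule above.

```python
from math import sqrt

def knn_predict(train, test_point, k):
    """Classify one test point with k-Nearest Neighbour.
    train: list of (attributes, label) pairs, attributes as lists of floats.
    Ties between yes and no are broken in favour of yes, as required."""
    def euclidean(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], test_point))[:k]
    yes_votes = sum(1 for _, label in neighbours if label == "yes")
    return "yes" if yes_votes >= k - yes_votes else "no"
```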
Naïve Bayes
The Naïve Bayes classifier should be implemented for numeric attributes, using a probability density function. Assume a normal distribution, i.e. use the probability density function for a normal distribution. As before, if there is ever a tie between the two classes, choose class yes.
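As a sketch of the computation: estimate a mean and standard deviation per attribute per class from the training data, then score each class as the prior times the product of the normal densities. This Python sketch assumes at least two training examples per class and non-zero variance in every attribute; how to handle degenerate cases is left to your implementation.

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mean, std):
    """Probability density function of a normal distribution."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)

def nb_predict(train, test_point):
    """Naive Bayes for numeric attributes with a normal density per
    attribute per class. Ties favour yes. Illustrative sketch only;
    assumes >= 2 examples per class and non-zero sample variance."""
    classes = {}
    for attrs, label in train:
        classes.setdefault(label, []).append(attrs)
    scores = {}
    for label, rows in classes.items():
        score = len(rows) / len(train)  # prior P(class)
        for i in range(len(test_point)):
            col = [r[i] for r in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / (len(col) - 1)
            score *= gaussian_pdf(test_point[i], mean, sqrt(var))
        scores[label] = score
    return "yes" if scores.get("yes", 0) >= scores.get("no", 0) else "no"
```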
Note: Carefully read section 6 to find out how your program will be expected to receive input and give output.
4. 10-fold stratified cross-validation
In order to evaluate the performance of the classifiers, you will have to implement 10-fold stratified cross-validation. Your program should be able to show the algorithm's average accuracy over the 10 folds. This information will be required to complete the report.
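The averaging itself is straightforward: each fold serves once as the test set while the remaining folds form the training set. A Python sketch, where classify stands for any classifier function of your own (a hypothetical name, not part of the specification):

```python
def cross_validated_accuracy(folds, classify):
    """Average accuracy over the folds: fold i is the test set, the
    remaining folds are the training data. classify(train, attrs)
    is any classifier returning "yes" or "no"; each example is a
    list of attribute values with the class label last."""
    accuracies = []
    for i, test_fold in enumerate(folds):
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        correct = sum(1 for ex in test_fold if classify(train, ex[:-1]) == ex[-1])
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / len(folds)
```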
Your implementation of 10-fold stratified cross-validation will be tested based on your pima-folds.csv file. The information about the folds should be stored in pima-folds.csv in the following format for each fold:
- The name of the fold, fold1 to fold10.
- The contents of the fold, with each entry on a new line.
- A single blank line to separate the folds from each other.
An example of the pima-folds.csv file would look as follows (made-up data):
fold1
0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333,yes
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321,no

fold2
0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333,yes
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321,no

...

fold10
0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333,yes
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321,no
Note: The number of instances per fold should not vary by more than one. If the total number of instances is not divisible by ten, the remaining items should be distributed amongst the folds rather than being placed in one fold.
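One way to build stratified folds of near-equal size is to split the data by class and deal each class's instances round-robin across the folds, then write them out in the format above. A Python sketch: pima-folds.csv is the required filename, but everything else here (function names, the round-robin strategy) is illustrative rather than mandated.

```python
def stratified_folds(examples, n_folds=10):
    """Group examples by class (last element), then deal each class's
    instances round-robin across the folds. A shared counter keeps fold
    sizes within one of each other, and dealing per class keeps each
    fold's class proportions close to the dataset's."""
    by_class = {}
    for ex in examples:
        by_class.setdefault(ex[-1], []).append(ex)
    folds = [[] for _ in range(n_folds)]
    i = 0
    for rows in by_class.values():
        for row in rows:
            folds[i % n_folds].append(row)
            i += 1
    return folds

def write_folds(folds, path="pima-folds.csv"):
    """Write folds in the required format: a foldN name line, the fold's
    entries one per line, and a single blank line between folds."""
    with open(path, "w") as f:
        for n, fold in enumerate(folds, start=1):
            f.write(f"fold{n}\n")
            for row in fold:
                f.write(",".join(row) + "\n")
            if n < len(folds):
                f.write("\n")
```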
5. Feature selection
Correlation-based feature selection (CFS) is a method for selecting a subset of the original features (attributes). It searches for the best subset of features, where "best" is defined by a heuristic which considers how good the individual features are at predicting the class and how much they correlate with the other features. Good subsets of features contain features that are highly correlated with the class and uncorrelated with each other.
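For intuition, the heuristic CFS uses (in Hall's formulation) scores a subset of k features as k times the average feature-class correlation, divided by sqrt(k + k(k-1) times the average feature-feature correlation). Weka computes the correlations itself; the sketch below only shows why uncorrelated features win under this score.

```python
from math import sqrt

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """Merit of a k-feature subset under the CFS heuristic: reward
    correlation with the class in the numerator, penalise redundancy
    among the features in the denominator."""
    return (k * avg_feature_class_corr) / sqrt(
        k + k * (k - 1) * avg_feature_feature_corr
    )
```

For example, two features each correlated 0.5 with the class score about 0.707 when uncorrelated with each other, but only 0.5 when perfectly redundant; this is exactly the "highly correlated with the class, uncorrelated with each other" preference described above.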
Load the pima.csv file in Weka, and apply CFS to reduce the number of features. It is available from the Select attributes tab in Weka. Use Best First Search as the search method. Save the CSV file with the reduced number of attributes (this can be done in Weka) and name it pimaCFS.csv.
Warning: As before, in order to ensure Weka can understand the data, you'll need to add headers. Once you are done processing, remove the headers.
6. Input and output
Input
Your program will need to be named MyClassifier; however, it may be written in any of the languages mentioned in the Programming languages section.
Your program should take 3 command-line arguments. The first argument is the path to the training data file, the second is the path to the testing data file, and the third is the name of the algorithm to be executed (NB for Naïve Bayes and kNN for the Nearest Neighbour, where k is replaced with a number, e.g. 3NN).
For example, if you were to make a submission in Java, your main class would be MyClassifier.java, and the following are examples of possible inputs to the program:
$ java MyClassifier pima.csv examples.csv NB
$ java MyClassifier pimaCFS.csv examples.csv 4NN
The input testing data file will consist of several new examples to test your classifier on. The file will not have headers, will have one example per line, and each line will consist of a normalised value for each of the non-class attributes, separated by commas. An example input file would look as follows:
0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321
0.738,0.295,0.924,0.113,0.693,0.666,0.486,0.525
The following examples show how the program would be run for each of the submission languages, assuming we want to run the NB classifier, the training data is in a file called training.txt, and the testing data is in a file called testing.txt.

Python (version 3.5.3):
python MyClassifier.py training.txt testing.txt NB

Java (version 1.8):
javac MyClassifier.java
java MyClassifier training.txt testing.txt NB

C (gcc version 6.3.0):
gcc -lm -w -std=c99 -o MyClassifier MyClassifier.c *.c
./MyClassifier training.txt testing.txt NB

C++ (gcc version 6.3.0):
g++ -c MyClassifier.cpp *.cpp *.h
gcc -lstdc++ -lm -o MyClassifier *.o
./MyClassifier training.txt testing.txt NB

MATLAB (R2017b):
mcc -m -o MyClassifier -R -nodisplay -R -nojvm MyClassifier
./run_MyClassifier.sh

Note: MATLAB must be run this way (compiled first) to speed up the running of MATLAB submissions. The arguments are passed to your MyClassifier function as strings. For example, the example above will be executed as a function call like this:
MyClassifier('training.txt', 'testing.txt', 'NB')
Output
Your program will output to standard output (a.k.a. the console). The output should be one class value (yes or no) per line, each line representing your program's classification of the corresponding line in the input file. An example output should look as follows:
yes
no
yes
Note: These outputs are in no way related to the sample inputs given above. If you have any questions or need any clarifications about program input or output, ask a question on Piazza or ask your tutor. Since your program will be automatically tested by PASTA, it is important that you follow the instructions exactly.
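A minimal Python skeleton of the required interface might look as follows. Only the argument order and the one-label-per-line output are mandated by this section; the prediction line below is a placeholder, and parse_algorithm is an illustrative helper, not a required name.

```python
import sys

def parse_algorithm(name):
    """Return ("NB", None) for Naive Bayes, or ("NN", k) for a
    k-Nearest Neighbour argument such as "4NN"."""
    if name == "NB":
        return ("NB", None)
    return ("NN", int(name[:-2]))

def main(argv):
    training_path, testing_path, algorithm = argv[1], argv[2], argv[3]
    kind, k = parse_algorithm(algorithm)
    with open(testing_path) as f:
        rows = [line.strip().split(",") for line in f if line.strip()]
    for row in rows:
        # Placeholder: a real submission trains on training_path and
        # dispatches on (kind, k) to classify each row.
        print("yes")

if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```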
7. Weka evaluation
In Weka, select 10-fold cross-validation (it is actually 10-fold stratified cross-validation) and run the following algorithms: ZeroR, 1R, k-Nearest Neighbour (kNN; IBk in Weka), Naïve Bayes (NB), Decision Tree (DT; J48 in Weka), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM; SMO in Weka).
Compare the performance of Weka's classifiers with your k-Nearest Neighbour and Naïve Bayes classifiers. Do this for the case without feature selection (using pima.csv) and with CFS feature selection (using pimaCFS.csv).
8. Report
You will have to describe your analysis and findings in a report similar to a research paper. Your report should include 5 sections. There is no minimum or maximum length for the report; you will be marked on the quality of the content that you provide.

Aim
This section should briefly state the aim of your study and include a paragraph about why this study is important.
Data
This section should describe the dataset, mentioning the number of attributes and classes. It should also briefly describe the CFS method and list the attributes selected by the CFS.
Results and discussion
The accuracy results should be presented (as percentages, using 10-fold cross-validation) in the following tables, where My1NN, My3NN and MyNB are your implementations of the 1NN, 3NN and NB algorithms, evaluated using your stratified 10-fold cross-validation.

                     | ZeroR | 1R | 1NN | 3NN | NB | DT | MLP | SVM
No feature selection |       |    |     |     |    |    |     |
CFS                  |       |    |     |     |    |    |     |

                     | My1NN | My3NN | MyNB
No feature selection |       |       |
CFS                  |       |       |
In the discussion, compare the performance of the classifiers with and without feature selection. Compare your implementations of kNN and NB with Weka's. Discuss the effect of the feature selection: did CFS select a subset of the original features, and if so, did the selected subset make intuitive sense to you? Was feature selection beneficial, i.e. did it improve accuracy, or have any other advantages? Why do you think this is the case? Include anything else that you consider important.
Conclusion
Summarise your main findings and, if possible, suggest future work.

Reflection
Write one or two paragraphs describing the most important thing that you have learned throughout this assignment.
9. Submission Details
This assignment is to be submitted electronically via the PASTA submission system.

Individual submissions setup
The first thing you must do is create an individual group on PASTA. This is due to a limitation of PASTA. To create a group, follow the instructions below:
1. Click on the Group Management button (3 people icon), next to the submit button.
2. Click on the plus button in the bottom right to add a new group.
3. Scroll to the bottom of the list of groups and click on Join Group next to the group you just created.
4. Click on Lock Group to lock the group and stop others from joining the group (optional).
Pair submissions setup
The first thing you must do is create/join a group on PASTA. Follow the instructions below:
1. Click on the Group Management button (3 people icon), next to the submit button.
2. If your pair has not yet formed a group on PASTA, click on the plus button in the bottom right to add a new group; otherwise go to step 3.
3. Click on Join Group next to your group in the Other Existing Groups section.
4. If you wish to stop anyone from joining your group, click on Lock Group.
All submissions
Your submission should be zipped together in a single .zip file and include the following:
- The report in PDF format.
- The source code with a main program called MyClassifier. Valid extensions are .java, .py, .c, .cpp, .cc, and .m.
- Three data files: pima.csv, pimaCFS.csv and pima-folds.csv.
A valid submission might look like this:
submission.zip
|- pima.csv
|- pima-folds.csv
|- pimaCFS.csv
|- report/
|  +- report.pdf
|- MyClassifier.java
+- extrapackage/
   |- MyClass.java
   +- OtherClass.java
Upload your submission on PASTA under Assignment 2: Classification. Make sure you tick the box saying that you're submitting on behalf of your group (even if you're working individually). The submission won't work if you don't.
10. Marking criteria
- [12 marks] Code: based on the tests in PASTA; automatic marking
- [8 marks] Report:
  - [0.5 marks] Introduction
    - What is the aim of the study?
    - Why is this study (the problem) important?
  - [0.5 marks] Data well explained
    - Dataset: brief description of the dataset
    - Attribute selection: brief summary of CFS and a list of the selected attributes
  - [4 marks] Results and discussion
    - All results presented
    - Correct and deep discussion of the results
    - Effect of the feature selection: beneficial or not (accuracy, other advantages)
    - Comparison between the classifiers (accuracy, other advantages)
  - [1.5 marks] Conclusions and future work
    - Meaningful conclusions based on the results
    - Meaningful future work suggested
  - [0.5 marks] Reflection (meaningful and relevant personal reflection)
  - [1 mark] English and presentation
    - Academic style, grammatical sentences, no spelling mistakes
    - Good structure and layout; consistent formatting