COMP3308 Introduction to Artificial Intelligence, Semester 1, 2018

Assignment 2: Classification

Deadlines
Submission: 5pm, Friday 18th May, 2018 (week 10)

This assignment is worth 20% of your final mark.

Task description
In this assignment you will implement the K-Nearest Neighbour and Naïve Bayes algorithms and evaluate them on a real dataset using the stratified cross-validation method. You will also evaluate the performance of other classifiers on the same dataset using Weka. Finally, you will investigate the effect of feature selection, in particular the Correlation-based Feature Selection method (CFS) from Weka.

Late submissions policy
No late submissions are allowed.

Programming languages
Your implementation can be written in Python, Java, C, C++ or MATLAB. The assignment will be tested on the University machines, so your code must be compatible with the language version installed on those machines. You are not allowed to use any of the built-in classification libraries for the purposes of this assignment.

Submission and pair work
Your assignment can be completed individually or in pairs. See the submission details section for more information about how to submit.

This assignment will be submitted using the submission system PASTA (https://comp3308.it.usyd.edu.au/PASTA/). In order to connect to the website, you'll need to be connected to the university VPN. You can read this page to find out how to connect to the VPN. PASTA will allow you to make as many submissions as you wish, and each submission will provide you with feedback on each of the components of the assignment. Your last submission before the assignment deadline will be marked, and the mark displayed on PASTA will be the final mark for your code (12 marks).

1. Data
The dataset for this assignment is the Pima Indian Diabetes dataset. It contains 768 instances described by 8 numeric attributes. There are two classes, "yes" and "no". Each entry in the dataset corresponds to a patient's record; the attributes are personal characteristics and test measurements; the class shows if the person shows signs of diabetes or not. The patients are from Pima Indian heritage, hence the name of the dataset.

A copy of the dataset can be downloaded from Canvas. There are 2 files associated with the dataset. The first file, *.names, describes the data, including the number and the type of the attributes and classes, as well as their meaning. The second file, *.data, contains the data itself. Your task is to predict the class, where the class can be yes or no.

Note: The original dataset can be sourced from UCI Machine Learning Repository. However, you need to use the dataset available on Canvas as it has been modified for consistency.

2. Data preprocessing
Read the pima-indians-diabetes.names file and learn more about the meaning of the attributes and the classes. Use Weka's in-built normalisation filter to normalise the values of each attribute to make sure they are in the range [0,1]. The normalisation should be done along each column (attribute), not each row (entry). The class attribute is not normalised; it should remain unchanged. Save the preprocessed file as pima.csv.

Warning: In order to ensure that Weka can process the data, you will need to add headers to the data file and save it as a .csv file. You can do this in any text editor. The headers should be removed after preprocessing.
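For reference, min-max normalisation rescales each column independently as (x - min) / (max - min). The following is a minimal Python sketch of that calculation only; the assignment itself asks for Weka's filter to be used. The function name `normalise_columns`, the row layout (attribute values followed by the class label) and the handling of constant columns are all assumptions made for illustration.

```python
def normalise_columns(rows):
    """Min-max scale each attribute column to [0, 1].

    rows: list of rows, each a list of floats followed by a class label.
    The class label (last item) is copied through unchanged.
    """
    n_attrs = len(rows[0]) - 1
    lows = [min(r[i] for r in rows) for i in range(n_attrs)]
    highs = [max(r[i] for r in rows) for i in range(n_attrs)]
    scaled = []
    for r in rows:
        values = []
        for i in range(n_attrs):
            span = highs[i] - lows[i]
            # Assumption: a constant column has no spread, so map it to 0.0
            # rather than dividing by zero.
            values.append((r[i] - lows[i]) / span if span else 0.0)
        scaled.append(values + [r[-1]])
    return scaled
```

Note that the minimum and maximum are taken per column, which is what "along each column, not each row" means above.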

3. Classification algorithms

K-Nearest Neighbour
The K-Nearest Neighbour algorithm should be implemented for any K value and should use Euclidean distance as the distance measure. If there is ever a tie between the two classes, choose class yes.
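The rule above can be sketched in Python as follows. This is one possible reading rather than a reference solution: the names `euclidean` and `classify_knn` and the `(attributes, label)` data layout are hypothetical, and the handout does not say how to break ties in distance, so the sketch simply keeps the order produced by a stable sort.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_knn(training, example, k):
    """training: list of (attributes, label) pairs; example: attribute list."""
    # Sort all training items by distance to the example and keep the k closest.
    nearest = sorted(training, key=lambda item: euclidean(item[0], example))[:k]
    votes = Counter(label for _, label in nearest)
    # A tie between the two classes goes to "yes", as the specification requires.
    return "yes" if votes["yes"] >= votes["no"] else "no"
```

With an even K, a 2-2 split among four neighbours would therefore be classified as yes.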

Naïve Bayes
The Naïve Bayes algorithm should be implemented for numeric attributes, using a probability density function. Assume a normal distribution, i.e. use the probability density function for a normal distribution. As before, if there is ever a tie between the two classes, choose class yes.
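A minimal Python sketch of Naïve Bayes with a normal density per attribute is shown below. Several details are assumptions the handout does not fix: the function names, the `(attributes, label)` layout, the use of the sample standard deviation (dividing by n - 1), and the expectation that every class has at least two rows and no attribute has zero variance within a class.

```python
import math
from collections import defaultdict

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def train_nb(training):
    """Return {label: (prior, [(mean, std), ...])} from (attributes, label) pairs."""
    by_class = defaultdict(list)
    for attrs, label in training:
        by_class[label].append(attrs)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Assumption: sample variance (n - 1); needs at least two rows per class.
            var = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
            stats.append((mean, math.sqrt(var)))
        model[label] = (len(rows) / len(training), stats)
    return model

def classify_nb(model, example):
    """Multiply the class prior by one density per attribute; ties go to "yes"."""
    scores = {}
    for label in ("yes", "no"):
        prior, stats = model[label]
        score = prior
        for x, (mean, std) in zip(example, stats):
            score *= gaussian_pdf(x, mean, std)
        scores[label] = score
    return "yes" if scores["yes"] >= scores["no"] else "no"
```

For data with many attributes, summing log-densities instead of multiplying densities is a common way to avoid underflow; the product form is kept here because it mirrors the description directly.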

Note: Carefully read section 6 to find out how your program will be expected to receive input and give output.

4. 10-fold stratified cross-validation
In order to evaluate the performance of the classifiers, you will have to implement 10-fold stratified cross-validation. Your program should be able to show the algorithm's average accuracy over the 10 folds. This information will be required to complete the report.
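The averaging step can be sketched as below, assuming the folds have already been built. The name `cross_validate` and the `classify(training, attributes)` callable are hypothetical; any classifier with that shape could be plugged in.

```python
def cross_validate(folds, classify):
    """Return the mean accuracy over all folds.

    folds: list of folds, each a list of (attributes, label) pairs.
    classify: callable taking (training, attributes) and returning a label.
    """
    accuracies = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one currently held out for testing.
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        correct = sum(1 for attrs, label in test_fold
                      if classify(training, attrs) == label)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / len(accuracies)
```

This averages the per-fold accuracies, so when fold sizes differ by one the result can differ very slightly from the overall proportion of correct predictions.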

Your implementation of 10-fold stratified cross-validation will be tested based on your pima-folds.csv file. The information about the folds should be stored in pima-folds.csv in the following format for each fold:

Name of the fold, fold1 to fold10.
Contents of the fold, with each entry on a new line.
A single blank line to separate the folds from each other.

An example of the pima-folds.csv file would look as follows (made-up data):

fold1
0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333,yes
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321,no

fold2
0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333,yes
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321,no

fold10
0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333,yes
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321,no

Note: The number of instances per fold should not vary by more than one. If the total number of instances is not divisible by ten, the remaining items should be distributed amongst the folds rather than being placed in one fold.
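One way to satisfy both the stratification and the size constraint is to deal the instances of each class into the folds in round-robin order, carrying the position over from one class to the next. The Python sketch below illustrates this; the names `stratified_folds` and `format_folds` are hypothetical, and rows are assumed to be lists of attribute values ending in the class label.

```python
def stratified_folds(rows, n_folds=10):
    """Split rows into n_folds lists with near-equal class proportions."""
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in ("yes", "no"):
        for row in (r for r in rows if r[-1] == label):
            # The position carries over between classes, so the total fold
            # sizes differ by at most one, as do the per-class counts.
            folds[position % n_folds].append(row)
            position += 1
    return folds

def format_folds(folds):
    """Render folds as text: fold name, one entry per line, blank line between folds."""
    blocks = []
    for i, fold in enumerate(folds, start=1):
        lines = ["fold{}".format(i)]
        lines += [",".join(str(v) for v in row) for row in fold]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

Writing the returned text to pima-folds.csv would then produce the layout described above. `str.format` is used rather than f-strings so the sketch also runs on Python 3.5.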

5. Feature selection
Correlation-based feature selection (CFS) is a method for selecting a subset of the original features (attributes). It searches for the best subset of features, where "best" is defined by a heuristic which considers how good the individual features are at predicting the class and how much they correlate with the other features. Good subsets of features contain features that are highly correlated with the class and uncorrelated with each other.
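The heuristic is commonly written as merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the number of features in the subset, r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. This formula is not given in the handout; it comes from the usual formulation of CFS. The Python sketch below evaluates it using absolute Pearson correlation with the class encoded as 0/1, which is an assumption made for illustration: Weka's implementation may measure correlation differently, so its numbers need not match. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def cfs_merit(columns, labels):
    """Merit of a candidate subset.

    columns: list of feature columns in the subset; labels: class values as 0/1.
    """
    k = len(columns)
    # Mean absolute correlation between each feature and the class.
    r_cf = sum(abs(pearson(c, labels)) for c in columns) / k
    # Mean absolute correlation between every pair of features in the subset.
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if pairs:
        r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    else:
        r_ff = 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

Under this formula, adding an exact copy of a feature leaves the merit unchanged, and adding a feature that is uncorrelated with the class lowers it, which matches the description of a good subset above.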

Load the pima.csv file in Weka, and apply CFS to reduce the number of features. It is available from the "Select attributes" tab in Weka. Use "Best First Search" as the search method. Save the CSV file with the reduced number of attributes (this can be done in Weka) and name it pima-CFS.csv.

Warning: As before, in order to ensure Weka can understand the data, you'll need to add headers. Once you are done processing, remove the headers.

6. Input and output

Input
Your program will need to be named MyClassifier, but it may be written in any of the languages mentioned in the "Programming languages" section.

Your program should take 3 command-line arguments. The first argument is the path to the training data file, the second is the path to the testing data file, and the third is the name of the algorithm to be executed (NB for Naïve Bayes and kNN for the Nearest Neighbour, where k is replaced with a number; e.g. 3NN).

For example, if you were to make a submission in Java, your main class would be MyClassifier.java, and the following are examples of possible inputs to the program:

$ java MyClassifier pima.csv examples.csv NB
$ java MyClassifier pima-CFS.csv examples.csv 4NN

The input testing data file will consist of several new examples to test your classifier on. The file will not have headers, will have one example per line, and each line will consist of a normalised value for each of the non-class attributes separated by commas. An example input file would look as follows:

0.588,0.628,0.574,0.263,0.136,0.463,0.054,0.333
0.243,0.274,0.224,0.894,0.113,0.168,0.735,0.321
0.738,0.295,0.924,0.113,0.693,0.666,0.486,0.525
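The algorithm-name argument and the two file layouts described above could be handled along the following lines in Python. This is a sketch under assumptions: the function names are hypothetical, the training file is assumed to have the class label as its last column, and blank lines are skipped.

```python
def parse_algorithm(name):
    """Return ("NB", None) for "NB", or ("KNN", k) for names such as "3NN"."""
    if name == "NB":
        return ("NB", None)
    if name.endswith("NN") and name[:-2].isdigit():
        return ("KNN", int(name[:-2]))
    raise ValueError("unknown algorithm: " + name)

def load_training(path):
    """Read lines of comma-separated attribute values followed by a class label."""
    rows = []
    with open(path) as handle:
        for line in handle:
            parts = line.strip().split(",")
            if len(parts) > 1:
                rows.append(([float(v) for v in parts[:-1]], parts[-1]))
    return rows

def load_testing(path):
    """Read lines of comma-separated attribute values with no class label."""
    with open(path) as handle:
        return [[float(v) for v in line.strip().split(",")]
                for line in handle if line.strip()]
```

In a complete program these would be called with `sys.argv[1]`, `sys.argv[2]` and `sys.argv[3]` respectively, and one predicted class per test row would then be printed to standard output.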

The following examples show how the program would be run for each of the submission languages, assuming we want to run the NB classifier, the training data is in a file called training.txt, and the testing data is in a file called testing.txt.

Python (version 3.5.3):

python MyClassifier.py training.txt testing.txt NB

Java (version 1.8):

javac MyClassifier.java
java MyClassifier training.txt testing.txt NB

C (gcc version 6.3.0):

gcc -lm -w -std=c99 -o MyClassifier MyClassifier.c *.c
./MyClassifier training.txt testing.txt NB

C++ (gcc version 6.3.0):

g++ -c MyClassifier.cpp *.cpp *.h
gcc -lstdc++ -lm -o MyClassifier *.o
./MyClassifier training.txt testing.txt NB

MATLAB (R2017b):

mcc -m -o MyClassifier -R -nodisplay -R -nojvm MyClassifier
./run_MyClassifier.sh training.txt testing.txt NB

Note: MATLAB must be run this way (compiled first) to speed up MATLAB running submissions. The arguments are passed to your MyClassifier function as strings. For example, the example above will be executed as a function call like this:

MyClassifier('training.txt', 'testing.txt', 'NB')

Output
Your program will output to standard output (a.k.a. "the console"). The output should be one class value (yes or no) per line, each line representing your program's classification of the corresponding line in the input file. An example output should look as follows:

yes
no
yes

Note: These outputs are in no way related to the sample inputs given above. If you have any questions or need any clarifications about program input or output, ask a question on Piazza or ask your tutor.

Since your program will be automatically tested by PASTA, it is important that you follow the instructions exactly.

7. Weka evaluation
In Weka select 10-fold cross-validation (it is actually 10-fold stratified cross-validation) and run the following algorithms: ZeroR, 1R, k-Nearest Neighbor (kNN; IBk in Weka), Naïve Bayes (NB), Decision Tree (DT; J48 in Weka), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM; SMO in Weka).

Compare the performance of Weka's classifiers with your k-Nearest Neighbor and Naïve Bayes classifiers. Do this for the case without feature selection (using pima.csv) and with CFS feature selection (using pima-CFS.csv).

8. Report
You will have to describe your analysis and findings in a report similar to a research paper. Your report should include 5 sections. There is no minimum or maximum length for the report; you will be marked on the quality of the content that you provide.

Aim
This section should briefly state the aim of your study and include a paragraph about why this study is important.

Data
This section should describe the dataset, mentioning the number of attributes and classes. It should also briefly describe the CFS method and list the attributes selected by the CFS.

Results and discussion
The accuracy results should be presented (in percentage, using 10-fold cross-validation) in the following table where My1NN, My3NN and MyNB are your implementations of the 1NN, 3NN and NB algorithms, evaluated using your stratified 10-fold cross-validation.

                      ZeroR  1R  1NN  3NN  NB  DT  MLP  SVM
No feature selection
CFS

                      My1NN  My3NN  MyNB
No feature selection
CFS

In the discussion, compare the performance of the classifiers, with and without feature selection. Compare your implementations of kNN and NB with Weka's. Discuss the effect of the feature selection: did CFS select a subset of the original features, and if so, did the selected subset make intuitive sense to you? Was feature selection beneficial, i.e. did it improve accuracy, or have any other advantages? Why do you think this is the case? Include anything else that you consider important.

Conclusion
Summarise your main findings and, if possible, suggest future work.

Reflection
Write one or two paragraphs describing the most important thing that you have learned throughout this assignment.

9. Submission Details
This assignment is to be submitted electronically via the PASTA submission system.

Individual submissions setup
The first thing you must do is create an individual group on PASTA. This is due to a limitation of PASTA. To create a group, follow the instructions below:

1. Click on the "Group Management" button (3 people icon), next to the submit button.
2. Click on the plus button in the bottom right to add a new group.
3. Scroll to the bottom of the list of groups and click on "Join Group" next to the group you just created.
4. Click on "Lock Group" to lock the group and stop others from joining the group (optional).

Pair submissions setup
The first thing you must do is create/join a group on PASTA. Follow the instructions below:

1. Click on the "Group Management" button (3 people icon), next to the submit button.
2. If your pair has not yet formed a group on PASTA, click on the plus button in the bottom right to add a new group, otherwise go to step 3.
3. Click on "Join Group" next to your group in the "Other Existing Groups" section.
4. If you wish to stop anyone from joining your group, click on "Lock Group".

All submissions
Your submission should be zipped together in a single .zip file and include the following:

The report in PDF format.
The source code with a main program called MyClassifier. Valid extensions are .java, .py, .c, .cpp, .cc, and .m.
Three data files: pima.csv, pima-CFS.csv and pima-folds.csv.

A valid submission might look like this:

Upload your submission on PASTA under Assignment 2 - Classification. Make sure you tick the box saying that you're submitting on behalf of your group (even if you're working individually). The submission won't work if you don't.

10. Marking criteria
[12 marks] Code based on the tests in PASTA; automatic marking

[8 marks] Report:

[0.5 marks] Introduction
    What is the aim of the study?
    Why is this study (the problem) important?

[0.5 marks] Data well explained
    Dataset: brief description of the dataset
    Attribute selection: brief summary of CFS and a list of the selected attributes

[4 marks] Results and discussion
    All results presented
    Correct and deep discussion of the results
    Effect of the feature selection: beneficial or not (accuracy, other advantages)
    Comparison between the classifiers (accuracy, other advantages)

[1.5 marks] Conclusions and future work
    Meaningful conclusions based on the results
    Meaningful future work suggested

[0.5 marks] Reflection (meaningful and relevant personal reflection)

[1 mark] English and presentation
    Academic style, grammatical sentences, no spelling mistakes
    Good structure and layout; consistent formatting
