Files to create
Students should create a .R file and a .doc/.docx/.pdf file. These should be named
FIT5197_Asst2_sStudentId.(R, .pdf, .doc, .docx). (Please note the underscore character, '_'.) For example, if your StudentId is 12345678, then your .R file will be called FIT5197_Asst2_s12345678.R and (if your other file is .pdf) your other file would be called FIT5197_Asst2_s12345678.pdf.
Include any program documentation describing your R code and/or your R program output, and/or any assumptions (e.g., ways of dealing with missing 'NA' or '=' data values, etc.) in this assignment .R file.
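As one illustration of documenting a missing-data assumption in the .R file, the following minimal sketch reads a small inline table and treats both 'NA' and '=' as missing. The column names and values are made up for the example, and dropping incomplete rows is only one possible choice, not a required one.

```r
# Hypothetical inline data; 'a' and 'b' are illustrative column names only.
raw <- "a,b\n1,=\nNA,3\n4,5"

# Assumption: both "NA" and "=" mark a missing value.
df <- read.csv(text = raw, na.strings = c("NA", "="))

# Count missing values per column, then keep only the complete rows.
missing_per_column <- colSums(is.na(df))
complete_rows <- df[complete.cases(df), ]

print(missing_per_column)
print(complete_rows)
```

Whatever rule is chosen (dropping rows, imputing values, and so on), stating it in a comment next to the code makes the assumption easy to check.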
If you wish to submit any additional plots, then also submit a second file (as an MS Word document or as a .pdf file), FIT5197_Asst2_sStudentId.doc or FIT5197_Asst2_sStudentId.pdf. Any further documentation and details can be included and elaborated upon there. Your .doc/.docx/.pdf file should contain a minimum of 100 words and a maximum of 2000 words. Any work done using Weka should be included in this .docx/.pdf file.
Anyone wishing to use LaTeX/LaTeX2e/pdflatex rather than .docx/.doc to generate the .pdf should first gain consent from both their lecturer and their tutor.
If you find JupyterHub to be helpful with R and RStudio, then feel free to do development work in JupyterHub, but make sure that your submitted files are in R and not .ipynb and not from JupyterHub.
Students wishing to use any package other than R (e.g., Weka or Netica) should first gain consent from both their lecturer and their tutor.
General note about the exercises and questions to follow: There may be very little guidance about which statistical tests or functions you should use to (e.g.) justify your answer(s). Ideas worth considering include (e.g.) plots, tests of correlation (2 variables at a time) or (typically more strongly) independence, missing data (recall Week 2), cross validation, (classical or other) hypothesis testing (e.g., is μ1 - μ2 = 0?, etc.), least squares, maximum likelihood, measures of model complexity, etc. That said, again, there may be very little guidance about which statistical tests or functions you should use.
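Two of the ideas in the general note, a test of correlation between two variables and a classical test of a difference in means, can be sketched in base R as follows. The data are simulated and the variable names are illustrative only. This is one possible pair of tests, not a prescribed method.

```r
set.seed(1)

# Simulated data; names and effect sizes are illustrative assumptions.
x  <- rnorm(50)
y  <- 0.8 * x + rnorm(50, sd = 0.5)  # y is constructed to correlate with x
g1 <- rnorm(40, mean = 0)
g2 <- rnorm(40, mean = 1)

# Test of correlation between two variables (H0: the correlation is zero).
ct <- cor.test(x, y)

# Welch two-sample t-test (H0: the two group means are equal).
tt <- t.test(g1, g2)

print(ct$estimate)
print(ct$p.value)
print(tt$p.value)
```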
Leading into Exercise 1 and Exercise 2, mixture modelling can be thought of as a more general form of clustering. Clusters are sometimes referred to as classes and, in mixture modelling, they are sometimes also called components.
Throughout Exercise 1 and elsewhere in your Assignment 2, make sure to show all software code used to obtain your answers. Make sure to justify your answer(s) regarding what you think is the best clustering in your .doc/.pdf file.
Exercise 1 (10 + 10 + 8 + 6 = 34 marks)
See the file 'FIT5197_2ndSem2017_Qu1' (in the given format or formats) for both Exercise 1 and Exercise 2.
1a : Cluster this data using k-means in R (or possibly an alternative software package if agreed to by both your lecturer and your tutor). Make sure to show both your R code and your output (R output, any plots, etc.).
1b : Make a case for the best value of k and the best k-means clustering that you could find. Explain any criteria that you used to make this choice, and give any relevant comparisons.
1c : Use classical hypothesis testing to determine whether all the class centroids were equal or not. Report the result(s) of your hypothesis test(s).
What do you conclude from such testing?
1d : State all the parameter values necessary to fully describe your k-means model (e.g., the value of k and everything else necessary to describe your model).
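The steps in Exercise 1 can be sketched in base R along the following lines. This is a minimal illustration on simulated two-dimensional data, not a model answer. The data, the range of k, the "elbow" heuristic, and the choice of a one-way MANOVA as the classical test are all assumptions made for the example.

```r
set.seed(42)

# Simulated stand-in for the assignment data: three groups in two dimensions.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 3, sd = 0.5))
)
colnames(x) <- c("x1", "x2")

# 1a: k-means with several random starts to reduce sensitivity to initialisation.
fit <- kmeans(x, centers = 3, nstart = 25)

# 1b: total within-cluster sum of squares for k = 1..6 (the "elbow" heuristic).
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# 1c: one-way MANOVA of the variables on the cluster label (H0: equal centroids).
cl <- factor(fit$cluster)
mv <- summary(manova(x ~ cl))

# 1d: k together with the centroid coordinates describes the fitted partition.
print(fit$centers)
print(fit$size)
print(round(wss, 1))
print(mv)
```

One caveat worth stating alongside any such test: the cluster labels were chosen to separate the same data that is then tested, so the resulting p-values tend to be optimistic.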
Exercise 2 (10 + 10 + 8 + 7 + 6 = 41 marks)
2a : Cluster this data (from 'FIT5197_2ndSem2017_Qu1') using Gaussian mixture modelling using R (or possibly an alternative software package if agreed to by both your lecturer and your tutor).
2b : Make a case for the best value of k and the best Gaussian mixture model that you could find. Explain any criteria that you used to make this choice, and give any relevant comparisons.
2c : Use classical hypothesis testing to determine whether all the class means were equal or not. Report the result(s) of your hypothesis test(s).
What do you conclude from such testing?
2d : State all the parameter values necessary to fully describe your mixture model (e.g., the value of k and everything else necessary to describe your model).
2e : Did you get the same number of classes in Question 1 as in Question 2?
Compare and contrast your results from Questions 1 and 2.
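To make the ideas in Exercise 2 concrete, here is a minimal sketch of fitting a univariate Gaussian mixture by EM in base R and comparing values of k with BIC. The helper fit_gmm is hypothetical and written only for this illustration, the data are simulated, and BIC is just one possible criterion. A dedicated mixture-modelling package would normally be preferable for real work.

```r
# fit_gmm: hypothetical helper for this sketch (not a library function).
# Fits a k-component univariate Gaussian mixture by EM with a fixed number of iterations.
fit_gmm <- function(x, k, iters = 200) {
  n <- length(x)
  # Initialise weights and means from a k-means partition.
  km <- kmeans(x, centers = k, nstart = 10)
  w  <- as.numeric(table(factor(km$cluster, levels = 1:k))) / n
  mu <- as.numeric(km$centers)
  s  <- rep(sd(x), k)
  for (i in seq_len(iters)) {
    # E-step: responsibility of each component for each point.
    dens <- matrix(sapply(1:k, function(j) w[j] * dnorm(x, mu[j], s[j])), nrow = n)
    r <- dens / rowSums(dens)
    # M-step: update weights, means and standard deviations.
    nk <- colSums(r)
    w  <- nk / n
    mu <- colSums(r * x) / nk
    s  <- sqrt(colSums(r * (x - rep(mu, each = n))^2) / nk)
    s  <- pmax(s, 1e-3)  # guard against a component collapsing onto one point
  }
  dens <- matrix(sapply(1:k, function(j) w[j] * dnorm(x, mu[j], s[j])), nrow = n)
  loglik <- sum(log(rowSums(dens)))
  npar <- 3 * k - 1  # k means + k standard deviations + (k - 1) free weights
  list(k = k, weights = w, means = mu, sds = s, loglik = loglik,
       bic = -2 * loglik + npar * log(n))
}

set.seed(123)
# Simulated data from two well-separated components (illustrative only).
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 5, sd = 1))

fits <- lapply(1:4, function(k) fit_gmm(x, k))
bic  <- sapply(fits, function(f) f$bic)

print(round(bic, 1))         # with this sign convention, lower BIC is better
print(fits[[2]]$weights)
print(fits[[2]]$means)
print(fits[[2]]$sds)
```

The weights, means and standard deviations returned for the chosen k are the kind of parameter values that 2d asks for. Note that some software reports BIC with the opposite sign, in which case larger values are better.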
Leading into Exercise 3, the University of California at Irvine (UCI) machine learning repository (at www.ics.uci.edu/~mlearn or http://archive.ics.uci.edu/ml/index.php) is a location where many data sets are stored.
Exercise 3 (10 + 8 + 7 = 25 marks)
The 'chronic kidney disease' data set is located at
http://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease
It has 400 instances (or items, or data things), each of which has 25 attributes. The main purpose of the data set is to use the first 24 attributes as explanatory variables to model the 25th attribute (the response variable) and predict its value. The model could be a 'yes/no' (ckd/notckd) model or a probabilistic model.
Exercises 3a and 3b will both ask you to build a model (e.g., logistic regression model, decision tree model [or other model with splits], something else, etc.) to use the explanatory variables to model or predict the response variable. For each of these two models (from 3a and 3b), give all the details necessary to describe the model (e.g., any intercepts and other coefficients in a regression model, any and all splits in a decision tree model, etc.). Make sure to give a clear description in your .doc/.docx/.pdf.
3a : Build a model (e.g., logistic regression model, decision tree model or other model that splits the data into groups, something else, etc.) to use the explanatory variables to model or predict the response variable. If possible, include probability estimates with your model.
3b : Build another model (e.g., logistic regression model, decision tree model, something else, etc.) to use the explanatory variables to model or predict the response variable. If possible, include probability estimates with your model. (Make sure that this model in 3b is different to your model in 3a.)
3c : Justify as best you can and as clearly as you can the model (3a or 3b) which you think is best. Further justify why you endorse the model you have chosen over other possible alternatives.
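As a hedged illustration of one of the model types mentioned for Exercise 3, the sketch below fits a logistic regression with probability estimates and scores it on held-out data. The data are simulated, and the names marker1, marker2 and class are hypothetical stand-ins rather than attributes of the chronic kidney disease data set. The 75/25 split and the 0.5 threshold are assumptions for the example.

```r
set.seed(7)
n <- 400

# Simulated stand-in data: 'marker1' and 'marker2' are hypothetical predictors,
# 'class' is a hypothetical binary response (1 = positive, 0 = negative).
d <- data.frame(marker1 = rnorm(n), marker2 = rnorm(n))
d$class <- rbinom(n, 1, plogis(-0.5 + 1.5 * d$marker1 - 1.0 * d$marker2))

# Hold out 25% of the rows for evaluation.
test_idx <- sample(n, size = n / 4)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# Logistic regression; coef(fit) gives the intercept and slopes that describe the model.
fit <- glm(class ~ marker1 + marker2, data = train, family = binomial)

# Predicted probabilities, then hard labels at a 0.5 threshold.
p_hat <- predict(fit, newdata = test, type = "response")
pred  <- as.integer(p_hat > 0.5)

# Held-out accuracy and log loss (lower log loss means better probability estimates).
accuracy <- mean(pred == test$class)
eps <- 1e-12
log_loss <- -mean(test$class * log(p_hat + eps) +
                  (1 - test$class) * log(1 - p_hat + eps))

print(coef(fit))
print(accuracy)
print(log_loss)
```

A second, different model can be scored on the same held-out rows with the same two measures, which gives one concrete basis for the comparison asked for in 3c.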
If you access material from any other studies on this data, acknowledge and cite this properly, including details of any published work. Students are reminded of Monash University's policies on academic integrity.