MAST90085 Semester 2, 2019. Assignment #2 Instructions:
The assignment contains 1 problem with 4 questions and is worth a total of 20 marks, which will count towards 10% of the final mark for the course.
You must submit electronically (instructions will be announced later) a file with all the R code used to produce the answers, so that your tutor can run your code and check that it works and that it does give the claimed result. Failure to do so will result in a significant loss of marks. All R code should be clearly written and commented. Uncommented R code is not acceptable.
A PAPER copy of your assignment must be turned in by 5pm Friday 25 October, 2019. You must complete the online plagiarism form on LMS by 5pm Friday 25 October, 2019. The assignment must be placed in the MAST90085 pigeonholes in the School of Mathematics and Statistics. Make sure you place your assignment in the right pigeonhole (there are three pigeonholes, one per tutorial, and you should use only the pigeonhole corresponding to your tutorial).
[1 mark] Your assignment should clearly show your name and student ID number, your tutor's name, and the time and day of your tutorial class. Your answers must be clearly numbered and in the same order as the assignment questions. Your answers must be easy to read (marks may be deducted for illegible handwriting). These instructions also apply to the R code.
Include all of your working in your answers; otherwise you will lose marks. Provide all R code necessary to answer the questions. All R outputs, including graphs and tables, must be accompanied by the concise and clearly written R code used to produce them. Any graph, table or R code must be accompanied by clear and concise comments.
Use concise text explanations to support your answers. Comments should be brief and concise: marks will be awarded for clarity.
Late assignments will only be accepted under exceptional circumstances with a written application for submitting late and/or a medical certificate. A late penalty may be imposed.
Your lecturer may not help you directly with assignment questions, but may provide some appropriate guidance.
Data: In the assignment you will analyse some rainfall data. The dataset is available in .txt format on the LMS web page within the Assignment menu. To load the data into R you can use the function read.table() or any command of your choice. You may need to manipulate the data format (data frames or matrices) depending on the task.
The data are separated into a training set and a test set. The training set contains p = 365 explanatory variables X1, . . . , Xp and one class membership (G = 0 or 1) for ntrain = 150 individuals. The test set contains p = 365 explanatory variables X1, . . . , Xp and one class membership (G = 0 or 1) for ntest = 41 individuals.
In these data, for each individual, X1, . . . , Xp correspond to the amount of rainfall at each of the p = 365 days in a year. Each individual in this case is a place in Australia coming either from the North (G = 0) or from the South (G = 1) of the country. Thus, the two classes (North and South) are coded by 0 and 1.
You will use the training data to fit your models or train classifiers. Once you have fitted your models or trained your classifiers with the training data, you will need to check how well the fitted models/trained classifiers work on the test data.
The training and test data are placed in two separate text files: XGtrainRain.txt, which contains the training X data (values of the p explanatory X-variables) for the ntrain = 150 individuals as well as their class (0 or 1) label, and XGtestRain.txt, which contains the test X data (values of the p explanatory X-variables) for the ntest = 41 individuals as well as their class (0 or 1) label. The test class membership is provided to you ONLY TO COMPUTE THE ERROR OF CLASSIFICATION of your classifier.
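As a minimal sketch of one way to load and split the data (assuming the files sit in the R working directory, are whitespace-separated with a header row, and that the class label is stored in a column named G; check names() on your own copy and adjust accordingly):

# Read the two text files into data frames.
datatrain <- read.table("XGtrainRain.txt", header = TRUE)
datatest  <- read.table("XGtestRain.txt",  header = TRUE)

# Separate the explanatory variables from the class membership.
Xtrain <- as.matrix(datatrain[, setdiff(names(datatrain), "G")])
Gtrain <- datatrain$G
Xtest  <- as.matrix(datatest[, setdiff(names(datatest), "G")])
Gtest  <- datatest$G

# Sanity checks against the dimensions stated above.
dim(Xtrain)    # should be 150 x 365
dim(Xtest)     # should be  41 x 365
table(Gtrain)  # class counts in the training set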
Questions [19 marks]
(a) [6 marks] Using the training data in datatrain, construct a quadratic discriminant classifier that predicts the class label (0 or 1) and which uses
1. all the p predictors of the training data Xtrain;
2. q of the p predictors selected by partial least squares (with q chosen by cross-validation for classification). Here, when considering the covariance maximisation problem of PLS, we maximise the covariance between X = (X1, . . . , Xp)T and Y = 1{G = 1}, the indicator variable that an individual belongs to group 1;
3. q of the p predictors selected by principal component analysis (with q chosen by cross-validation for classification).
Construct all three versions of the classifier, and for each, apply the classifier to the test data Xtest, and compute the resulting classification error. Compare the three methods and explain which one is the most appropriate (the answer to this point is NOT based on the misclassification error).
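A hedged R sketch of one possible workflow for the PCA-based version (a)3 is given below. It assumes the objects Xtrain, Gtrain, Xtest and Gtest created in the loading sketch above; the range of candidate q values (1 to 20) and the number of folds (5) are illustrative choices, not prescribed values. The PLS-based version (a)2 follows the same pattern once the PLS scores have been extracted (a brief extraction sketch is included at the end), while the full-p fit in (a)1 is delicate because p = 365 exceeds ntrain = 150, so the within-group covariance matrices are singular; this is worth raising in your comparison.

library(MASS)   # qda() and predict.qda()

## PCA of the training predictors; scores for training and test data.
pca <- prcomp(Xtrain)
scores_train <- pca$x
scores_test  <- predict(pca, newdata = Xtest)

## K-fold cross-validated misclassification error for a given number of components q.
cv_error <- function(q, K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(scores_train)))
  errs <- sapply(1:K, function(k) {
    fit  <- qda(scores_train[folds != k, 1:q, drop = FALSE],
                grouping = Gtrain[folds != k])
    pred <- predict(fit, scores_train[folds == k, 1:q, drop = FALSE])$class
    mean(pred != Gtrain[folds == k])
  })
  mean(errs)
}

set.seed(1)                    # make the cross-validation folds reproducible
qs <- 1:20                     # candidate values of q (illustrative range)
cv <- sapply(qs, cv_error)
q_best <- qs[which.min(cv)]

## Refit on all training data with the chosen q and compute the test error.
fit_qda   <- qda(scores_train[, 1:q_best, drop = FALSE], grouping = Gtrain)
pred_test <- predict(fit_qda, scores_test[, 1:q_best, drop = FALSE])$class
qda_pca_error <- mean(pred_test != Gtest)

## For (a)2, PLS scores maximising the covariance with Y = 1{G = 1} can be
## extracted with the pls package and then used in place of the PC scores.
library(pls)
df_pls    <- data.frame(Y = as.numeric(Gtrain == 1), X = I(Xtrain))
pls_fit   <- plsr(Y ~ X, ncomp = 20, data = df_pls)
pls_train <- scores(pls_fit)
pls_test  <- predict(pls_fit, newdata = data.frame(X = I(Xtest)), type = "scores")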
(b) [6 marks] Using the training data in datatrain, construct a logistic classifier that predicts the class label (0 or 1) and which uses
1. all the p predictors of the training data Xtrain;
2. q of the p predictors selected by partial least squares (with q chosen by cross-validation for classification). Here, when considering the covariance maximisation problem of PLS, we maximise the covariance between X = (X1, . . . , Xp)T and Y = 1{G = 1}, the indicator variable that an individual belongs to group 1;
3. q of the p predictors selected by principal component analysis (with q chosen by cross-validation for classification).
Construct all three versions of the classifier, and for each, apply the classifier to the test data Xtest, and compute the resulting classification error. Compare the three methods and explain which one is the most appropriate (the answer to this point is NOT based on the misclassification error).
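A minimal sketch for the PCA-based logistic classifier (b)3, reusing the PC scores and the q chosen in the QDA sketch above (assumed objects: scores_train, scores_test, q_best, Gtrain, Gtest); the 1/2 threshold is the usual two-class rule. The PLS version is analogous with the PLS scores, and, as in (a), fitting on all p predictors with p > ntrain gives a rank-deficient design, which is worth commenting on.

## Training and test data frames on the first q_best principal components.
df_train <- data.frame(G = Gtrain, scores_train[, 1:q_best, drop = FALSE])
df_test  <- data.frame(scores_test[, 1:q_best, drop = FALSE])

## Logistic regression of the class label on the PC scores.
## (For a full answer, q should be re-chosen by cross-validating the
## logistic classifier itself, as in the QDA sketch.)
logit_fit <- glm(G ~ ., family = binomial, data = df_train)

## Classify as group 1 whenever the estimated probability exceeds 1/2.
phat <- predict(logit_fit, newdata = df_test, type = "response")
pred <- as.numeric(phat > 0.5)
logit_pca_error <- mean(pred != Gtest)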
(c) [5 marks] Using the training data in datatrain, construct a random forest classifier using all p predictor variables of the training data Xtrain. Apply the resulting trained classifier to the test data Xtest and compute the resulting classification error. Keep the default value of m (the number of predictors considered at each split), but justify your choice of the number B of trees using the out-of-bag classification error. Also, show a graph that illustrates the importance of the Xj variables. Is there an explanation of why those particular Xj's are the most important for classification in this rainfall example? Try running your random forest multiple times. Do you always get the same classification error? If yes, why? If not, why not, and what can you do to make the forest more stable, and why?
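A hedged sketch of one way to approach (c) with the randomForest package (one common choice); it assumes the objects Xtrain, Gtrain, Xtest and Gtest from the loading sketch, and B = 1000 is only an illustrative starting value, to be justified from the out-of-bag error curve.

library(randomForest)

set.seed(1)                                   # forests are random; fix the seed
B  <- 1000                                    # candidate number of trees
rf <- randomForest(x = Xtrain, y = factor(Gtrain),
                   ntree = B, importance = TRUE)
## Passing y as a factor grows a classification forest with the default
## m = floor(sqrt(p)) predictors tried at each split.

## Out-of-bag error as a function of the number of trees; the curve should
## have flattened out well before the chosen B.
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB classification error")

## Importance of the Xj variables (days of the year).
varImpPlot(rf)

## Test-set classification error.
pred_rf <- predict(rf, newdata = Xtest)
rf_error <- mean(pred_rf != factor(Gtest))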
(d) [2 marks] Compare the percentage of misclassification for each of the classifiers considered in questions (a) to (c). Identify the classifiers that worked best and those that worked worst, and comment on those results. Provide an explanation of the poorer/better performance of some of the classifiers.
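For instance, once the test errors of the different classifiers have been stored (the object names below are those used in the earlier sketches; add the remaining variants you fit), they can be collected as percentages for comparison:

errors <- c(QDA_PCA = qda_pca_error, Logistic_PCA = logit_pca_error, RF = rf_error)
round(100 * errors, 1)   # percentage of misclassification per classifier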