CMDA-4654 HW9
Due May 5th, 2019 by 11:59pm on Canvas
Instructions
You must use R Markdown to do the assignment. Failure to compile your document results in a zero. You must turn in both your .Rmd file and your .pdf in the format Lastname_Firstname_HW8.pdf; please put both together in a .zip file with the same naming convention.
Coding Section (40 Points)
Here you will code your own function that can carry out classification using a tree.
Consider the data in Tibetan_Skulls.csv, which contains measurements on two types of skulls.
skull.dat <- read.csv("~/Dropbox/teaching/SP_2019/CMDA_4654/Homework/HW9/data/Tibetan_Skulls.csv")

# Initial fit
library(tree)
skull.tree <- tree(Type ~ ., data = skull.dat)

If we look at skull.tree we will see the node id number, the split rule, the number of observations in the corresponding node, the deviance, the classification decision (yval = class), and the probabilities yprob = $\widehat{\pi}$ for each category (the classification decision is the one that has the highest probability).

skull.tree

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 32 44.24 ST1 ( 0.5312 0.4688 )
  2) Length < 180.75 19 19.56 ST1 ( 0.7895 0.2105 )
    4) Fheight < 71.25 11  0.00 ST1 ( 1.0000 0.0000 ) *
    5) Fheight > 71.25  8 11.09 ST1 ( 0.5000 0.5000 ) *
  3) Length > 180.75 13 11.16 ST2 ( 0.1538 0.8462 )
    6) Breadth < 141.75  8  0.00 ST2 ( 0.0000 1.0000 ) *
    7) Breadth > 141.75  5  6.73 ST2 ( 0.4000 0.6000 ) *
We can also look at a summary which tells us some useful information.
summary(skull.tree)
Classification tree:
tree(formula = Type ~ ., data = skull.dat)
Variables actually used in tree construction:
[1] "Length"  "Fheight" "Breadth"
Number of terminal nodes:  4
Residual mean deviance:  0.6364 = 17.82 / 28
Misclassification error rate: 0.1875 = 6 / 32
We can plot our tree to see the corresponding information.
# plot tree
plot(skull.tree, lwd = 2)
text(skull.tree, cex = 1.5)
Clearly we need to prune this tree. Let's do that.
# Snipping the tree: remove the nodes/leaves below node 2
skull.snip <- snip.tree(skull.tree, nodes = 2)
summary(skull.snip)

Classification tree:
snip.tree(tree = skull.tree, nodes = 2L)
Variables actually used in tree construction:
[1] "Length"  "Breadth"
Number of terminal nodes:  3
Residual mean deviance:  0.9064 = 26.29 / 29
Misclassification error rate: 0.1875 = 6 / 32

# plot tree
plot(skull.snip, lwd = 2)
text(skull.snip, cex = 1.5)

We should remove the other unnecessary node too.

# Snipping the tree: remove the nodes/leaves below nodes 2 and 3 of the large tree
skull.snip2 <- snip.tree(skull.tree, c(2, 3))
summary(skull.snip2)

Classification tree:
snip.tree(tree = skull.tree, nodes = 2:3)
Variables actually used in tree construction:
[1] "Length"
Number of terminal nodes:  2
Residual mean deviance:  1.024 = 30.72 / 30
Misclassification error rate: 0.1875 = 6 / 32

plot(skull.snip2, lwd = 2)
text(skull.snip2, cex = 1.5)

Let's form a confusion matrix for the full tree:

predict_frame <- data.frame(skull.dat, prediction = predict(skull.tree, skull.dat, type = "class"))
table(predict_frame$Type, predict_frame$prediction)

      ST1 ST2
  ST1  12   5
  ST2   2  13

Let's form a confusion matrix for the final trimmed tree:

predict_frame <- data.frame(skull.dat, prediction = predict(skull.snip2, skull.dat, type = "class"))
print( myconfusion_mat <- table(predict_frame$Type, predict_frame$prediction) )

      ST1 ST2
  ST1  15   2
  ST2   4  11

Let's classify a skull based upon new observations.

# A new skull was found
new.skull <- list(Length = 179, Breadth = 135, Fheight = 73, Fbreadth = 132, Height = 135)
# What type of skull is it likely to be?
predict(skull.snip2, new.skull, type = "class")

[1] ST1
Levels: ST1 ST2

Let's compute the deviance, which is given by
$$ -2 \sum_{i} \ln\left(\widehat{\pi}_{y_i}\right) $$
where
$$ \widehat{\pi}_{y_i} = \dfrac{1}{n_{\ell(x_i)}} \left( \sum_{y_k \in \ell(x_i)} \mathbb{I}(y_k = y_i) \right) $$
and $n_{\ell(x_i)}$ is the number of elements in the leaf containing $x_i$.

Let's see where this comes from. There are two categories, ST1 and ST2. We can estimate the probability of each category using the training data.

skull.snip2

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 32 44.24 ST1 ( 0.5312 0.4688 )
  2) Length < 180.75 19 19.56 ST1 ( 0.7895 0.2105 ) *
  3) Length > 180.75 13 11.16 ST2 ( 0.1538 0.8462 ) *
# Decision regions
left_leaf_index <- which(skull.dat$Length <= 180.75)
right_leaf_index <- which(skull.dat$Length > 180.75)
print( actual_classes_left <- table(skull.dat$Type[left_leaf_index]) )

ST1 ST2
 15   4

print( phat_left <- prop.table(actual_classes_left) )

      ST1       ST2
0.7894737 0.2105263

print( actual_classes_right <- table(skull.dat$Type[right_leaf_index]) )

ST1 ST2
  2  11

print( phat_right <- prop.table(actual_classes_right) )

      ST1       ST2
0.1538462 0.8461538

We already had this information in our confusion matrix.

myconfusion_mat

      ST1 ST2
  ST1  15   2
  ST2   4  11

# We can get the phats the following way:
print( myconfusion_mat_props <- prop.table( myconfusion_mat, margin = 2 ) )

            ST1       ST2
  ST1 0.7894737 0.1538462
  ST2 0.2105263 0.8461538

# Deviance
-2 * (15*log(0.7895) + 4*log(0.2105) + 2*log(0.1538) + 11*log(0.8462))

[1] 30.71922

# or more efficiently
-2*sum(myconfusion_mat * log(myconfusion_mat_props))

[1] 30.71922

# Compare with the summary()
summary(skull.snip2)$dev

[1] 30.71922

The deviance in the left split:

-2*(15*log(0.7895) + 4*log(0.2105))

[1] 19.55682

The deviance in the right split:

-2*(2*log(0.1538) + 11*log(0.8462))

[1] 11.1624

skull.snip2

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 32 44.24 ST1 ( 0.5312 0.4688 )
  2) Length < 180.75 19 19.56 ST1 ( 0.7895 0.2105 ) *
  3) Length > 180.75 13 11.16 ST2 ( 0.1538 0.8462 ) *
print( skull.snip2$frame )
     var  n      dev yval splits.cutleft splits.cutright yprob.ST1 yprob.ST2
1 Length 32 44.23634  ST1        <180.75         >180.75 0.5312500 0.4687500
2 <leaf> 19 19.55682  ST1                                0.7894737 0.2105263
3 <leaf> 13 11.16240  ST2                                0.1538462 0.8461538
BTW, the null deviance (no splits made) is
-2*sum(table(skull.dat$Type)*log(prop.table(table(skull.dat$Type))))
[1] 44.23634
skull.snip2$frame$dev[1]
[1] 44.23634
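For convenience, the per-leaf calculation above can be wrapped in a small helper function. The sketch below is only illustrative; leaf_deviance is our own name, not a function in the tree package, and it simply repeats the arithmetic we did by hand.

leaf_deviance <- function(labels) {
  counts <- table(labels)
  counts <- counts[counts > 0]                 # drop empty classes to avoid log(0)
  -2 * sum(counts * log(counts / sum(counts)))
}

# Null deviance: all 32 skulls in a single leaf
leaf_deviance(skull.dat$Type)                  # 44.23634
# Deviance of skull.snip2: the sum over its two leaves
leaf_deviance(skull.dat$Type[left_leaf_index]) +
  leaf_deviance(skull.dat$Type[right_leaf_index])   # 30.71922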
Problem 1: Efficiently determine splits in the variables that reduce our deviance
a. (25 of 40 pts) Write a basic function that determines where the optimal split is in a single variable, that is, the split that most reduces the overall deviance. You can call your function my_basic_tree_splitter. Demonstrate that it works on the petal length variable of the iris dataset by itself, and again on the petal width by itself.
If you succeed, you'll have split/partitioned exactly once. The method is recursive partitioning, which means we need to further partition until a stopping criterion is reached: either a minimum within-node deviance is attained, or the leaves contain fewer than a specified minimum number of observations. A minimal sketch of a single deviance-based split follows.
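Here is one way such a single split could be found: scan one numeric variable for the cut point that minimizes the total two-leaf deviance. This is only an illustrative sketch; the names split_deviance_scan and node_dev are ours, and this is not the required my_basic_tree_splitter interface.

# Illustrative sketch only: scan one variable for the deviance-minimizing split.
split_deviance_scan <- function(y, x) {
  # deviance of one leaf: -2 * sum over classes of n_k * log(n_k / n)
  node_dev <- function(labels) {
    counts <- table(labels)
    counts <- counts[counts > 0]              # drop empty classes to avoid log(0)
    -2 * sum(counts * log(counts / sum(counts)))
  }
  xs   <- sort(unique(x))
  cuts <- (xs[-1] + xs[-length(xs)]) / 2      # candidate cuts: midpoints between values
  devs <- sapply(cuts, function(cc) node_dev(y[x < cc]) + node_dev(y[x >= cc]))
  list(cut = cuts[which.min(devs)], total_deviance = min(devs))
}

# For example, on petal length alone (three classes, which node_dev handles):
# split_deviance_scan(iris$Species, iris$Petal.Length)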
b. (15 of 40 pts) Make a new function that carries out the recursive partitioning steps until minsize = 10. Use your function on the iris dataset using only the petal length and petal width. You can check your final answer against the built-in function. You are not permitted to simply call tree() or rpart() or other functions from other libraries within your function.
Your function must be called exactly as I have it below.
mytreefcn <- function(Y, Xmat, minsize = 10) {
  return( "output_table" = data.frame(output_table) )
}

Include both of your R functions inside a single file called Lastname_Firstname_treeHW8.R, zipped up within your .zip file.

Applied Section (60 points, 20 points each)

For the problems below, you DO NOT have to use the function you wrote. If the {tree} library is not working for you, you can instead use the {rpart} library and do equivalent operations (note some functions will be different, so you'll have to google a bit). A generic skeleton for the train/test workflow is sketched after Problem 4.

Problem 2

Consider the Carseats dataset from the ISLR library. Split the dataset into two portions: a random 60% will be the training dataset, and the remaining 40% will be the testing data.

a. Use tree() to fit a regression tree to the training set. Plot the tree and interpret the results. What test MSE do you obtain?
b. Use cross-validation to determine the optimal level of tree complexity. Plot your pruned tree. Does pruning the tree improve the test MSE?
c. Use random forests to analyze this data; the default options are fine. What test MSE do you obtain? Use the importance() and varImpPlot() functions to determine which variables are most important.

Problem 3

Fit and compare different classification models on the Caravan dataset from the ISLR library. The response variable is Purchase. Use the first 1000 observations as the training set and the remaining observations as the test set.

a. Fit random forests to this dataset. Try at least ten different parameter settings and compare the results on the test set. You may vary the size of the trees, the number of variables sampled at each node, or the number of trees. Comment on the results.
b. Fit a logistic regression to this dataset. Evaluate the model on the test set and compare to the random forest.

Problem 4

Consider the data in HW8_data_prob4.csv.

a. It's clear that this data is ill-suited for regression. Go ahead and fit three polynomial models, from degree = 1 to 3.
b. Use a regression tree to fit the data.
c. Plot the three fits from (a) and the fit from (b) (the function partition.tree() will be useful here).
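As a reminder of the general pattern the applied problems share, here is a skeleton of a 60/40 split, a tree fit, and a test MSE. It is only a sketch, not a solution: the seed and the object names (train_idx, train_dat, fit, and so on) are illustrative choices, not requirements.

# Illustrative skeleton only: train/test split, regression tree, test MSE.
library(ISLR)
library(tree)

set.seed(1)                                    # any seed; just makes the split reproducible
train_idx <- sample(nrow(Carseats), size = round(0.60 * nrow(Carseats)))
train_dat <- Carseats[train_idx, ]
test_dat  <- Carseats[-train_idx, ]

fit  <- tree(Sales ~ ., data = train_dat)      # regression tree on the training portion
pred <- predict(fit, newdata = test_dat)
mean((test_dat$Sales - pred)^2)                # test MSE

# cv.tree(fit) can then guide pruning (prune.tree() with the chosen size), and
# randomForest::randomForest() follows the same fit/predict pattern.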