STAT340 Discussion 13: Tree-based Methods
Keith Levin and Wu
Link to source file
In lecture, we saw our first example of tree-based methods: regression trees. This discussion section will give you practice building regression trees using the built-in tools in R. The notes largely follow the lab in Section 8.3.2 in ISLR.
The two most commonly used packages in R for building regression trees are tree and rpart (short for "recursive partitioning"; remember that regression trees really just partition the data space). rpart seems to be eclipsing tree in terms of popularity, so we'll use that. See here to read about the tree package if you're interested. The lab in ISLR uses the tree package, if you want a nice example of how to use it.
The documentation for rpart is here. It should be available in your R installation, but if not, you can install it in the usual way.
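In case it isn't already installed, the usual one-liner (run once in the console) looks like:
# install.packages('rpart')  # only needed if library(rpart) below fails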
library(rpart)
# See documentation with ?rpart.
The Boston Housing data set is a famous data set that contains median housing prices for approximately 500 towns in the Boston area (collected in the 1970s; these houses are a lot more expensive now). It is included in the MASS library. Let's load it, and take a few minutes to read the documentation to make sure we understand what all the variables mean.
library(MASS) # the Boston data set lives in the MASS package
data(Boston)
# See documentation with ?Boston.
We're going to try to predict the price, which is stored in Boston$medv, short for "median value": the median price of all homes in the town, in thousands of dollars.
hist( Boston$medv)
Question: looking at this histogram, should we consider a variable transformation? Why or why not? Discuss.
You'll see some analyses of this data set out there that take logarithms of these prices. We'll leave them as-is, if for no other reason than to keep our analysis roughly similar to the worked example in your textbook, but this kind of thing is always worth considering when you have price or salary data.
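If you're curious, here is a quick optional look at what a log transform would do to the response (this is just an aside, not part of the analysis below):
# Optional aside: compare the raw and log-transformed response.
# medv is strictly positive, so log() is safe here.
par(mfrow=c(1,2))
hist(Boston$medv, main='medv', xlab='median value ($1000s)')
hist(log(Boston$medv), main='log(medv)', xlab='log median value')
par(mfrow=c(1,1)) # reset the plotting layout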
Setting up the data
We'll split the data into a train set and a test set, to demonstrate the effects of overfitting. We'll set aside one fifth of the data as test data, and the rest will be training data.
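Note that sample() is random, so your split (and all of the numbers shown below) will vary from run to run. If you want a reproducible split, you can optionally fix a seed first; this line is an optional addition, not part of the original analysis:
set.seed(1) # any fixed seed works; this just makes the random split reproducible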
test_inds <- sample(1:nrow(Boston), nrow(Boston)/5);
Boston.test <- Boston[test_inds,];
Boston.train <- Boston[-test_inds,];
head(Boston.test)
## (output: the first six rows of Boston.test, columns crim through medv)
head(Boston.train)
## (output: the first six rows of Boston.train)
Fitting the tree
Fitting a regression tree with rpart is largely analogous to fitting logistic regression with glm. We specify a formula, a data set, and a method (really a loss function).
rt <- rpart( medv ~ ., data=Boston.train, method='anova')
# method='anova' tells rpart to use the squared-error loss we saw in lecture.
If we just naively print out the tree, we see a text representation of its node splits.
rt
## node), split, n, deviance, yval
##       * denotes terminal node
##  1) root 405 32961.0200 22.31309
##    2) lstat>=9.725 241  5911.4170 17.37095
##      4) lstat>=19.23 67  1280.6070 12.44776
##        8) nox>=0.603 49   508.2253 10.72245 *
##        9) nox< 0.603 18   229.4644 17.14444 *
##      5) lstat< 19.23 174  2381.5670 19.26667
##       10) lstat>=15 68   739.9024 17.02353 *
##       11) lstat< 15 106  1080.0170 20.70566 *
##    3) lstat< 9.725 164 12513.2000 29.57561
##      6) rm< 6.941 114  3578.7230 25.53596
##       12) age< 90.25 107  1642.4970 24.74019
##         24) rm< 6.531 67   619.3346 22.89104 *
##         25) rm>=6.531 40   410.3338 27.83750 *
##       13) age>=90.25 7   832.7200 37.70000 *
##      7) rm>=6.941 50  2832.5800 38.78600
##       14) rm< 7.437 28   622.8011 34.33214 *
##       15) rm>=7.437 22   947.4345 44.45455 *
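To connect the printed numbers back to the data, here is a quick optional sanity check (a sketch; your exact numbers will depend on your random split): node 2's yval should be the mean of medv among training towns with lstat >= 9.725, and its deviance the corresponding sum of squared deviations.
# Optional sanity check: recompute node 2's yval and deviance by hand.
in_node2 <- Boston.train$lstat >= 9.725
mean( Boston.train$medv[in_node2] )  # compare to the printed yval for node 2
sum( (Boston.train$medv[in_node2] - mean(Boston.train$medv[in_node2]))^2 )  # compare to its deviance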
Okay, you can probably figure out what's going on there if you stare at it long enough, but I would much rather just look at the tree itself.
plot(rt)
Okay, but where are the labels? If you think back to our discussion of dendrograms and hierarchical clustering, you'll recall that R is generally bad at handling trees. You need to tell R's plotting functions about the text labels separately.
# uniform=TRUE helps to prevent the node labels from bumping up against
# each other.
plot(rt, uniform=TRUE, main='Regression tree for Boston housing prices')
text(rt, all=TRUE, use.n = TRUE, cex=.8 )
Okay, it's still hard to read, but hopefully you get the idea by looking at this plot and the node splits printed out by R.
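If the labels are still cramped on your screen, one optional tweak is to leave some extra white space around the tree (margin is an argument of plot.rpart) and shrink the label text a little; something like the following, though the exact values here are just a suggestion:
# Optional: re-draw with extra margin and smaller label text.
plot(rt, uniform=TRUE, margin=0.05, main='Regression tree for Boston housing prices')
text(rt, all=TRUE, use.n=TRUE, cex=0.7)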
Did we overfit?
Okay, we fit a tree to the training data. Let's see how we did. First of all, let's check our MSE on the training data. We'll compute our model's predicted output on each of the training instances. The predict function works with rpart objects just like with other models you've seen this semester.
yhat_train <- predict( rt, Boston.train);
train_resids <- yhat_train - Boston.train$medv; # compute the residuals
# Now square them and take the mean to get MSE.
# Note that if we just computed the RSS (i.e., no mean, just a sum), it would be
# harder to directly compare to the test set, which is of a different size.
mean(train_resids^2)
## [1] 14.7907
Now do the test set.
Modify the code above to compute the MSE of our regression tree on the test set.
# TODO: code goes here.
Unless something really weird happened, you should see that the MSE on the test set is a lot worse than the training MSE. Looks like we might have overfit!
Pruning the tree
Lucky for us, regression trees have a natural regularization method: we can make the tree less bushy by pruning it. In essence, this serves to smooth out the piecewise constant function that our tree encodes. We do that with the prune function. Note that there are other functions out there called prune (you'll see this if you type ?prune), so you're best off specifying that you mean rpart::prune.
To prune a tree, we just pass our tree into this function and specify the complexity parameter cp. Smaller values of this number correspond to more splits, i.e., more complicated models.
So let's first make sure we understand that parameter. Every rpart object has an attribute called cptable, which shows how the number of splits (i.e., the number of internal nodes) varies with the cp parameter. The CP value in each row is the largest cp that yields that number of splits: ask for anything larger and you can't afford that many splits with your complexity budget. Note that the table also includes information about each model's error, as well as information for doing CV (see Section 4 of the rpart long-form documentation; we return to this point briefly below).
rt$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.44101794      0 1.0000000 1.0081323 0.09436808
## 2 0.18512472      1 0.5589821 0.6586532 0.06538595
## 3 0.06823948      2 0.3738573 0.4914515 0.05572899
## 4 0.03829810      3 0.3056179 0.4310189 0.05381044
## 5 0.03347910      4 0.2673198 0.4135346 0.05543222
## 6 0.01859253      5 0.2338407 0.3944587 0.05543053
## 7 0.01703975      6 0.2152481 0.3751225 0.05330331
## 8 0.01647150      7 0.1982084 0.3727647 0.05331989
## 9 0.01000000      8 0.1817369 0.3461705 0.05294787
As cp gets smaller, the number of splits (nsplit) increases. If we ask for a tree with a particular cp value, rpart will go to our tree's cptable and pick out the tree with the largest number of splits whose CP value is no smaller than the cp we asked for (i.e., the most complicated tree we can afford at that cp). So, for example, if we ask for cp=0.4, we just get back the tree with a single split (i.e., a root node with two children):
plot( rpart::prune(rt, cp=0.4) )
If we set cp=0.03, we get back 4 splits:
plot( rpart::prune(rt, cp=0.03) )
Looking at that tree, you can see that there are indeed four splits (i.e., four internal nodes: the root, the one on the right, and the two to the left of the root).
Now, we can use this to do CV just like we've seen previously (if we had K folds instead of just a train-test split), or we could use the one-standard-error (1-SE) rule, about which see the end of Section 6.1 in your textbook or refer to Section 4 of the rpart documentation.
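For reference, here is one way you might implement the 1-SE rule directly from the cptable; treat it as a sketch of the idea rather than a recipe we rely on below (rpart's plotcp function draws the corresponding picture):
# Sketch: pick the simplest tree whose CV error is within one SE of the best.
cpt <- rt$cptable
best <- which.min(cpt[, 'xerror'])                 # row with the smallest CV error
thresh <- cpt[best, 'xerror'] + cpt[best, 'xstd']  # that error plus one standard error
simplest <- min(which(cpt[, 'xerror'] <= thresh))  # first (i.e., smallest) tree under the threshold
rt.1se <- rpart::prune(rt, cp=cpt[simplest, 'CP'])
plotcp(rt) # rpart's built-in plot of CV error against cp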
Just for the sake of illustration, though, let's just try something reasonable and use the 4-split tree.
Modify the code above to measure the MSE of this pruned tree on the training and test data. Assess the presence or absence of over-fitting. Is the overfitting at least less bad than it was for the tree we started with?
# TODO: code goes here.
Do the same with the 5-split tree and compare it to the 4-split tree.
# TODO: code goes here.
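If you find yourself repeating the same MSE computation for several trees in the exercises above, a small helper function keeps things tidy. The function below is just a suggested sketch (mse_rpart is a name made up here, not part of rpart), assuming the rt, Boston.train and Boston.test objects defined earlier:
# Hypothetical helper: MSE of a fitted rpart model on a given data frame.
mse_rpart <- function(model, data, response='medv') {
  preds <- predict(model, data)
  mean( (preds - data[[response]])^2 )
}
# Example usage:
# mse_rpart(rt, Boston.test)
# mse_rpart(rpart::prune(rt, cp=0.03), Boston.test)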