STAT340 Discussion 13: Tree-based Methods


Keith Levin and Wu

Link to source file

In lecture, we saw our first example of tree-based methods: regression trees. This discussion section will give you practice building regression trees using the built-in tools in R. The notes largely follow the lab in Section 8.3.2 in ISLR.

The two most commonly used packages in R for building regression trees are tree and rpart (short for “recursive partitioning”– remember that regression trees really just partition the data space). rpart seems to be eclipsing tree in terms of popularity, so we’ll use that. See here to read about the tree package if you’re interested. The lab in ISLR uses the tree package, if you want a nice example of how to use it.

The documentation for rpart is here. It should be available in your R installation, but if not, you can install it in the usual way.

library(rpart)
# See documentation with ?rpart.

The Boston Housing data set is a famous data set containing median housing prices for approximately 500 towns in the Boston area (collected in the 70s– these houses are a lot more expensive now). It is included in the MASS library. Let’s load it, and take a few minutes to read the documentation to make sure we understand what all the variables mean.

library(MASS)
data(Boston)

We’re going to try to predict the price, which is stored in Boston$medv, short for “median value”: the median price of all homes in the town, in thousands of dollars.

hist(Boston$medv)

Question: looking at this histogram, should we consider a variable transformation? Why or why not? Discuss.

You’ll see some analyses of this data set out there that take logarithms of these prices. We’ll leave them as is, if for no other reason than to keep our analysis roughly similar to the worked example in your textbook, but this kind of transformation is always worth considering when you have price or salary data.
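As an aside (not part of the original notes), here is a minimal sketch of what such a transformation looks like; the right skew of medv is visibly reduced on the log scale:

```r
# Hedged sketch: compare raw and log-transformed prices side by side.
library(MASS) # provides the Boston data

par(mfrow = c(1, 2))
hist(Boston$medv, main = "medv", xlab = "median value ($1000s)")
hist(log(Boston$medv), main = "log(medv)", xlab = "log median value")
par(mfrow = c(1, 1)) # reset the plotting layout
```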

Setting up the data

We’ll split the data into a train and a test set, to demonstrate the effects of overfitting. We’ll set aside one fifth of the data as the test set, and the rest will be the training set.

test_inds <- sample(1:nrow(Boston), nrow(Boston)/5)
Boston.test <- Boston[test_inds,]
Boston.train <- Boston[-test_inds,]

head(Boston.test)

##        crim zn  indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 262 0.53412 20  3.970    0 0.647 7.520 89.4 2.1398   5 264    13.0 388.37  7.26
## 335 0.03738  0  5.190    0 0.515 6.310 38.5 6.4584   5 224    20.2 389.40  6.75
## 249 0.16439 22  5.860    0 0.431 6.433 49.1 7.8265   7 330    19.1 374.71  9.52
## 287 0.01965 80  1.760    0 0.385 6.230 31.5 9.0892   1 241    18.2 341.60 12.93
## 42  0.12744  0  6.910    0 0.448 6.770  2.9 5.7209   3 233    17.9 385.41  4.84
## 452 5.44114  0 18.100    0 0.713 6.655 98.2 2.3552  24 666    20.2 355.29 17.73
##     medv
## 262 43.1
## 335 20.7
## 249 24.5
## 287 20.1
## 42  26.6
## 452 15.2

head(Boston.train)

##      crim   zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18.0 2.310    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0.0 7.070    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 4 0.03237  0.0 2.180    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0.0 2.180    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0.0 2.180    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
## 7 0.08829 12.5 7.870    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60 12.43
##   medv
## 1 24.0
## 2 21.6
## 4 33.4
## 5 36.2
## 6 28.7
## 7 22.9

Fitting the tree

Fitting a regression tree with rpart is largely analogous to fitting a logistic regression with glm. We specify a formula, a data set, and a method (really, a loss function).

rt <- rpart( medv ~ ., data=Boston.train, method='anova')
# method='anova' tells rpart to use the squared error loss we saw in lecture.

If we just naively print out the tree, we see a text representation of its node splits.

rt

## node), split, n, deviance, yval
##       * denotes terminal node
##
##  1) root 405 32961.0200 22.31309
##    2) lstat>=9.725 241  5911.4170 17.37095
##      4) lstat>=19.23 67  1280.6070 12.44776
##        8) nox>=0.603 49   508.2253 10.72245 *
##        9) nox< 0.603 18   229.4644 17.14444 *
##      5) lstat< 19.23 174  2381.5670 19.26667
##       10) lstat>=15 68    739.9024 17.02353 *
##       11) lstat< 15 106  1080.0170 20.70566 *
##    3) lstat< 9.725 164 12513.2000 29.57561
##      6) rm< 6.941 114  3578.7230 25.53596
##       12) age< 90.25 107  1642.4970 24.74019
##         24) rm< 6.531 67   619.3346 22.89104 *
##         25) rm>=6.531 40   410.3338 27.83750 *
##       13) age>=90.25 7    832.7200 37.70000 *
##      7) rm>=6.941 50  2832.5800 38.78600
##       14) rm< 7.437 28   622.8011 34.33214 *
##       15) rm>=7.437 22   947.4345 44.45455 *

Okay, you can probably figure out what’s going on there if you stare at it enough, but I would much rather just look at the tree itself…

Okay, but where are the labels? If you think back to our discussion of dendrograms and hierarchical clustering, you’ll recall that R is generally bad at handling trees. You need to tell R’s plotting functions about the text labels separately.

# uniform=TRUE helps to prevent the node labels from bumping up against
# each other.
plot(rt, uniform=TRUE, main='Regression tree for Boston housing prices')
text(rt, all=TRUE, use.n=TRUE, cex=.8)
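As an aside, if the base-graphics plot stays hard to read, the separate rpart.plot package (an extra install, not used in the original lab) draws the split labels directly on the nodes; a minimal sketch:

```r
# Assumes install.packages("rpart.plot") has been run.
library(MASS)
library(rpart)
library(rpart.plot)

rt <- rpart(medv ~ ., data = Boston, method = "anova")
# rpart.plot() labels each node itself, so no separate text() call is needed.
rpart.plot(rt, main = "Regression tree for Boston housing prices")
```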

Okay, still hard to read, but hopefully you get the idea by looking at this plot and the node splits printed out by R.


Did we overfit?

Okay, we fit a tree to the training data. Let’s see how we did. First of all, let’s check our MSE on the training data. We’ll compute our model’s predicted output on each of the training instances. The predict function works with rpart objects just like with the other models you’ve seen this semester.

yhat_train <- predict(rt, Boston.train)
train_resids <- yhat_train - Boston.train$medv # compute the residuals
# Now square them and take the mean to get the MSE.
# Note that if we just computed the RSS (i.e., no mean, just a sum), it would be
# harder to compare directly to the test set, which is of a different size.
mean(train_resids^2)

## [1] 14.7907

Now do the test set

Modify the code above to compute the MSE of our regression tree on the test set.

# TODO: code goes here.

Unless something really weird happened, you should see that the test-set MSE is a lot worse than the training MSE. Looks like we might have overfit…

Pruning the tree

Lucky for us, regression trees have a natural regularization method– we can make the tree less bushy by pruning it. In essence, this serves to “smooth out” the piecewise constant function that our tree encodes. We do that with the prune function. Note that there are other functions called prune (you’ll see this if you type ?prune), so you’re best off specifying that you mean rpart::prune.

To prune a tree, we just pass our tree into this function and specify the complexity parameter cp. Smaller values of this number correspond to more splits, i.e., more complicated models.

So let’s first make sure we understand that parameter. Every rpart object has an attribute called cptable, which shows how the number of splits (that is, the number of internal nodes) varies with the cp parameter. Each row pairs a number of splits (nsplit) with, roughly, the largest cp value that still buys you that many splits– any larger, and you can’t afford that many splits with your complexity “budget”. Note that the table also includes information about each model’s error, as well as information for doing CV (see Section 4 of the rpart long-form documentation; we return to this point briefly below).

rt$cptable

##           CP nsplit rel error    xerror       xstd
## 1 0.44101794      0 1.0000000 1.0081323 0.09436808
## 2 0.18512472      1 0.5589821 0.6586532 0.06538595
## 3 0.06823948      2 0.3738573 0.4914515 0.05572899
## 4 0.03829810      3 0.3056179 0.4310189 0.05381044
## 5 0.03347910      4 0.2673198 0.4135346 0.05543222
## 6 0.01859253      5 0.2338407 0.3944587 0.05543053
## 7 0.01703975      6 0.2152481 0.3751225 0.05330331
## 8 0.01647150      7 0.1982084 0.3727647 0.05331989
## 9 0.01000000      8 0.1817369 0.3461705 0.05294787

As cp gets smaller, the number of splits (nsplit) increases. If we ask for a tree with a particular cp value, rpart will go to our tree’s cptable and pick out the tree with the most splits whose cp score is no smaller than the one we asked for (i.e., no more complicated than our “budget” allows). So, for example, if we ask for cp=0.4, we just get back the tree with a single split (i.e., a root node with two children):

plot( rpart::prune(rt, cp=0.4) )

If we set cp=0.03, we get back 4 splits:

plot( rpart::prune(rt, cp=0.03) )

Looking at that tree, you can see that there are indeed four splits (i.e., four internal nodes– the root, the one on the right, and the two to the left of the root).

Now, we could use this to do CV just like we’ve seen previously (if we had K folds instead of just a train-test split), or we could use the “1-SE” (one standard error) rule, about which see the end of Section 6.1 in your textbook or refer to Section 4 of the rpart documentation.

Just for the sake of illustration, though, let’s try something reasonable and use the 4-split tree.

Modify the code above to measure the MSE of this pruned tree on the training and test data. Assess the presence or absence of overfitting. Is the overfitting at least less bad than in the tree we started with?

# TODO: code goes here.

Do the same with the 5-split tree and compare to four splits.

# TODO: code goes here.
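As a closing aside (not one of the exercises above), the cptable can also drive an automatic choice of cp, e.g. pruning at the cp value whose cross-validated error xerror is smallest; a minimal self-contained sketch:

```r
library(MASS)
library(rpart)

set.seed(1) # the xerror column comes from internal cross-validation
rt <- rpart(medv ~ ., data = Boston, method = "anova")

# Pick the cp value in the cptable with the smallest cross-validated error...
cp_best <- rt$cptable[which.min(rt$cptable[, "xerror"]), "CP"]

# ...and prune the tree back to that complexity.
rt_pruned <- rpart::prune(rt, cp = cp_best)

# Training MSE of the pruned tree (compare against the unpruned tree's MSE).
mean((predict(rt_pruned, Boston) - Boston$medv)^2)
```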
