Make sure that you upload a knitted pdf or html file to the canvas page (this should have a .pdf or .html extension). Also upload the .Rmd file. Include output for each question in its own individual code chunk and dont print out any vector that has more than 20 elements.
Objectives: KNN Classification and Cross-Validation
Background
Today well be using the Weekly dataset from the ISLR package. This data is similar to the Smarket data from class. The dataset contains 1089 weekly returns from the beginning of 1990 to the end of 2010. Make sure that you have the ISLR package installed and loaded by running (without the code commented out) the following:
# install.packages(ISLR) library(ISLR)
## Warning: package ISLR was built under R version 4.0.2
Wed like to see if we can accurately predict the direction of a weeks return based on the returns over the last five weeks. Today gives the percentage return for the week considered and Year provides the year that the observation was recorded. Lag1 Lag5 give the percentage return for 1 5 weeks previous and Direction is a factor variable indicating the direction (UP or DOWN) of the return for the week considered.
Part 1: Visualizing the relationship between this weeks returns and the previous weeks returns.
- Explore the relationship between a weeks return and the previous weeks return. You should plot more graphs for yourself, but include in the lab write-up a scatterplot of the returns for the weeks considered (Today) vs the return from two weeks previous (Lag2), and side-by-side boxplots for the lag one week previous (Lag1) divided by the direction of this weeks Reuther (Direction).
Returns
Two Weeks Ago
Returns
Direction
Part 2: Building a classifier
Recall the KNN procedure. We classify a new point with the following steps: Calculate the Euclidean distance between the new point and all other points.
- Create the set Nnew containing the K closest points (or, nearest neighbors) to the new point.
- Determine the number of UPs and DOWNs in Nnew and classify the new point according to the most frequent.
- Wed like to perform KNN on the Weekly data, as we did with the Smarket data in class. In class we wrote the following function which takes as input a new point (Lag1new,Lag2new) and provides the KNN decision using as defaults K = 5, Lag1 data given in Smarket$Lag1, and Lag2 data given in Smarket$Lag2. Update the function to calculate the KNN decision for weekly market direction using the Weekly dataset with Lag1 Lag5 as predictors. Your function should have only three input values: (1) a new point which should be a vector of length 5, (2) a value for K, and (3) the Lag data which should be a data frame with five columns (and n rows).
KNN.decision <- function(Lag1.new, Lag2.new, K = 5, Lag1 = Smarket$Lag1, Lag2 = Smarket$ n <- length(Lag1)stopifnot(length(Lag2) == n, length(Lag1.new) == 1, length(Lag2.new) == 1, K <= n)dists <- sqrt((Lag1Lag1.new)^2 + (Lag2Lag2.new)^2) neighbors <- order(dists)[1:K] neighb.dir <- Smarket$Direction[neighbors] choice <- names(which.max(table(neighb.dir))) return(choice)} |
Lag2) {
- Now train your model using data from 1990 2008 and use the data from 2009-2010 as test data. To do this, divide the data into two data frames, test and train. Then write a loop that iterates over the test points in the test dataset calculating a prediction for each based on the training data with K = 5. Save these predictions in a vector. Finally, calculate your test error, which you should store as a variable named error. The test error calculates the proportion of your predictions which are incorrect (dont match the actual directions).
- Do the same thing as in question 3, but instead use K = 3. Which has a lower test error?
Part 3: Cross-validation
Ideally wed like to use our model to predict future returns, but how do we know which value of K to choose? We could choose the best value of K by training with data from 1990 2008, testing with the 2009 2010 data, and selecting the model with the lowest test error as in the previous section. However, in order to build the best model, wed like to use ALL the data we have to train the model. In this case, we could use all of the Weekly data and choose the best model by comparing the training error, but unfortunately this isnt usually a good predictor of the test error.
In this section, we instead consider a class of methods that estimate the test error rate by holding out a
(random) subset of the data to use as a test set, which is called k-fold cross validation. (Note this lower case k is different than the upper case K in KNN. They have nothing to do with each other, it just happens that the standard is to use the same letter in both.) This approach involves randomly dividing the set of observations into k groups, or folds, of equal size. The first fold is treated as a test set, and the model is fit on the remaining k 1 folds. The error rate, ERR1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a test set. This process results in k estimates of the test error: ERR1, ERR2, , ERRk. The k-fold CV estimate of the test error is computed by averaging these values,
k
1 X
CV(k) = ERRk.
k
i=1
Well run a 9-fold cross-validation in the following. Note that we have 1089 rows in the dataset, so each fold will have exactly 121 members.
- Create a vector fold which has n elements, where n is the number of rows in Weekly. Wed like for the fold vector to take values in 1-9 which assign each corresponding row of the Weekly dataset to a fold. Do this in two steps: (1) create a vector using rep() with the values 1-9 each repeated 121 times (note 1089 = 121 9), and (2) use sample() to randomly reorder the vector you created in (1).
- Iterate over the 9 folds, treating a different fold as the test set and all others the training set in each iteration. Using a KNN classifier with K = 5 calculate the test error for each fold. Then calculate the cross-validation approximation to the test error which is the average of ERR1, ERR2, , ERR9.
- Repeat step (6) for K = 1, K = 3, and K = 7. For which set is the cross-validation approximation to the test error the lowest?
Reviews
There are no reviews yet.