IEOR 142: Introduction to Machine Learning and Data Analytics, Spring 2021
Practice Midterm Exam 1
March 2021
Name:
Instructions:
1. Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page.
2. You are allowed one (double sided) 8.5 x 11 inch note sheet and a simple pocket calculator. The use of any other note sheets, textbook, computer, cell phone, other electronic device besides a simple pocket calculator, or other study aid is not permitted.
3. You will have until 5:00PM to turn in the exam.
4. Whenever a question asks for a numerical answer (such as 2.7), you may write your answer as an expression involving simple arithmetic operations (such as 2(1) + 1(0.7)).
5. Good luck!
IEOR 142 Practice Midterm Exam, Page 2 of 16 March 2021
1 True/False and Multiple Choice Questions 48 Points
Instructions: Please circle exactly one response for each of the following 12 questions. Each question is worth 4 points. There will be no partial credit for these questions.
1. Suppose that we train a classification model that has accuracy equal to 1 (i.e., perfect 100% accuracy) on the test set, and that the test set contains at least one positive observation and at least one negative observation. Then the TPR (true positive rate) of that model on the test set is also equal to 1.
A. True
B. False
2. Suppose that we train a classification model that has accuracy equal to 0.99 on the test set, and that the test set contains at least one positive observation and at least one negative observation. Then, without any other information, the most definitive statement we can make about the TPR (true positive rate) of that model on the test set is:
A. The TPR is equal to 0.99
B. The TPR is equal to 1
C. The TPR is at least 0.90
D. The TPR is between 0 and 1
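Questions 1 and 2 both turn on how accuracy and TPR are computed from the same four confusion-matrix counts. The following is a minimal Python sketch of those definitions, not part of the original exam. The function name is illustrative; the counts are the test-set counts that appear later in this exam (the always-predict-no-cancellation baseline in Section 2, question 1, and the confusion matrix in Figure 5), with a cancellation treated as the positive class.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (accuracy, TPR, FPR) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total  # share of all observations classified correctly
    tpr = tp / (tp + fn)          # share of actual positives predicted positive
    fpr = fp / (fp + tn)          # share of actual negatives predicted positive
    return accuracy, tpr, fpr

# Logistic regression policy on the test set (Figure 5 counts).
logistic = classification_metrics(tp=9, fn=26, fp=69, tn=909)

# Baseline that never predicts a cancellation (35 positives, 978 negatives).
baseline = classification_metrics(tp=0, fn=35, fp=0, tn=978)

print("logistic:", logistic)
print("baseline:", baseline)
```

Accuracy pools both classes in one ratio, while TPR looks only at the actual positives, so the two numbers can differ widely when one class is rare.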
3. Consider two linear regression models trained on the same training set. Model A uses 15 independent variables and has a training set R² value of 0.79. Model B uses 10 independent variables and has a training set R² value of 0.68. Then, when comparing the two models on the same test set, Model A must have a higher value of OSR² than Model B.
A. True
B. False
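As a reminder of what is being compared in question 3, here is a small Python sketch of an out-of-sample R² computation. It assumes the common convention that the baseline in the denominator predicts the training-set mean for every test observation; the exam itself does not restate the definition, and the function name and data below are hypothetical.

```python
def out_of_sample_r2(y_test, y_pred, y_train_mean):
    """1 - SSE/SST on the test set, with the training mean as the baseline."""
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_test, y_pred))
    sst = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1 - sse / sst

# Hypothetical test responses, predictions, and training-set mean.
y_test = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 6.0]
osr2 = out_of_sample_r2(y_test, y_pred, y_train_mean=4.0)
print(osr2)
```

Only test-set responses and predictions enter the calculation, so the training-set R² values quoted in the question do not appear anywhere in it.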
4. The main purpose of bagging (bootstrap aggregating) is to estimate the out-of-sample error.
A. True
B. False
5. Boosting is inherently sequential since each new decision tree is trained in a way that uses information from the previously trained decision trees, whereas Random Forests is inherently parallelizable since each individual decision tree is trained independently of all the others.
A. True
B. False
6. In multiple linear regression (p > 1), it is possible for a subset of the independent variables to all have large VIF values and at the same time have somewhat small pairwise correlation values with each other.
A. True
B. False
7. Suppose that LFN = 2 and LFP = 1. Let p denote the probability that a given observation is a positive. Then, in order to minimize expected cost, an optimal policy is to assign an observation as a positive if and only if p is greater than 1/3.
A. True
B. False
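The expected-cost comparison behind question 7 can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original exam: it assumes correct predictions cost nothing, that a false negative costs l_fn and a false positive costs l_fp, and the function names are made up for the example.

```python
def expected_costs(p, l_fn, l_fp):
    """Expected cost of each decision when P(positive) = p."""
    cost_predict_positive = (1 - p) * l_fp  # wrong only if the truth is negative
    cost_predict_negative = p * l_fn        # wrong only if the truth is positive
    return cost_predict_positive, cost_predict_negative

def break_even_probability(l_fn, l_fp):
    """Value of p at which the two expected costs are equal."""
    return l_fp / (l_fp + l_fn)

threshold = break_even_probability(l_fn=2, l_fp=1)
cost_pos, cost_neg = expected_costs(threshold, l_fn=2, l_fp=1)
print(threshold, cost_pos, cost_neg)
```

With the values LFN = 11.34568 and LFP = 1 used for the CART model later in this exam, the same formula gives approximately 0.081, which is the threshold derived in Section 2, question 2.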
8. The Random Forests method tends to produce many uncorrelated trees (which are then averaged together) since:
A. Each individual tree is trained on a fresh bootstrap sample of the training set
B. When training each individual tree, only a randomly selected subset of the features is considered at each split
C. Both (A) and (B) are true
D. Both (A) and (B) are false
9. Suppose that we have a dataset consisting of n = 2,342 observation vectors x_i. We are interested in constructing between five and ten different clusters to assign each observation to. If we use the K-means algorithm for this task, then to select the final number of clusters K:
A. We must run the K-means algorithm twice, with K = 5 and then with K = 10
B. We must run the K-means algorithm only once with K = 10
C. We must run the K-means algorithm six times with K = 5, 6, 7, 8, 9, 10
D. The K-means algorithm will automatically choose the number of clusters K for us
10. Consider the following ROC curve based on a logistic regression model for predicting lung cancer, where having lung cancer is a positive outcome. The baseline is also drawn for comparison. Suppose that a doctor would like to minimize the number of times that she tells a patient that they do not have lung cancer when they actually do. At the same time, the doctor is willing to incorrectly tell a patient that they have lung cancer, when they actually do not, at most 50% of the time. Then, which point on the ROC curve should the doctor use to determine the correct threshold value?
[Figure: ROC curve with labeled points A, B, C, and D, and the baseline]
A. A
B. B
C. C
D. D
11. Suppose that, conditioned on Y = 1, X is normally distributed with mean 4 and variance 1. Similarly, conditioned on Y = 2, X is normally distributed with mean 5.5 and variance 1. Now, given a new observation X = x, we are interested in predicting whether Y = 1 or Y = 2. A threshold value of 4.35 is chosen, so that we predict Y = 2 if x ≥ 4.3 and Y = 1 if x < 4.3. This is represented pictorially in Figure 1.
[Figure 1]
Figure 2 defines five different shaded regions within Figure 1, and the letters refer also to their respective areas.
[Figure 2]
Suppose that Y = 1 corresponds to a positive outcome. Then the FPR (false positive rate) is equal to:
A. (C + D)/(A + B + C + D)
B. (C + D)/(A + B)
C. B/(D + E)
D. B/(B + D + E)
12. Figure 3 shows a time series plot for daily bike rentals in Washington DC's Capital Bikeshare system. Based on this plot, which of the following time series modeling methodologies are most appropriate:
A. A model with seasonality variables
B. An autoregressive model
C. A linear trend model
D. A model that incorporates all of the above
[Figure 3: time series plot of total_rentals (axis ticks 2000 to 8000) against Date (201101 to 201207)]
2 Short Answer Questions 52 Points
Instructions: Please provide justification and/or show your work for all questions, but please try to keep your responses brief. Your grade will depend on the clarity of your answers, the reasoning you have used, as well as the correctness of your answers.
The following questions are based on data from YourCabs.com, an online platform for matching the supply and demand for taxi cabs in Bangalore, India. Riders make booking requests on the YourCabs platform, and cab drivers are independent contractors who are linked to the riders via the YourCabs platform. Occasionally a matched driver may cancel a booked trip before the scheduled pick-up time.
Often, these cancellations occur at the last minute before the scheduled pick-up time, or in fact the cancellation is more aptly a no-show on the part of the driver. YourCabs would like to examine the use of machine learning models for predicting whether or not booking requests will ultimately result in a cancellation by the driver. YourCabs has collected data concerning 3,375 booking requests that occurred in a particular area in Bangalore during 2013, and this data is summarized in Table 1.
Table 1: Description of the dataset.
Variable             Description
VehicleModelId       Encodes the type of the driver's vehicle (one of 14 possible values)
OnlineBooking        1 if the booking was made on the regular website, 0 if not
MobileSiteBooking    1 if the booking was made on the mobile version of the website, 0 if not
BookingDateTime      Date and time that the booking was made (stored as a timestamp string such as 1/3/2013 19:13)
TripDateTime         Scheduled date and time of the start time of the trip (stored as a timestamp string such as 1/3/2013 19:13)
Cancellation         1 if this booking request resulted in a cancellation by the driver, and 0 if not
Please answer the following questions.
1. (6 points) The dataset was randomly split into a training set and a test set, with 2,362 (about 70%) of the observations placed in the training set and 1,013 (about 30%) of the observations placed in the test set. Of the 2,362 total observations in the training set, only 80 observations correspond to cancellations while the remaining 2,282 observations were not cancellations. Of the 1,013 total observations in the test set, only 35 observations correspond to cancellations while the remaining 978 observations were not cancellations.
(a) (3 points) Consider a baseline model that does not use any features at all. What is the appropriate baseline model for this dataset?
Solution: Since the training set contains more observations that were not cancellations (2,282) than observations that were cancellations (80), the baseline model is to always predict that there will be no cancellation.
(b) (3 points) What is the accuracy of the baseline model you selected in part (a) on the test set? What is its TPR (true positive rate) on the test set? What is its FPR (false positive rate) on the test set?
Solution: The baseline never predicts a cancellation, so it is correct on exactly the 978 test observations that were not cancellations. The accuracy is 978/1,013 (about 0.965), the TPR is 0, and the FPR is 0.
2. (8 points) YourCabs has recently begun exploring the idea of reassigning booking requests that are likely to result in cancellations to more reliable drivers. Towards this end, YourCabs has compiled a curated list of such reliable drivers, and, on average, drivers on this list cancel bookings only 0.1% of the time. However, there is a cost to reassigning booking requests. Namely, the original driver may be upset that the request was lost and may end up leaving the platform forever with a certain probability. YourCabs estimates that the average cost per reassignment is $10 (USD for simplicity). Naturally there is also a cost associated with each cancellation, due to the lost revenue that is incurred if the rider stops using the platform. YourCabs estimates that this average cost per cancellation is $100. Finally, YourCabs estimates that the average profit per successful ride is $25.
A decision tree capturing this analysis is shown in Figure 4. Determine a threshold value, pthresh, such that it is optimal (with regard to expected profit) to reassign a booking to a more reliable driver if and only if the probability of cancellation p exceeds pthresh.
[Figure 4]
Figure 4: Decision tree for possibly reassigning a booking to a more reliable driver.
The leaf nodes represent profit values and p represents the probability that the booking request would result in a cancellation by the original driver.
Solution: The threshold value, pthresh, is the value of p such that the two choices lead to equal expected profit. The reassignment cost of 10 is paid in both outcomes of the reassign branch, so that branch yields -100 - 10 = -110 on a cancellation (probability 0.001) and 25 - 10 = 15 on a successful ride (probability 0.999). Thus we have:
$$0.001(-110) + 0.999(15) = p(-100) + (1 - p)(25)$$
$$14.875 = 25 - 125p$$
$$p_{\text{thresh}} = \frac{25 - 14.875}{125} = 0.081$$
3. (20 points) As a first pass, a logistic regression model was built (using the training data) to predict Cancellation based on the features VehicleModelId, OnlineBooking, and MobileSiteBooking. The corresponding R code and its output are shown below.
> log.mod1 <- glm(Cancellation ~ VehicleModelId + OnlineBooking + MobileSiteBooking,
+                 data = bookings.train,
+                 family = "binomial")
> summary(log.mod1)
Call:
glm(formula = Cancellation ~ VehicleModelId + OnlineBooking +
MobileSiteBooking, family = binomial, data = bookings.train)
Deviance Residuals:
      Min        1Q    Median        3Q       Max
 -0.47911  -0.36831  -0.16594  -0.00007   3.00057
Coefficients:
                   Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        -21.9830   7811.7349   -0.003     0.998
VehicleModelId12    17.7046   7811.7349    0.002     0.998
VehicleModelId17     0.3730  14433.6515    0.000     1.000
VehicleModelId23     0.3039   8411.4265    0.000     1.000
VehicleModelId24     0.2656   7939.7061    0.000     1.000
VehicleModelId28    16.6912   7811.7349    0.002     0.998
VehicleModelId64     0.6338  12587.2832    0.000     1.000
VehicleModelId65     0.4756   7995.5594    0.000     1.000
VehicleModelId85     0.3798   7904.2323    0.000     1.000
VehicleModelId86     1.1222   9541.5994    0.000     1.000
VehicleModelId87     0.1810   9248.4331    0.000     1.000
VehicleModelId89    17.4925   7811.7349    0.002     0.998
VehicleModelId90     0.1935   8331.4000    0.000     1.000
VehicleModelId91     0.6338  12587.2832    0.000     1.000
OnlineBooking        1.6218      0.3482    4.657  3.20e-06 ***
MobileSiteBooking    2.1716      0.4139    5.246  1.55e-07 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 698.90  on 2361  degrees of freedom
Residual deviance: 612.88  on 2346  degrees of freedom
AIC: 644.88
Number of Fisher Scoring iterations: 19
Please answer the following questions concerning this logistic regression model.
(a) (4 points) Recall that there are 14 possible values for VehicleModelId in this dataset, and note that there are only 13 coefficients associated with VehicleModelId above. Briefly explain why there is no coefficient associated with VehicleModelId10 above.
Solution: VehicleModelId is a categorical variable, and by default R uses a dummy coding where one of the possible values for VehicleModelId (in this case the ID is 10) is incorporated into the intercept term. This makes sense because, given that the VehicleModelId is not one of the other 13 values, then it must be equal to 10 and so an additional feature associated with VehicleModelId10 would not provide any additional information.
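The dummy coding described in this solution can be illustrated with a short Python sketch. It is an illustration only, not the code R runs internally: the helper name is hypothetical, and only three of the 14 VehicleModelId levels are used to keep the output small. The first level plays the role of the reference level that is absorbed into the intercept.

```python
def dummy_code(values, levels, prefix):
    """Encode a categorical column as 0/1 indicators, dropping the first level."""
    reference = levels[0]          # reference level, absorbed into the intercept
    indicator_levels = levels[1:]  # one 0/1 column for each remaining level
    rows = []
    for value in values:
        rows.append({f"{prefix}{level}": int(value == level)
                     for level in indicator_levels})
    return reference, rows

# Three of the exam's 14 levels, for brevity.
levels = [10, 12, 17]
reference, rows = dummy_code([10, 12, 17], levels, prefix="VehicleModelId")

print("reference level:", reference)
for row in rows:
    print(row)
```

The row for level 10 has every indicator equal to 0, which is why a separate VehicleModelId10 column would add no information.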
(b) (4 points) A second logistic regression model (with VehicleModelId removed) was built (using the training data) to predict Cancellation based on the features OnlineBooking and MobileSiteBooking. The corresponding R code and its output are shown below.
> log.mod2 <- glm(Cancellation ~ OnlineBooking + MobileSiteBooking,
+                 data = bookings.train,
+                 family = "binomial")
> summary(log.mod2)
Call:
glm(formula = Cancellation ~ OnlineBooking + MobileSiteBooking,
family = binomial, data = bookings.train)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -0.4290  -0.3157  -0.3157  -0.1371   3.0568
Coefficients:
                   Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)         -4.6625      0.3177  -14.675   < 2e-16 ***
OnlineBooking        1.6883      0.3470    4.865  1.14e-06 ***
MobileSiteBooking    2.3231      0.4117    5.643  1.67e-08 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 698.90  on 2361  degrees of freedom
Residual deviance: 653.63  on 2359  degrees of freedom
AIC: 659.63
Number of Fisher Scoring iterations: 7
Briefly explain the reasoning for removing VehicleModelId.
Solution: None of the coefficients corresponding to this factor variable is statistically significant in the first model (the p-values are all very close to 1).
(c) (6 points) Suppose that a new rider makes a booking on the mobile site and that the assigned driver has a vehicle model with ID 65. Using the second logistic regression model from part (b), make a prediction for the probability that this driver will cancel the booking.
Solution: For this booking, VehicleModelId65 = 1 and MobileSiteBooking = 1. The second model does not include VehicleModelId, so the only non-zero independent variable that enters the prediction is MobileSiteBooking = 1. Therefore the predicted probability is given by the equation:
$$P(Y = 1 \mid X = x) = \frac{1}{1 + e^{-(-4.6625 + 2.3231(1))}} = 0.0879$$
(d) (3 points) Figure 5 displays a confusion matrix, computed on the test set, for the logistic regression model. Here, the optimal threshold value of pthresh derived in question (2) was used. Note that the columns correspond to whether or not the booking would have been reassigned according to the policy described in question (2). (As a reminder, the observations all occurred before YourCabs started considering the possibility of reassigning booking requests.)
                   Do Not Reassign Booking    Reassign Booking
No Cancellation    909                        69
Cancellation       26                         9
Figure 5: Confusion Matrix for Logistic Regression Model With Threshold pthresh
What is the accuracy, TPR, and FPR of the logistic regression model with threshold pthresh?
Solution: Accuracy is $\frac{909 + 9}{909 + 9 + 26 + 69} = \frac{918}{1013}$, TPR is $\frac{9}{26 + 9} = \frac{9}{35}$, and FPR is $\frac{69}{909 + 69} = \frac{69}{978}$.
(e) (3 points) Using the information given in question (2) and part (d) above, what is the total test set profit of the policy implied by the logistic regression model with threshold equal to pthresh? For this problem, you may assume that whenever a booking request is reassigned YourCabs deterministically makes a profit equal to the expected profit gained per reassignment.
Solution: The expected profit gained per reassignment, in dollars, is
$$0.001(-110) + 0.999(15) = 14.875$$
Given the assumption, each of the 69 + 9 reassigned bookings earns 14.875, each of the 909 bookings that were not reassigned and not cancelled earns 25, and each of the 26 bookings that were not reassigned and were cancelled loses 100. The total test set profit, in dollars, is:
$$14.875(69 + 9) + 25(909) - 100(26) = 21285.25$$
4. (10 points) Next, a CART model was built (using the training data) to predict Cancellation based on the features VehicleModelId, OnlineBooking, and MobileSiteBooking. The values of LFN and LFP were set to LFN = 11.34568 and LFP = 1, which correspond to values such that minimizing loss is equivalent to maximizing profit according to the decision tree in question (2). 10-fold cross-validation was used to select the cp parameter, and the output of the corresponding R code is shown below.
CART
2362 samples
3 predictor
2 classes: 0, 1
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 2126, 2126, 2126, 2125, 2126, 2125, …
Resampling results across tuning parameters:
cp     AvgLoss    Accuracy
0.000  0.3633967  0.9212705
0.001  0.3617089  0.9229582
0.002  0.3617089  0.9229582
0.003  0.3617089  0.9229582
0.004  0.3617089  0.9229582
0.005  0.3617089  0.9229582
0.006  0.3617089  0.9229582
0.007  0.3617089  0.9229582
0.008  0.3617089  0.9229582
0.009  0.3617089  0.9229582
0.010  0.3617089  0.9229582
0.011  0.3617089  0.9229582
0.012  0.3723178  0.9255006
0.013  0.3723178  0.9255006
0.014  0.3723178  0.9255006
0.015  0.3723178  0.9255006
0.016  0.3785430  0.9280430
0.017  0.3892983  0.9348226
0.018  0.3938286  0.9390599
0.019  0.3938286  0.9390599
0.020  0.3962402  0.9454159
0.021  0.3962402  0.9454159
0.022  0.3959629  0.9500769
0.023  0.3959629  0.9500769
0.024  0.3944210  0.9559841
0.025  0.3944210  0.9559841
0.026  0.3944210  0.9559841
0.027  0.3944210  0.9559841
0.028  0.3944210  0.9559841
0.029  0.3944210  0.9559841
0.030  0.3944210  0.9559841
0.031  0.3944210  0.9559841
0.032  0.3944210  0.9559841
0.033  0.3944210  0.9559841
0.034  0.3944210  0.9559841
0.035  0.3944210  0.9559841
0.036  0.3944210  0.9559841
0.037  0.3944210  0.9559841
0.038  0.3944210  0.9559841
0.039  0.3944210  0.9559841
0.040  0.3897600  0.9606451
Please answer the following questions concerning the CART model.
(a) (3 points) Using the output above, select a value of cp based on the average loss criterion.
Solution: The smallest average loss value, 0.3617089, is attained for every cp from 0.001 to 0.011. Among these tied values, cp = 0.011 is the largest and therefore corresponds to the simplest tree, so we select cp = 0.011.
(b) (3 points) Using the output above, select a value of cp based on the accuracy criterion.
Solution: cp = 0.040 corresponds to the largest accuracy value, 0.9606451.
(c) (4 points) The tree corresponding to one of the previously selected models is displayed in Figure 6. Describe in words the type of bookings that the CART tree predicts to be cancellations.
[Figure 6: decision tree with splits VehicleModelId12 < 0.5, OnlineBooking < 0.5, and MobileSiteBooking < 0.5, and leaf predictions 0, 0, 0, and 1]
Figure 6: CART Model.
Solution: The only way to predict a cancellation is if VehicleModelId12 is NOT less than 0.5, OnlineBooking is less than 0.5, and MobileSiteBooking is NOT less than 0.5. Since these are binary variables, this corresponds to VehicleModelId12 = 1, OnlineBooking = 0, and MobileSiteBooking = 1. Therefore, the bookings that the CART tree predicts to be cancellations are those that are booked on the mobile site and for which the driver has a vehicle of model 12.
5. (8 points) So far, we have not used the columns BookingDateTime or TripDateTime in any of the models. Describe two distinct ways to construct numerical feature(s) based on BookingDateTime and/or TripDateTime that may be incorporated into any of the previous models. Note that for full credit at least one of your feature generation methods should incorporate both BookingDateTime and TripDateTime.
Solution: One possibility is to create a factor variable based on the day of the week that the trip is scheduled for, i.e., the day of week derived from TripDateTime. This factor variable can be made into a set of numerical features in the usual way with a dummy coding scheme.
Another possibility is to model the lag between the trip time and the booking time by creating a numerical variable equal to the difference (say, in minutes) between TripDateTime and BookingDateTime. This one incorporates both BookingDateTime and TripDateTime.
Both of these derived features may be incorporated into any of the predictive models discussed in the exam and also more broadly in class.
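The two derived features in the question 5 solution can be sketched in Python as follows. This is one possible illustration rather than the exam's own code. It assumes the timestamp strings use a month/day/year hour:minute layout (the example 1/3/2013 19:13 is ambiguous between month/day and day/month), and the function name, the returned field names, and the trip time in the example are hypothetical.

```python
from datetime import datetime

# Assumed layout of the timestamp strings: month/day/year hour:minute.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def derive_features(booking_date_time, trip_date_time):
    """Build the trip's day of week and the booking-to-trip lag in minutes."""
    booking = datetime.strptime(booking_date_time, TIMESTAMP_FORMAT)
    trip = datetime.strptime(trip_date_time, TIMESTAMP_FORMAT)
    lag_minutes = (trip - booking).total_seconds() / 60
    trip_weekday = WEEKDAY_NAMES[trip.weekday()]  # weekday() has Monday == 0
    return {"trip_weekday": trip_weekday, "lag_minutes": lag_minutes}

# Booking time from Table 1's example string; the trip time is made up.
features = derive_features("1/3/2013 19:13", "1/4/2013 08:00")
print(features)
```

The day-of-week value is still categorical, so it would be dummy coded before being passed to a model, while the lag in minutes can be used directly as a numerical feature.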