---
title: "BAS 474 S19 Computational Takehome Exam"
author: "Guodong Ma"
date: "due 4/30/2019 11am"
---
```{r setup, include=FALSE}
#DO NOT CHANGE ANYTHING HERE
knitr::opts_chunk$set(echo = TRUE)
library(regclass)
library(caret)
library(arules)
library(pROC)
options(digits=7)
#If you need to re-generate your exam or data file, highlight the bit after the hashtag below and run that
#source("S19takehome.R"); make_my_examS19( 395989 )
```
Carefully follow the instructions on each question. You will be left-arrowing a total of 14 objects (Q1, Q2, Q3, Q4, Q5, Q6, etc.). Although it is recommended to left-arrow into these objects the result of running a command, e.g. Q1 <- mean(x), it is ok to hard-code the answers as long as all code working up to the answer is shown. If you hard-code, do not round. Each question is worth 7 points (you get 2 points for turning in the exam), but there is no partial credit. Make sure to save this .Rmd file and upload it to Canvas by the due date. It does not need to knit. However, do one last check of your code by clicking Run, Run All and making sure all parentheses etc. are matched up. If they aren't, the prompt on the command line looks like a + rather than a >.
This exam is strictly NO COLLABORATION. Do not communicate or discuss verbally, electronically, in writing, etc., with anyone except for Brian Stevens, Haileab Hilafu, and Adam Petrie. Otherwise, the exam is open notes. However, if you use any resources like books, websites, etc. that are NOT the class notes or assignment solutions, list them at the end of your document and on Canvas with your submission. In my opinion, referencing the summary .R files for clustering and association rules, and the homework/lab solutions for predictive analytics, will make the exam go quite smoothly!
***************
Question 1: TRANS is a transactional object with 652 transactions involving 100 possible items. Left-arrow into Q1 the NAME of the item, e.g. Item1, Item2, etc., in quotes, that appears LEAST frequently in the data. If there is a tie, choose only one of the names to left-arrow (the grading key will count any member of the tie correct). If you hard-code your answer, be sure to respect capitalization. When you enter the value on Canvas, quotes are not needed.
```{r}
#code
str(TRANS)
#Frequency (support) of each of the 100 items
itemFrequency(TRANS)
min(itemFrequency(TRANS))
#Identify which item(s) achieve the minimum frequency
which(itemFrequency(TRANS) == min(itemFrequency(TRANS)))
Q1 <- "Item48"
```

************

Question 2: create a ruleset from TRANS with:

* a minimum support of 12 transactions
* a minimum level of confidence of 0.49
* a minimum length of 2 and a maximum length of 4

Remove redundant and non-significant rules. Then, make a subset of rules whose confidences are less than 1. Consider all rules that have the largest value of SUPPORT (there may be only one). Left-arrow into Q2 the largest numerical value of the CONFIDENCE among those rules. Sanity check: after taking out redundant, non-significant, and 100% confidence rules, 26 rules should exist.

```{r}
#code
RULES <- apriori(TRANS, parameter = list(supp = 12/652, conf = 0.49, minlen = 2, maxlen = 4),
                 control = list(verbose = FALSE))
#Drop redundant rules, then keep only the statistically significant ones
RULES <- RULES[!is.redundant(RULES)]
RULES <- RULES[is.significant(RULES, TRANS)]
#Rules with confidence below 1; the sanity check says 26 should remain
rules_subset <- subset(RULES, confidence < 1)
length(rules_subset)
#Sort by support so the rules with the largest support appear first
inspect(sort(rules_subset, by = "support", decreasing = TRUE))
Q2 <- 0.9375
```

************

Questions 3 and 4: the DATA.FOR.CLUSTERING dataframe contains information on 6 quantities for 300 individuals. All characteristics have been scaled and transformed, so use the data as-is. Create a K-means clustering scheme with K=5, making sure to use iter.max=450 and nstart=150. Left-arrow into Q3 the number of individuals in the most populated cluster. Note: it is possible that a cluster contains only a single individual. Then, look at the cluster centers and left-arrow into Q4 the most positive value in the x2 column.

```{r}
#code
KMEANS <- kmeans(DATA.FOR.CLUSTERING, centers = 5, iter.max = 450, nstart = 150)
#Printing the object shows the cluster sizes and the cluster centers
KMEANS
Q3 <- 121
Q4 <- 2.73350294
```

************

Questions 5 and 6: now make a hierarchical clustering scheme using single linkage for measuring distances. Make your scheme have 5 clusters. Left-arrow into Q5 the number of individuals in the least populated cluster. Note: it is possible that one or more clusters contain a single individual. Run X <- DATA.FOR.CLUSTERING to copy the data into X, then add a column to X giving the cluster identities, then use aggregate() to find the cluster centers. Left-arrow into Q6 the most positive value of the clusters' average values of x2.

```{r}
#code
X <- DATA.FOR.CLUSTERING
#The question asks for single linkage
HC <- hclust(dist(X), method = "single")
plot(HC)
X$k5 <- cutree(HC, k = 5)
#Cluster sizes (least populated cluster for Q5) and cluster centers (for Q6)
table(X$k5)
aggregate(. ~ k5, data = X, FUN = mean)
Q5 <- Q6 <- mean(c(2.529716, 0.4404375))
```

************

Question 7: eliminating unnecessary and redundant predictors can help increase the performance of predictive models. Consider the dataframe DATA.FOR.CLEANING. Left-arrow into Q7 the total number of near-zero-variance predictors in the dataframe.

```{r}
#code
scale_dataframe <- function(DATA, except = NA) {
  column.classes <- unlist(lapply(DATA, FUN = class))
  numeric.columns <- which(column.classes %in% c("numeric", "integer"))
  if (!is.na(except)) {
    exemptions <- which(names(DATA) %in% except)
    if (length(exemptions) > 0) { numeric.columns <- setdiff(numeric.columns, exemptions) }
  }
  if (length(numeric.columns) == 0) { return(DATA) }
  DATA[, numeric.columns] <- as.data.frame(scale(DATA[, numeric.columns]))
  return(DATA)
}
X7 <- DATA.FOR.CLEANING
#Convert each row to proportions of its row total
for (i in 1:nrow(X7)) { X7[i, ] <- X7[i, ] / sum(X7[i, ]) }
summary(apply(X7, 1, sum))
X7$OTHER <- NULL
#Log-transform (the offset avoids log of zero), then center/scale the numeric columns
X7.SCALED <- scale_dataframe(log10(X7 + 0.01))
dim(X7.SCALED)
summary(X7)
```
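The chunk above cleans DATA.FOR.CLEANING but never actually counts the near-zero-variance predictors the question asks for. A minimal sketch of that count using caret's nearZeroVar(); applying it to the raw dataframe (rather than to X7 or X7.SCALED) is an assumption here, and the object name nzv is illustrative:

```{r}
#nearZeroVar() returns the column positions of near-zero-variance predictors;
#its length is the total count asked for
#(running it on raw DATA.FOR.CLEANING is an assumption)
nzv <- nearZeroVar(DATA.FOR.CLEANING)
length(nzv)
#Q7 <- length(nzv)
```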
************

The remaining questions deal with fitting predictive models with train() on PREDICTCLASS and PREDICTNUMBER, and then potentially using those models to make predictions on TESTCLASS or TESTNUMBER, respectively.

1) All dataframes will be used as-is; no transformations of any variables. All models predict y from all variables.
2) All models are to be fit on the entirety of PREDICTCLASS or PREDICTNUMBER; no splitting into training/holdout.
3) Note: in the past you ran fitControl <- trainControl(...) to set up how the models would estimate generalization error. Instead, fitControlRMSE, fitControlACCURACY, and fitControlAUC are defined in the environment for you, so all you need to do is set trControl= to whichever of those three objects is appropriate for the question asked.
4) You will need to set up the tuning grid appropriately to consider all combinations of the relevant tuning parameters.
5) Be sure to pass preProc=c("center","scale") to each model and any additional arguments when appropriate.
6) Be extra mindful of train()'s capitalization conventions for tuneGrid, preProc, and trControl. Capitalizing incorrectly does not give an error or warning and may output the wrong results. Double-check your work by printing the object created by train() to the screen and verifying you see: Pre-processing: centered (7), scaled (7) and Resampling: Cross-Validated (5 fold).
7) Make sure to set the random number seed using your personal random number seed ( 158396 ), preceding and on the same line as train(), as we have been doing on homeworks/labs.

Hedged sketches of one possible setup for each of these questions appear after Question 14.

************

Question 8: left-arrow into Q8 the estimated generalization error of the 'best' regularized linear regression model when training on PREDICTNUMBER. Audition all combinations of the following values of alpha: 0.3, 0.4 and lambda: 0.003, 0.032.

```{r}
#code
```

************

Question 9: left-arrow into Q9 the estimated AUC of the 'best' nearest neighbor model when training on PREDICTCLASS. Audition the following values of k: 9, 10, 15, 18.

```{r}
#code
```

************

Question 10: left-arrow into Q10 the 'best' cp parameter of a vanilla partition model when training on PREDICTCLASS and tuning on accuracy. Audition the following values of cp: 4e-04, 8e-04, 0.0013, 0.0032, 0.0063, 0.0079, 0.01.

```{r}
#code
```

************

Question 11: left-arrow into Q11 the estimated generalization error of the 'best' boosted tree (gbm) model when training on PREDICTNUMBER. Audition the following values of n.trees: 700 and 1000; interaction.depth: 2 and 4; shrinkage: 0.0075 and 0.025; n.minobsinnode: 5 and 8. Note: adding verbose=FALSE to train() is useful here so that you aren't buried in output printed to the screen.

```{r}
#code
```

************

Question 12: left-arrow into Q12 the estimated generalization error of the 'best' neural network (method='nnet') when training on PREDICTCLASS and tuning on AUC. Audition the following values of size: 3, 4, 7 and decay: 0.316, 3.162. Note: adding trace=FALSE to train() is useful here so that you aren't buried in output printed to the screen.

```{r}
#code
```

************

Question 13: fit a vanilla linear regression on the individuals in PREDICTNUMBER, then left-arrow into Q13 the actual RMSE of the model when it makes predictions on the individuals in TESTNUMBER.

```{r}
#code
```

************

Question 14: fit a vanilla logistic regression on the individuals in PREDICTCLASS, then left-arrow into Q14 the actual AUC of the model when it makes predictions on the individuals in TESTCLASS.

```{r}
#code
```
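A minimal sketch for Question 8, assuming the regularized linear regression is caret's method="glmnet", that the outcome column is named y (item 1 above says all models predict y from all variables), and that fitControlRMSE is defined as described; object names like glmnetGrid are illustrative:

```{r}
#All four alpha/lambda combinations
glmnetGrid <- expand.grid(alpha = c(0.3, 0.4), lambda = c(0.003, 0.032))
set.seed(158396); GLMNET <- train(y ~ ., data = PREDICTNUMBER, method = "glmnet",
  tuneGrid = glmnetGrid, trControl = fitControlRMSE, preProc = c("center", "scale"))
GLMNET
#Estimated generalization error (RMSE) of the best combination
#Q8 <- min(GLMNET$results$RMSE)
```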
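A sketch for Question 9 under the same assumptions, using method="knn"; passing metric="ROC" so that train() selects on AUC is an assumption about how fitControlAUC was set up:

```{r}
knnGrid <- expand.grid(k = c(9, 10, 15, 18))
set.seed(158396); KNN <- train(y ~ ., data = PREDICTCLASS, method = "knn",
  tuneGrid = knnGrid, trControl = fitControlAUC, metric = "ROC",
  preProc = c("center", "scale"))
KNN
#Estimated AUC of the best k
#Q9 <- max(KNN$results$ROC)
```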
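A sketch for Question 10, assuming "vanilla partition model" corresponds to method="rpart":

```{r}
rpartGrid <- expand.grid(cp = c(4e-04, 8e-04, 0.0013, 0.0032, 0.0063, 0.0079, 0.01))
set.seed(158396); TREE <- train(y ~ ., data = PREDICTCLASS, method = "rpart",
  tuneGrid = rpartGrid, trControl = fitControlACCURACY, preProc = c("center", "scale"))
TREE
#The winning cp is stored in bestTune
#Q10 <- TREE$bestTune$cp
```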
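A sketch for Question 11 (method="gbm"), with verbose=FALSE passed through to suppress the boosting output as the note suggests:

```{r}
#All 16 combinations of the four tuning parameters
gbmGrid <- expand.grid(n.trees = c(700, 1000), interaction.depth = c(2, 4),
                       shrinkage = c(0.0075, 0.025), n.minobsinnode = c(5, 8))
set.seed(158396); GBM <- train(y ~ ., data = PREDICTNUMBER, method = "gbm",
  tuneGrid = gbmGrid, trControl = fitControlRMSE, preProc = c("center", "scale"),
  verbose = FALSE)
GBM
#Q11 <- min(GBM$results$RMSE)
```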
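A sketch for Question 12 (method="nnet"), again assuming metric="ROC" pairs with fitControlAUC; trace=FALSE silences the fitting iterations:

```{r}
nnetGrid <- expand.grid(size = c(3, 4, 7), decay = c(0.316, 3.162))
set.seed(158396); NNET <- train(y ~ ., data = PREDICTCLASS, method = "nnet",
  tuneGrid = nnetGrid, trControl = fitControlAUC, metric = "ROC",
  preProc = c("center", "scale"), trace = FALSE)
NNET
#Estimated AUC of the best size/decay combination
#Q12 <- max(NNET$results$ROC)
```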
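A sketch for Question 13, taking "vanilla" to mean a plain lm() fit and assuming the outcome in PREDICTNUMBER and TESTNUMBER is named y:

```{r}
LM <- lm(y ~ ., data = PREDICTNUMBER)
test.preds <- predict(LM, newdata = TESTNUMBER)
#Actual RMSE on the holdout individuals
sqrt(mean((TESTNUMBER$y - test.preds)^2))
#Q13 <- sqrt(mean((TESTNUMBER$y - test.preds)^2))
```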
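A sketch for Question 14 using pROC (loaded in the setup chunk), assuming the two-class outcome is named y:

```{r}
GLM <- glm(y ~ ., data = PREDICTCLASS, family = binomial)
test.probs <- predict(GLM, newdata = TESTCLASS, type = "response")
#roc() builds the ROC curve from the true classes and predicted probabilities
ROC <- roc(TESTCLASS$y, test.probs)
auc(ROC)
#Q14 <- as.numeric(auc(ROC))
```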
************

Congratulations, you're done! Make sure to input your answers up on Canvas as well. Canvas MAY mark them correct, but it doesn't know the answers to your exam, so ignore the score it assigns.

List any additional resources you used here:

Grading Code (do not alter): gCgKShASSEDBj:RSUehePEOeT