[SOLVED] R math python statistic Problem Set 1


Problem Set 1
Machine Learning
PPHA 30545
Due: Tuesday, January 21
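We collect notation. For the regression model
$$Y_i = \beta_0 + X_{i,1}\beta_1 + \cdots + X_{i,p}\beta_p + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$
we have that $n$ is the number of observations, $p$ is the number of covariates, $E[\varepsilon_i] = 0$ and $V[\varepsilon_i] = \sigma^2$.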
Each question is equally weighted. You will be graded on the correctness of your code, but also on the clarity and quality of your plots and discussions.
1. This exercise aims at developing an understanding of the conservative nature of the Bonferroni correction, and at practicing the important skill of communicating statistical results and concepts using plots. The main objective of this exercise is to produce and discuss the plot required in 1.a and 1.b. Recall that for large $n$ the t-statistic corresponding to the null hypothesis $H_{0,j}: \beta_j = \beta_{j,0}$ has a Gaussian null distribution; specifically, it is distributed as the normal random variable
$$z_j \sim N(0, 1), \quad j = 1, \ldots, p. \tag{2}$$
1.a. Generate the following figure: for test statistics with null distribution (2), plot the probability of false rejection of the joint null that the $p$ individual null distributions are the true distributions at critical level 0.95 against the value of $p$, using the Bonferroni correction.
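1.b. For different values of $\rho$, generate the test statistics according to
$$\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_p \end{pmatrix} \sim N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}.$$
Overlay the probability of false rejection versus $p$ relationships for different values of $\rho$ atop the plot for 1.a. In other words, all probability of false rejection versus $p$ curves should be in the same plot.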

1.c. (bonus question) What is the biggest value of $\rho$ that you can use such that $\Sigma$ remains a covariance matrix?
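The quantities plotted in 1.a and 1.b can be sketched numerically as follows. This is a minimal sketch, not a reference solution, and it rests on several assumptions that are not stated in the handout: "critical level 0.95" is read as an overall size of alpha = 0.05, each individual test is two-sided and run at level alpha / p, and the covariance matrix in 1.b is taken to have ones on the diagonal and a common correlation rho (restricted here to 0 <= rho <= 1) off the diagonal. All function names are hypothetical, and only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05  # assumption: "critical level 0.95" read as overall size 1 - 0.95


def bonferroni_cutoff(p, alpha=ALPHA):
    """Two-sided critical value when each of p tests is run at level alpha / p."""
    return NormalDist().inv_cdf(1 - alpha / (2 * p))


def false_rejection_independent(p, alpha=ALPHA):
    """Exact P(at least one rejection) for p independent N(0, 1) statistics."""
    return 1 - (1 - alpha / p) ** p


def false_rejection_equicorrelated(p, rho, alpha=ALPHA, n_sim=20000, seed=0):
    """Monte Carlo estimate of P(at least one rejection), for 0 <= rho <= 1.

    Draws z_j = sqrt(rho) * w + sqrt(1 - rho) * e_j with w, e_j iid N(0, 1),
    so that Var(z_j) = 1 and Cov(z_j, z_k) = rho for j != k.
    """
    rng = random.Random(seed)
    cutoff = bonferroni_cutoff(p, alpha)
    a, b = math.sqrt(rho), math.sqrt(1 - rho)
    rejections = 0
    for _ in range(n_sim):
        w = rng.gauss(0, 1)  # shared factor that induces the correlation
        if any(abs(a * w + b * rng.gauss(0, 1)) > cutoff for _ in range(p)):
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    # One row per p, one column per rho; these are the curves to overlay.
    for p in (1, 5, 20, 50):
        row = [false_rejection_equicorrelated(p, rho) for rho in (0.0, 0.5, 0.9)]
        print(p, round(false_rejection_independent(p), 4), row)
```

The one-factor construction avoids a Cholesky decomposition, which keeps the sketch free of external libraries; a general covariance matrix would need a different sampler.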
2. This exercise aims at familiarizing yourself with the concepts of model selection as multiple hypothesis testing and out-of-sample fit criteria.
a. Download the data set Hitters from the ISLR library (the R CRAN library complementing the course's textbook).[1]
b. Divide the dataset into a training set of $n_{\text{train}}$ observations and a test set of $n_{\text{test}}$ observations. Obviously, $n_{\text{train}} + n_{\text{test}} = n$, but you need to choose $n_{\text{train}}$ and $n_{\text{test}}$. Explain the tradeoff between a big training/small test split and a small training/big test split, and why the values you chose are reasonable. Some R functions useful for this question are presented in section 6.5 of your textbook.
c. Using the training data, select the model made of the 7 coefficients with the smallest p-values according to the regression fit of the full model.
d. Using the training data, select the best model made of 7 coefficients according to the forward stepwise selection procedure (see p. 247 for hints). You do not need to code the procedure yourself (use an R package!), but explain what this procedure does and how it differs from the one in (c).
e. Using the training data, select the best model made of 7 coefficients according to the best subset procedure (see p. 244 for hints).
f. Compute the sample mean squared error in the test set for each method fitted in (c), (d) and (e), and collect the results in a table. Discuss.
g. bonus question: Repeat exercises (c) to (f) for different sizes of the subset of coefficients, and present your results in an extended table or plot.
h. bonus question: For selecting larger subsets with the best subset selection method, compare the performance of the package leaps with that of bestsubset.[2] Consider adding interactions.
i. bonus question: Can you suggest a more efficient way to split and use the data as training and testing sets?
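The split in (b) and the test-set error in (f) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `split_indices` and `mean_squared_error` are hypothetical, the split is a simple random partition of row indices, and the model fitting itself (done with an R package in the handout) is left out.

```python
import random


def split_indices(n, n_train, seed=0):
    """Randomly partition range(n) into n_train training and n - n_train test indices."""
    if not 0 < n_train < n:
        raise ValueError("need 0 < n_train < n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed so the split is reproducible
    return sorted(idx[:n_train]), sorted(idx[n_train:])


def mean_squared_error(y_true, y_pred):
    """Sample mean squared error: the average of (y - y_hat)^2 over the given rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true)
```

For example, `train_idx, test_idx = split_indices(10, 7)` yields 7 training rows and 3 test rows; each model would then be fitted on the training rows only and scored with `mean_squared_error` on the test rows.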
[1] The dataset will also be made available on Canvas for those who wish to use Python.
[2] Can be installed with the following commands:
library(devtools)
install_github(repo = "ryantibs/best-subset", subdir = "bestsubset")
You will also need to install the Gurobi solver.

3. Preliminary analysis of the bikeshare.csv dataset. This dataset will come back later when we have more advanced methods in our toolkit, but we can already proceed to a preliminary analysis of the data. Find the data description sheet for this dataset on the assignment release page on Canvas.[3]
a. The data has been aggregated to daily counts to run the simple regression daylm. What are the in-sample sum of squared errors and $R^2$ for this regression?
b. Write out the mathematical formula for daylm and describe it in words. Make sure to describe the probability model that is implied by the objective function we've minimized. Do you have any criticisms of this model?
c. A standardized residual for response $Y_i$ and fitted value $\hat{Y}_i$ is $r_i = (Y_i - \hat{Y}_i)/\hat{\sigma}$. Calculate the standardized residuals for daylm. Now, we'll call the outlier p-value $P(Z > |r_i|)$ where $Z \sim N(0, 1)$. In R, this is pnorm(-abs(stdresids)). Calculate these p-values. Describe what null hypothesis distribution they correspond to and why small values indicate a possible outlier day.
d. Plot the p-value distribution. What does it tell you about the assumptions of the probability model we used for our regression? Discuss.
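The calculation in (c) can be sketched outside R as follows. This is a minimal sketch under assumptions that the handout does not spell out: residuals are scaled by a single estimate of the error standard deviation, computed here as the square root of SSE / (n - number of fitted coefficients), and the outlier p-value is the upper-tail probability of a standard normal beyond the absolute residual. The function names and the `n_coef` argument are hypothetical.

```python
from statistics import NormalDist


def standardized_residuals(y, fitted, n_coef):
    """Return r_i = (y_i - fitted_i) / sigma_hat, with sigma_hat^2 = SSE / (n - n_coef)."""
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    sse = sum(e * e for e in resid)
    sigma_hat = (sse / (len(y) - n_coef)) ** 0.5  # assumed estimator of the error sd
    return [e / sigma_hat for e in resid]


def outlier_p_values(std_resids):
    """Return P(Z > |r_i|) for Z ~ N(0, 1), computed as the normal CDF at -|r_i|."""
    z = NormalDist()
    return [z.cdf(-abs(r)) for r in std_resids]
```

Under the Gaussian model these values lie between 0 and 0.5, and a residual far from zero in either direction gives a small value, which is what flags a day as a possible outlier.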
4. bonus question: Consider the drawing on the second to last slide of the deck of Lecture 2. Produce the equivalent drawing to illustrate the omitted variable bias phenomenon.
Hint: Think of $Y_i = X_{i,1}\beta_1 + X_{i,2}\beta_2 + \varepsilon_i$ as the long regression and $Y_i = X_{i,1}\beta_1 + \varepsilon_i$ as the short regression, and produce a drawing that has the projection of $Y$ on the span of $X_1$ and $X_2$, the projection of $Y$ on the span of $X_1$, and one other projection.
[3] This data set was put together for pedagogical purposes by Matt Taddy, who graciously shared it.
