Final 291, Section 2 (300 points)
1) A school has 1600 students who are going to vote on whether to convert the school completely off fossil fuels. How many students would you have to poll to be 95% confident of the outcome within +/- 0.5% of the vote? (25 points)
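One way to set up the calculation in problem 1 is sketched below. It is only a sketch under stated assumptions: it uses the worst-case proportion p = 0.5, a normal approximation, and a finite population correction because the school has only 1600 students. The function name `required_sample_size` is made up for this illustration.

```python
import math
from statistics import NormalDist


def required_sample_size(population, margin, confidence, p=0.5):
    """Sample size for estimating a proportion to within +/- margin.

    p=0.5 is an assumption: it is the worst case (largest variance).
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size if the population were effectively infinite.
    n0 = z**2 * p * (1 - p) / margin**2
    # Finite population correction: polling without replacement from a
    # small population needs fewer respondents than n0.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


n = required_sample_size(population=1600, margin=0.005, confidence=0.95)
print(n)
```

Without the correction step the formula asks for far more respondents than the school has students, which is why the correction matters at this margin.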
2) Earthquakes can be broken into two classes based on the direction the earth moves when it fractures. The classes can be compared across time: for each small time period, record whether an earthquake occurs in a given region and whether one occurs in the next small time period, then add up the counts across several time periods. Below is a table for a region off Indonesia over a 30-year period.
| | Second time period, earthquake happens | Second time period, no earthquake happens | Marginal sums |
|---|---|---|---|
| First time period, earthquake happens | 148 | 274 | 422 |
| First time period, no earthquake happens | 276 | 2626 | 2902 |
| Marginal sums | 424 | 2900 | 3324 |
Is the occurrence of an earthquake in one time period statistically independent of the occurrence of an earthquake in the next time period? Test at the .01 level. (20 points)
These are earthquakes of the same type. Which cells have a higher count than expected if independence is true? (Use the deviation table.) (10 points)
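The test and the deviation table asked for in problem 2 could be computed along the following lines. This is a sketch, assuming a Pearson chi-square test without continuity correction; the function name `independence_test` is made up, and the closed-form p-value is valid only for 1 degree of freedom (a 2x2 table).

```python
import math


def independence_test(table):
    """Pearson chi-square test for a 2x2 table given as [[a, b], [c, d]]."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    # Deviation table: observed minus expected, cell by cell.
    deviations = [
        [o - e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    chi2 = sum(
        d * d / e
        for dev_row, exp_row in zip(deviations, expected)
        for d, e in zip(dev_row, exp_row)
    )
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, deviations


same_type = [[148, 274], [276, 2626]]  # counts from the table in problem 2
chi2, p_value, deviations = independence_test(same_type)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3g}")
for row in deviations:
    print([round(d, 2) for d in row])
print("reject independence at .01:", p_value < 0.01)
```

Cells with a positive entry in the printed deviation table are the ones that occur more often than independence would predict.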
3) The table below has the same layout as the one in problem 2, but it compares cases where earthquakes of different types follow one another.
| | Second time period, earthquake happens | Second time period, no earthquake happens | Marginal sums |
|---|---|---|---|
| First time period, earthquake happens | 5 | 314 | 319 |
| First time period, no earthquake happens | 314 | 2691 | 3005 |
| Marginal sums | 319 | 3005 | 3324 |
Are they statistically independent now? (Test at the .01 level again.) (16 points)
How do the deviations from expectation under independence differ from those in the table in problem 2? (Hint: look at the pattern of pluses and minuses.) (8 points)
If you think about what each cell means, what do these differences mean in terms of the way the two types of earthquakes interact? (6 points)
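The hint about pluses and minuses can be made concrete with a small helper that prints only the sign of observed minus expected for each cell. This is an illustrative sketch; `deviation_signs` is a made-up name, and the two tables are copied from problems 2 and 3.

```python
def deviation_signs(table):
    """Return '+' or '-' per cell for observed minus expected under independence."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [
        ["+" if obs > r * c / total else "-" for obs, c in zip(row, col_sums)]
        for row, r in zip(table, row_sums)
    ]


same_type = [[148, 274], [276, 2626]]     # table from problem 2
different_type = [[5, 314], [314, 2691]]  # table from problem 3

for name, table in [("same type", same_type), ("different type", different_type)]:
    print(name)
    for row in deviation_signs(table):
        print(" ", " ".join(row))
```

Comparing the two printed patterns side by side is the starting point for interpreting how the two types of earthquakes interact.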
4) Using the NCI60 data set from ISLR (100 points):
a. Identify the cancer types with more than 3 cell lines present.
b. From those, identify cancers with hyperactive or hypoactive genes at the 0.2 FDR level (not independent).
c. Identify common genes between every pair of the cancers identified in b.
d. Are there any genes that are unusually active in three of the cancers at once?
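The FDR step in part b could be sketched as a step-up procedure over per-gene p-values. The code below assumes that "not independent" calls for the Benjamini-Yekutieli correction, which is one reading of the problem and not something the problem statement confirms; setting `dependent=False` gives plain Benjamini-Hochberg. The function name and the p-values are made up for illustration and are not NCI60 results.

```python
def fdr_rejections(p_values, q=0.2, dependent=True):
    """Indices of hypotheses rejected by a step-up FDR procedure.

    dependent=False: Benjamini-Hochberg.
    dependent=True:  Benjamini-Yekutieli, which divides q by sum(1/i) so the
                     FDR bound holds under arbitrary dependence between tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    correction = sum(1 / i for i in range(1, m + 1)) if dependent else 1.0
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is <= k * q / (m * correction).
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * correction):
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical per-gene p-values, for illustration only.
example_p = [0.001, 0.30, 0.012, 0.65, 0.05, 0.90]
bh = fdr_rejections(example_p, q=0.2, dependent=False)
by = fdr_rejections(example_p, q=0.2, dependent=True)
print("Benjamini-Hochberg rejects:", bh)
print("Benjamini-Yekutieli rejects:", by)
```

For parts c and d, the rejected index lists for each cancer type can be turned into Python sets and intersected pairwise or three at a time.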
5) The diabetes data set is a prospective study of the onset of adult diabetes, given a number of risk factors, among the Pima Indian tribe. Using the diabetes.csv data set (100 points):
a. Separate the first half of the data from the second half; use the first half for training and the second half for testing.
b. Using the training data
i. Construct the full logistic regression model for outcome
ii. Using backwards selection, construct the logistic regression model in which every coefficient has a p-value < .05 (Show Steps!!!)
c. Predict the “response” (e.g., type="response") from the full logistic regression model for
i. the training data set,
ii. the test data set,
d. Predict the “response” from the smallest logistic model from the backwards selection exercise for
i. the training data set,
ii. the test data set,
e. Using random forest, build a model on the training data
f. You now have 3 models: full logistic, smallest logistic, and random forest. For the predictions of each, calculate and tabulate
i. Number of correct positives
ii. Number of False positives
iii. Number of correct negatives
iv. Number of false negatives.
g. Using the results of f, is there one of the 3 methods that appears best at modeling new results, or does it depend on whether it is more important to identify positives (predict diabetes) or negatives (predict health)?
h. Now redo the analysis twice, using a random selection of 384 out of 768 rows for training and the complement for testing. Is there anything you can conclude from this additional information about the merits of each approach?
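The tabulation in part f and the random split in part h do not depend on which model produced the predictions, so they can be sketched separately from the model fitting. The code below assumes outcomes coded 0/1 and a 0.5 cutoff on the predicted probability; the cutoff is an assumption, since the problem does not state one. The function names and the sample values are made up for illustration.

```python
import random


def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Tabulate correct/false positives and negatives for 0/1 outcomes."""
    counts = {"correct_pos": 0, "false_pos": 0, "correct_neg": 0, "false_neg": 0}
    for y, prob in zip(actual, predicted_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1:
            counts["correct_pos" if y == 1 else "false_pos"] += 1
        else:
            counts["correct_neg" if y == 0 else "false_neg"] += 1
    return counts


def random_split(n_rows, n_train, seed):
    """Return (train_indices, test_indices); test is the complement of train."""
    rng = random.Random(seed)
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test


# Made-up outcomes and predicted probabilities, for illustration only.
actual = [1, 0, 1, 0, 0, 1]
probs = [0.81, 0.40, 0.35, 0.62, 0.10, 0.55]
counts = confusion_counts(actual, probs)
print(counts)

# Part h: 384 of 768 rows for training, the rest for testing.
train_idx, test_idx = random_split(n_rows=768, n_train=384, seed=1)
print(len(train_idx), len(test_idx))
```

Calling `random_split` with two different seeds gives the two repetitions asked for in part h, and fixing the seeds keeps the comparison between the three models reproducible.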
6) Conceptual question: Suppose you have null and alternative hypotheses that are completely defined in terms of the specific probability distributions they represent. What is the main difference between using a likelihood ratio test and using Bayes' rule to decide between the two? (20 points)
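The mechanics of the two calculations in problem 6 can be put side by side in a short sketch. Everything specific here is an assumption made for illustration: the two normal distributions, the observations, and the prior probabilities are all invented. The sketch shows only what each calculation takes as input, not which one is preferable.

```python
from statistics import NormalDist

# Hypothetical fully specified hypotheses, chosen only for illustration.
h0 = NormalDist(mu=0.0, sigma=1.0)
h1 = NormalDist(mu=1.0, sigma=1.0)


def likelihood_ratio(data):
    """L(H1) / L(H0) for independent observations; uses no prior."""
    ratio = 1.0
    for x in data:
        ratio *= h1.pdf(x) / h0.pdf(x)
    return ratio


def posterior_h1(data, prior_h1):
    """P(H1 | data) from Bayes' rule; requires a prior probability for H1."""
    lr = likelihood_ratio(data)
    return lr * prior_h1 / (lr * prior_h1 + (1 - prior_h1))


data = [0.9, 1.4, 0.2, 1.1]  # made-up observations
lr = likelihood_ratio(data)
print(f"likelihood ratio = {lr:.3f}")
for prior in (0.5, 0.1, 0.01):
    post = posterior_h1(data, prior)
    print(f"prior P(H1) = {prior}: posterior P(H1 | data) = {post:.3f}")
```

The likelihood ratio is the same on every line of output, while the posterior changes with the prior that is supplied.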