Write your name and section number at the top right of this page:
Class            Section Number
Bret 8:50        001
Sahifa 1:20      003
Miranda 9:55     004
Sahifa 3:30      005
Sahifa 8:50      006
Cameron 1:20     007
As you complete the exam, write your initials at the top right of each other page.
If you need more room, there is a blank page at the end of the exam, or we can give you some scratch paper.
Some multiple choice questions are “Select ONE” while others are “Select ALL that apply”. Pay attention to the question type and only mark one option if it says “Select ONE”. Fill in the circles completely.
1. A dataframe lions has 32 rows, each representing a unique lion.
It has two numeric columns: age, each lion’s age in decimal years, and proportion.black, the percentage of that lion’s nose which is black.
Finally, it has a logical column adult, which is TRUE if age is greater than 3 and FALSE otherwise.
Consider the following code which attempts to make a scatter plot of age on the x axis vs. proportion.black on the y axis.
Which of the following statements are true about errors in the above code? Select ALL that apply.
fill = adult must be changed to color = adult to color the points differently.
alpha = 0.5 must be moved outside of aes(), like size = 2 currently is.
size = 2 must be moved within aes(), like alpha = 0.5 is.
fill = adult must be moved to the aes() in the ggplot() call, not geom_point().
2. Using the same dataframe as Question 1, consider the following code which creates a new dataframe, lions_summary.
Which of the following statements are true about lions_summary? Select ALL that apply.
lions_summary has 2 rows.
lions_summary has 2 columns. (False: it has three columns: adult, averageAge, and averagePropBlack.)
lions_summary has a column called ‘adult’.
If there were any NA values in the age column of lions, all the values of averageAge in lions_summary will be NA. (Sneakily false: if the NA values fall in only one category of adult, the average for the other category can still be computed.)
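The NA behavior in the last option can be checked with a minimal sketch. The exam's summarizing code is not reproduced here, so this assumes it takes a group mean of age by adult without na.rm = TRUE. The data frame lions_toy and its values are hypothetical, not the exam's data.

```r
# Hypothetical toy data (NOT the exam's lions dataframe): one NA among the non-adults
lions_toy <- data.frame(
  age   = c(1.2, 2.5, NA, 4.0, 5.5, 7.1),
  adult = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Group means of age by adult; mean() returns NA only for a group that contains an NA
average_age <- tapply(lions_toy$age, lions_toy$adult, mean)
print(average_age)
```

Only the FALSE group comes back as NA here; the TRUE group still has a usable average.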
3. A random viewer’s ideal movie length, in minutes, is approximately X ∼ N(120,15). Actual movie lengths are given by Y ∼ N(143,19).
(a) Which R code below calculates the probability that a random movie is shorter than the 40th percentile for ideal movie length? Select ONE.
qnorm(0.4, 120, 15) %>% pnorm(143, 19)
qnorm(0.4, 143, 19) %>% pnorm(120, 15)
pnorm(0.4, 120, 15) %>% qnorm(143, 19)
pnorm(0.4, 143, 19) %>% qnorm(120, 15)
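One way to reason through part (a) is to split the calculation into two steps: first find the 40th percentile of the ideal length X, then ask how likely an actual length Y is to fall below it. The sketch below assumes the second parameter of each N(·,·) is a standard deviation, as the answer options treat it. The variable names are illustrative.

```r
# Step 1: 40th percentile of ideal movie length, X ~ N(120, 15)
ideal_cutoff <- qnorm(0.4, mean = 120, sd = 15)

# Step 2: probability that an actual movie length, Y ~ N(143, 19), falls below that cutoff
prob_shorter <- pnorm(ideal_cutoff, mean = 143, sd = 19)

print(ideal_cutoff)
print(prob_shorter)
```

Written without the pipe, the quantile function uses X's parameters and the probability function uses Y's.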
(b) What movie length y corresponds to a z-score of 1? Select ONE.
19
143
124
162 (143 + 19 = 162)
4. Your friend claims that a random variable, X, follows the following probability distribution:
Which of the following are true about your friend’s claim? Select ONE.
This distribution has negative values of x, therefore X is not a random variable.
This distribution has negative values of P(X = x), therefore X is not a random variable.
The area under this distribution is not 1, therefore X is not a random variable.
X is a valid random variable.
5. Consider ordering a “footlong” (12 inch) sub from Subway.
Consider trying to estimate µ, the average length of the sub you receive when you order a 12 inch sub.
You do this by ordering 40 footlong subs and measuring their lengths in inches. (What a great problem this is.) You observe that the average length of your 40 subs is 11.5 inches, and the standard deviation of their lengths is 0.4 inches.
Consider testing H0 : µ = 12 vs. Hα : µ < 12.
Write a numeric expression (only containing numbers and arithmetic symbols like + and ×) for the test statistic of this test.
$$\frac{\bar{x} - \mu_{\text{null}}}{s_x / \sqrt{n}}$$
Answer:
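As a check on the arithmetic, the formula above can be evaluated directly with the numbers given in the problem (sample mean 11.5, null value 12, standard deviation 0.4, n = 40). This is a minimal sketch with illustrative variable names.

```r
x_bar   <- 11.5  # observed sample mean, in inches
mu_null <- 12    # value of mu under H0
s_x     <- 0.4   # sample standard deviation
n       <- 40    # number of subs measured

# Test statistic: (x_bar - mu_null) / (s_x / sqrt(n))
t_stat <- (x_bar - mu_null) / (s_x / sqrt(n))
print(t_stat)
```

The statistic is negative because the sample mean falls below the null value of 12.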
6. The fictitious distribution of the number of children in a household (X) is given below.
x           0     1     2     3
P(X = x)    ?     0.4   0.25  0.15
(a) Which of the following statements about X is true? Select ONE.
X is a discrete RV.
X is a binomial RV.
X is a continuous RV.
X is a normal RV.
(b) What value of P(X = 0) makes this a valid probability distribution? Select ONE.
0.1
0.15
0.2
0.25
(c) What is the median number of children per household? Select ONE.
0
1
2
Not enough information to determine
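Parts (b) and (c) can be worked through with a short sketch: the probabilities must sum to 1, and the median is the smallest x whose cumulative probability reaches 0.5. The variable names are illustrative.

```r
x     <- 0:3
known <- c(0.4, 0.25, 0.15)   # P(X = 1), P(X = 2), P(X = 3) from the table

# (b) The missing probability is whatever brings the total to 1
p0    <- 1 - sum(known)
probs <- c(p0, known)

# (c) Median: smallest x with cumulative probability of at least 0.5
cdf      <- cumsum(probs)
median_x <- x[which(cdf >= 0.5)[1]]

print(p0)
print(median_x)
```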
7. Data are collected on how long a lake’s surface is frozen each winter over a period of 103 consecutive years.
The correlation coefficient between year and freeze duration is r = −0.4.
The year variable has a mean x̄ = 1950 and a standard deviation sx = 30.
The freeze duration variable has a mean ȳ = 90 and a standard deviation sy = 17.
Consider fitting a simple linear regression model to this data.
Write a numerical expression (using only numbers and arithmetic symbols like + or ×) for the predicted freeze duration in the year 1890. No need to simplify.
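Answer: (1890 is two x standard deviations below the mean of x, so the predicted value is r × (−2) y standard deviations from the mean of y.)
90 + (−0.4) × (−2) × 17
You could also take the long way:
$$\hat{\beta}_1 = r \times \frac{s_y}{s_x} = (-0.4) \times \frac{17}{30}, \qquad \hat{\beta}_0 = 90 - \hat{\beta}_1 \times 1950, \qquad \hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1 \times 1890$$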
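Both routes give the same prediction, which a short sketch can confirm. It uses only the summary statistics stated in Problem 7; the variable names are illustrative.

```r
r     <- -0.4
x_bar <- 1950; s_x <- 30
y_bar <- 90;   s_y <- 17
x_new <- 1890

# Short way: move r * (z-score of x) standard deviations of y away from y_bar
z_x        <- (x_new - x_bar) / s_x
pred_short <- y_bar + r * z_x * s_y

# Long way: compute slope and intercept, then plug in x_new
slope     <- r * s_y / s_x
intercept <- y_bar - slope * x_bar
pred_long <- intercept + slope * x_new

print(c(pred_short, pred_long))
```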
8. We continue to consider the lake freezing data from Problem 7.
A 95% confidence interval for the freeze duration of the true regression line at x = 1890 has the form:
ŷ1 ± c × SEC
A 95% prediction interval for the duration that the lake was frozen in the single year 1890 has the form:
ŷ1 ± c × SEP
where c is a critical value from some distribution and SEC, SEP are some positive values. Reminder: There are 103 years in the dataset.
Which R code calculates the numeric critical value c? Select ONE.
qnorm(0.95)
qnorm(0.975)
qt(0.95, 101)
qt(0.975, 101)
qt(0.95, 102)
qt(0.975, 102)
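The reasoning behind the critical value can be sketched as follows, assuming the usual simple linear regression setup: two estimated coefficients leave n − 2 degrees of freedom, and a two-sided 95% interval puts 2.5% in each tail. The variable names are illustrative.

```r
n          <- 103        # years in the dataset
df         <- n - 2      # intercept and slope are both estimated
conf_level <- 0.95

# Two-sided interval: the upper-tail quantile sits at 1 - (1 - conf_level) / 2
crit <- qt(1 - (1 - conf_level) / 2, df = df)
print(crit)
```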
9. Using the expressions from the previous problem, circle the true relationship between SEC and SEP and briefly explain why.
SEC < SEP SEC = SEP SEC > SEP
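Answer: SEC < SEP. A confidence interval for the height of the regression line is narrower than a prediction interval at the same x value, because the prediction interval must also account for the scatter of an individual observation around the line. The two intervals share the same center and critical value, so it must be that SEC < SEP.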
10. Consider trying to estimate the proportion of UW-Madison undergraduate students who will graduate at the end of this semester, p.
You ask 50 random students, and 7 of them will graduate at the end of the semester.
Consider using this information to calculate a confidence interval for p.
Which of the following statements are true? Select ALL that apply.
If we were to calculate an Agresti-Coull confidence interval for p, its center would be (7 + 1)/(50 + 2). (False: It would be (7 + 2)/(50 + 4).)
If we were to calculate a Wald confidence interval for p, its center would be 7/50.
The upper bound of the Agresti-Coull confidence interval would be greater than the upper bound of the Wald confidence interval. (Its center is shifted towards 0.5.)
As we increase the confidence level of our interval towards 100%, our interval would widen and approach [0, 1].
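The first three statements can be checked numerically. The sketch below computes both intervals at the 95% level, using the single-proportion Agresti-Coull adjustment described above (add 2 successes and 4 trials). The variable names are illustrative.

```r
x <- 7
n <- 50
z <- qnorm(0.975)   # critical value for a 95% interval

# Wald interval: centered at the sample proportion
p_wald  <- x / n
se_wald <- sqrt(p_wald * (1 - p_wald) / n)
wald    <- p_wald + c(-1, 1) * z * se_wald

# Agresti-Coull interval: centered at (x + 2) / (n + 4)
p_ac  <- (x + 2) / (n + 4)
se_ac <- sqrt(p_ac * (1 - p_ac) / (n + 4))
ac    <- p_ac + c(-1, 1) * z * se_ac

print(rbind(wald, ac))
```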
11. We are interested in comparing the proportions of individuals with bachelor’s degrees among adult men aged 25 and older in the states of Minnesota and Wisconsin. Let pM and pW represent these two population proportions. In random samples of nM = 300 Minnesota men and nW = 400 Wisconsin men aged 25 and older, the numbers of individuals with bachelor’s degrees are xM = 90 and xW = 100.
Without simplification, write a numerical expression using provided data for the standard error SEci used in a 95% confidence interval for pM − pW of the form
(point estimate) ± zcrit × SEci
when using the Agresti-Coull method.
Answer:
SEci
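A minimal sketch of this standard error follows. It assumes the common two-sample Agresti-Coull convention of adding one success and one failure to each sample, which should be checked against the course's own definition. The variable names are illustrative.

```r
x_M <- 90;  n_M <- 300
x_W <- 100; n_W <- 400

# ASSUMPTION: two-sample adjustment adds 1 success and 1 failure (2 trials) per sample
p_tilde_M <- (x_M + 1) / (n_M + 2)
p_tilde_W <- (x_W + 1) / (n_W + 2)

# Standard error of the difference, built from the adjusted proportions and sample sizes
se_ci <- sqrt(p_tilde_M * (1 - p_tilde_M) / (n_M + 2) +
              p_tilde_W * (1 - p_tilde_W) / (n_W + 2))
print(se_ci)
```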
12. In the setting of the problem above, write a numerical expression using provided data for the standard error SEht of the test statistic Z:
$$Z = \frac{\hat{p}_M - \hat{p}_W}{SE_{ht}} \sim N(0, 1)$$
Answer:
SEht
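For the hypothesis test, a sketch under the usual assumption that the null hypothesis is pM = pW, so the standard error uses a single pooled proportion. The variable names are illustrative.

```r
x_M <- 90;  n_M <- 300
x_W <- 100; n_W <- 400

# ASSUMPTION: under H0 (pM = pW), pool both samples into one estimated proportion
p_pool <- (x_M + x_W) / (n_M + n_W)

# Standard error of the difference in sample proportions under H0
se_ht <- sqrt(p_pool * (1 - p_pool) * (1 / n_M + 1 / n_W))

# Resulting test statistic
z_stat <- (x_M / n_M - x_W / n_W) / se_ht
print(c(se_ht, z_stat))
```

The pooled version gives a test statistic that rounds to the z = 1.47 quoted in the next problem.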
13. Suppose that the test statistic from the previous problem has the value z = 1.47 and that the p-value from a two-sided test is 0.14.
Which of the following statements are true? Select ALL that apply.
The test is statistically significant at an α = 0.05 level.
A 95% confidence interval for pM − pW will contain the value 0. (insignificant p-value)
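The quoted p-value can be reproduced from the test statistic with a short sketch, assuming the standard normal reference distribution stated in Problem 12. The variable names are illustrative.

```r
z_stat <- 1.47
alpha  <- 0.05

# Two-sided p-value: twice the area in the tail beyond |z|
p_value <- 2 * pnorm(-abs(z_stat))

print(p_value)
print(p_value < alpha)   # TRUE would mean significant at the 5% level
```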
14. You ask a group of 30 people from Country A to each privately give you a random number between 1 and 1000.
You then ask a group of 40 people from Country B to each privately give you a random number between 1 and 1000.
Let µA be the true average value that all members of country A would give upon this request, and similarly for µB. You are interested in the quantity µA − µB.
Which of the following statements are true about how to approach this problem through statistical inference? Select ONE.
Two-sample means inference is impossible to conduct because the sample sizes are different.
Two-sample means inference is impossible to conduct because the random variables are technically discrete.
Two-sample inference is more appropriate than paired-sample inference in this case.
Paired-sample inference is more appropriate than two-sample inference in this case.
15. Assume the correct value of the test statistic from problem 14 is -50. Which R code correctly calculates the p-value for this test? Select ONE.
pt(-50, df = W)
2*pt(-50, df = W)
2*pt(abs(-50), df = W)
1 - pt(-50, df = W)