Probability in Computing Spring 2017
Lab 5 – Hypothesis testing and linear regression.
Due: Wednesday May 3rd 11.59PM in PDF form via Websubmit.
Note that the whole coding part of this lab must be done in R.
1 What to submit (MANDATORY)
1. PDF file (lab5 firstname lastname.pdf) including plots and a snapshot of the code used to answer
the questions.
Names of the collaborators.
Number of late days for this assignment.
Number of late days so far.
References used
2. R script (lab5 firstname lastname.R) with the code used.
Failing to meet any of the above requirements will cause a decrease of your grade.
2 Background
2.1 Confidence interval for the population mean
In class we learned that, according to the central limit theorem, the distribution of the sample mean $\bar{X}_n$
is approximately normal with mean $\mu$ (the population mean) and standard deviation
$\sigma/\sqrt{n}$ (where $\sigma$ is the population standard deviation). For a random variable with a normal distribution,
the probability that its value is within 2 standard deviations of its mean is about 0.95. If there
is a certain distance between the sample mean (recall that $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$) and the
population mean, we can describe that distance by starting at either value. So, if the sample mean $\bar{X}_n$
falls within a certain distance of the population mean $\mu$, then the population mean falls within the same
distance of the sample mean. Therefore, the statement "There is a 95% chance that the sample mean $\bar{X}_n$
falls within 2 standard deviations of $\mu$" can be rephrased as: "We are 95% confident that the population
mean $\mu$ falls within 2 standard deviations of $\bar{X}_n$." Here a standard deviation means the standard
deviation of the sample mean, $\sigma/\sqrt{n}$. This second statement is exactly the interpretation
of the confidence interval. Similarly, if our hypothesis is that the population mean is equal to $\mu_0$, and $\mu_0$ is
within 2 standard deviations of $\bar{X}_n$, we say that the hypothesis is not rejected at a significance level
of $\alpha = 0.05$.
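The "95% confident" reading above can be checked with a small simulation. The sketch below is only an illustration: the population mean, standard deviation, sample size, and number of repetitions are made-up values, and the population is assumed to be normal.

```r
# Coverage check for the interval Xbar_n +/- 2 * sigma / sqrt(n).
# All constants are hypothetical and chosen only for illustration.
set.seed(1)
mu    <- 10      # population mean (known here only because we simulate)
sigma <- 3       # population standard deviation
n     <- 25      # sample size
reps  <- 10000   # number of simulated samples

# Draw `reps` samples of size n and keep each sample mean.
xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# An interval covers mu exactly when the sample mean is within the half-width of mu.
half_width <- 2 * sigma / sqrt(n)
coverage   <- mean(abs(xbar - mu) <= half_width)
print(coverage)
```

The printed fraction should be close to 0.95, since about 95% of the simulated intervals contain the population mean.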
Definition:
Given a sample of size $n$, under the assumption that we know the population standard deviation $\sigma$, the two-sided
confidence interval for the population mean is computed as follows:
$$\bar{X}_n \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \tag{1}$$
where $\bar{X}_n$ is the sample mean and $z_{\alpha/2}$ is a multiplier that depends on the level of significance $\alpha$.
Some important values of $z_{\alpha/2}$ are:
For $\alpha = 0.1$ (90% confidence interval), $z_{\alpha/2} = 1.645$
For $\alpha = 0.05$ (95% confidence interval), $z_{\alpha/2} = 1.96$
For $\alpha = 0.01$ (99% confidence interval), $z_{\alpha/2} = 2.576$
Note that if we want to compute a one-sided confidence interval, then we have to use $z_{\alpha}$ instead. This is because
in the case of one-sided intervals we are interested only in the lower value (when the alternative hypothesis
is "greater than") or the upper value (when the alternative hypothesis is "less than").
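The multipliers listed above can be reproduced with R's standard normal quantile function qnorm. The sketch below also applies equation (1) to a made-up sample summary; the values of xbar, sigma and n are assumptions for illustration, not lab data.

```r
# Multipliers from the standard normal quantile function.
alpha       <- c(0.10, 0.05, 0.01)
z_two_sided <- qnorm(1 - alpha / 2)   # z_{alpha/2}, used for two-sided intervals
z_one_sided <- qnorm(1 - alpha)       # z_{alpha}, used for one-sided intervals
print(round(z_two_sided, 3))
print(round(z_one_sided, 3))

# Hypothetical sample summary (not taken from the lab data).
xbar  <- 52
sigma <- 5
n     <- 25

# Two-sided 95% interval, equation (1).
ci_two_sided <- xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)

# One-sided 95% lower bound (alternative hypothesis "greater than").
lower_bound <- xbar - qnorm(0.95) * sigma / sqrt(n)
print(ci_two_sided)
print(lower_bound)
```

The one-sided bound sits closer to the sample mean than the two-sided lower limit because all of $\alpha$ is placed in a single tail.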
Figure 1: Different types of confidence intervals; for all three figures $\alpha = 0.10$.
2.2 Hypothesis testing for population mean
Recall that there are basically 4 steps in the process of hypothesis testing:
1. State the null ($H_0$) and alternative ($H_1$) hypotheses.
2. Collect relevant data from a random sample and summarize them (using a test statistic).
3. Find the p-value: the probability of observing data at least as extreme as those observed, assuming that $H_0$ is true.
4. Based on the p-value, decide whether we have enough evidence to reject $H_0$ (and accept $H_1$), and
draw our conclusions in context. To make a decision we have to choose a significance level. In this lab,
unless explicitly stated otherwise, we will use a 0.05 significance level.
Assume that $\mu$ is our population mean. Note that the null hypothesis always takes the form $H_0: \mu = \mu_0$
(where $\mu_0$ is some value). The test can take one of the following three forms, depending on what
our alternative hypothesis is:
1. $H_1: \mu > \mu_0$ (right-tailed test)
2. $H_1: \mu < \mu_0$ (left-tailed test)
3. $H_1: \mu \neq \mu_0$ (two-tailed test)
In hypothesis testing we have to distinguish between two cases: 1) the case where the population standard
deviation ($\sigma$) is known, and 2) the case where $\sigma$ is unknown. In the first case the test we will use is called the
z-test for the population mean $\mu$. In the second case, the test is called the t-test for the population
mean $\mu$.
In the first case, the test statistic will have a standard normal (z) distribution (when $H_0$ is true), and in the
second case, the test statistic will have a t-distribution (when $H_0$ is true).
3 z-test for the population mean $\mu$ ($\sigma$ is known)
3.1 Learning example
The SAT is constructed so that scores in each portion have a national average of 500 and standard deviation
of 100. The distribution is close to normal. The dean of students of Ross College suspects that in recent
years the college attracts students who are more quantitatively inclined. A random sample of 4 students from
a recent entering class at Ross College had an average math SAT (SAT-M) score of 550. Does this provide
enough evidence for the dean to conclude that the mean SAT-M of all Ross College students is higher than
the national mean of 500? Assume that the scores of all Ross College students are also normally distributed
with a standard deviation of 100.
1. State the null and alternative hypotheses.
When we discussed probability models based on sampling distributions, we concluded that the sample mean,
$\bar{X}_n$, is a random variable with the following properties:
- The mean is the same as the population mean, $\mu$.
- The standard deviation is $\sigma/\sqrt{n}$, where $\sigma$ is the standard deviation of the population.
- The sample means are normally distributed if the underlying variable being sampled is normally distributed
in the population or the sample size is large enough to guarantee approximate normality.
Recall that this last statement is the Central Limit Theorem. As a general guideline, if $n > 30$, the
Central Limit Theorem applies and we can use the normal distribution to model the distribution of
$\bar{X}_n$.
Based on this description of the sampling distribution of the sample mean $\bar{X}_n$, we can define a test statistic
that measures the distance between the hypothesized value of $\mu$ (denoted $\mu_0$) and the sample mean
(determined by the data) in standard deviation units. The test statistic is:
$$Z_n = \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} \tag{2}$$
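As a quick illustration of equation (2), the sketch below computes the z statistic and a right-tailed p-value in R. The inputs are made up and deliberately differ from the SAT and EPA figures used in the lab questions; pnorm with lower.tail = FALSE returns the upper-tail probability.

```r
# Hypothetical inputs (not the SAT or EPA numbers from this lab).
xbar  <- 52    # sample mean
mu0   <- 50    # value of the mean under H0
sigma <- 5     # known population standard deviation
n     <- 25    # sample size

# Equation (2): distance of xbar from mu0 in units of sigma / sqrt(n).
z <- (xbar - mu0) / (sigma / sqrt(n))

# Right-tailed p-value Pr(Z >= z); pnorm returns the lower tail by default.
p_right <- pnorm(z, lower.tail = FALSE)
print(c(z = z, p = p_right))
```

With these numbers the standard deviation of the sample mean is 5 / 5 = 1, so the sample mean lies two such units above the null value.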
Comments
Note that our test statistic (because it is a z-score) tells us how far $\bar{X}_n$ is from the null value $\mu_0$,
measured in standard deviations. Since $\bar{X}_n$ represents the data and $\mu_0$ represents the null hypothesis,
the test statistic is a measure of how different our data are from what is claimed in the null hypothesis.
The larger the test statistic in the direction of $H_1$, the more evidence we have against $H_0$, since what we saw in our data is
very different from what $H_0$ claims.
All inference procedures are based on probability. We are trying to determine if our sample results
are likely or unlikely based on our assumptions about the population. This requires that we have a
probability model that describes the long-term behavior of sample results that are randomly collected
from a population that fits our hypothesis. For this reason, the Central Limit Theorem gives us criteria
for deciding if the z-test for the population mean can be used. We need to verify:
1. The sample is random (or at least can be considered as random in context).
2. We are in one of the three situations marked with YES in the following table:
Conditions: z-test for a population mean                      | Small sample size ($n \le 30$) | Large sample size ($n > 30$)
Variable $x_i$ in the population from normal distribution     | YES                            | YES
Variable $x_i$ in the population not from normal distribution | NO                             | YES
3. If the conditions are met, then values of $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$ vary normally, or at least close enough
to normally to use a normal model to calculate probabilities. When $\bar{X}_n$ values are normal, then the
z-scores will be normally distributed with a mean of 0 and a standard deviation of 1.
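The large-sample row of the table can be explored with a simulation. The sketch below draws samples from a clearly non-normal population and checks that the z-scores of the sample means behave roughly like a standard normal variable. The exponential population, the sample size of 40, and the number of repetitions are illustrative assumptions.

```r
# Simulation sketch: z-scores of sample means from a non-normal population.
# The exponential population and all constants are illustrative assumptions.
set.seed(42)
n     <- 40         # "large" sample by the n > 30 guideline
reps  <- 5000       # number of simulated samples
rate  <- 1          # Exp(1) has mean 1 and standard deviation 1
mu    <- 1 / rate
sigma <- 1 / rate

xbar <- replicate(reps, mean(rexp(n, rate = rate)))
z    <- (xbar - mu) / (sigma / sqrt(n))

z_mean <- mean(z)
z_sd   <- sd(z)
# Fraction of z-scores within 1.96 of zero; near 0.95 if the normal model fits.
within_196 <- mean(abs(z) <= 1.96)
print(c(mean = z_mean, sd = z_sd, within = within_196))
```

Rerunning with a much smaller n (for example n = 4) is one way to see why the table marks the small-sample, non-normal case with NO.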
Now let's get back to our SAT example.
2. Can we use the z-test to do our analysis? Hint: recall the conditions we have to check.
3. What is the value of the sample mean $\bar{X}_n$?
4. What is the value of the population standard deviation $\sigma$?
5. What is the value of the sample size $n$?
6. Compute the z-statistic and explain how one should interpret the result.
7. Find the p-value of the test using the normal table (http://www.normaltable.com/). Hint: Recall that
the p-value when $H_1$ is "greater than" (right-tailed z-test) is $\Pr(Z \ge z)$. The normal table shows $\Pr(Z < z)$.
8. Suppose we reject the null hypothesis if our results are significant at the 5% level. Can we reject the null
hypothesis given the p-value we obtained?
9. What would be the minimum sample size we need to reject the null hypothesis at the 5% significance level
(95% confidence)? Hint: first you have to find the z value for which the p-value is $\le 0.05$. Then you can compute the $n$ needed.
10. Now let's verify our results with code. For this we are going to use R. Create a function called
significance that, given as input the sample size $n$, the population standard deviation $\sigma$, the population
mean $\mu$ and the sample mean $\bar{X}_n$, computes z and the p-value. Hint: in R the function pnorm computes
$\Pr(Z < z)$. To create a function in R, write function_name = function(parameters).
Submit your code.
11. Execute the significance function for increasing values of the sample size (starting with $n = 4$ and incrementing
by 1 every time) until the results are statistically significant, i.e., p-value $\le 0.05$. Provide a results table
with the following 4 columns: n, z (test statistic), p-value and significant (yes/no). What is the
minimum sample size for which we can reject the null hypothesis? Using R you can test all the values of n
from 4 to 14 by entering significance(4:14). Submit your code and table.
3.2 Problem
Every year, the Environmental Protection Agency (EPA) collects data on fuel economy (randomly sampling
from the entire population). With rising gasoline prices, consumers are using these figures as they decide
which automobile to purchase. We will look at two-seater automobiles, many of which are sporty vehicles.
Based upon the latest 2017 EPA sample, we wish to test the hypothesis that the combined city and highway
miles per gallon (mpg) of two-seater automobiles is greater than 20. The standard deviation for all vehicles
is 4.7 mpg. The dataset containing the data is epa.csv and the column you are interested in is COMB.MPG.
12.
State the null and alternative hypotheses.
13. Have the conditions that allow us to safely use the z-test been met?
14. Compute the test statistic and the p-value using the normal table (http://www.normaltable.com/).
15. Extend the function you wrote for question 10 so that, given as input
- the sample size $n$
- the population standard deviation $\sigma$
- the population mean $\mu$
- the sample mean $\bar{X}_n$
- alternative: either less, greater or two.sided, indicating the form of the alternative hypothesis
it computes and outputs the sample mean, sample size, z and the p-value. Hint: recall that for the two-sided test
the p-value is $2\Pr(Z \ge |z|)$.
Submit your code.
16. Provide the output of the function and verify that it matches the theoretical values computed above.
17. Draw conclusions based on the context of the problem.
18. Compute the one-sided confidence interval for $\alpha = 0.05$ (95% confidence interval). Provide both upper
and lower values.
19. Use R to plot the confidence interval computed above. Recall that we assume that the sample mean
$\bar{X}_n$ is normally distributed. Therefore you need to create a normal variable with mean equal to the
sample mean and standard deviation equal to the population standard deviation and plot its probability
density function. Then to the same figure add two vertical lines corresponding to the lower and upper
confidence interval computed above. Hint: the function abline is used to add vertical or horizontal reference
lines to a plot in R. Submit your code and plot.
20. What would be the minimum sample size we need to reject the null hypothesis at the 5% significance level
(95% confidence)?
3.3 Relating Hypothesis Tests and Confidence Intervals
Suppose we want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$ using a significance level of $\alpha = 0.05$.
An alternative way to perform this test is to find a 95% confidence interval for $\mu$ and make the following conclusions:
- If $\mu_0$ falls outside the confidence interval, reject $H_0$.
- If $\mu_0$ falls inside the confidence interval, do not reject $H_0$.
21. Compute the one-sided confidence interval for the SAT problem for $\alpha = 0.05$. Provide both upper and
lower values of the confidence interval.
22. Does $\mu_0$ (the population mean) fall outside or inside the confidence interval?
23. Now compute the confidence interval assuming $n = 11$.
24. Does $\mu_0$ (the population mean) fall outside or inside the confidence interval?
4 t-test for the population mean $\mu$ ($\sigma$ is unknown)
Unfortunately, only in a few cases is it reasonable to assume that the population standard deviation ($\sigma$) is
known. What can we use to replace $\sigma$? If you don't know the population standard deviation, the best you
can do is find the sample standard deviation $S$, whose formula is $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X}_n)^2}$, and use it
instead of $\sigma$. In doing so we also have to change the test we use in the hypothesis testing, which is now the
t-test. The conditions under which we can apply the t-test are the same as those stated for the z-test (see Table 2).
The test statistic for the t-test is defined as:
$$t = \frac{\bar{X}_n - \mu_0}{S/\sqrt{n}} \tag{3}$$
In the denominator we are using $S$ instead of $\sigma$. This change has an effect on the distribution of the t-test
statistic, which now does not follow a normal distribution. Instead it follows a distribution called the t
distribution or Student's t distribution.
The t distribution has slightly less area near the expected central value than the normal distribution does,
and correspondingly more area in the tails.
Therefore, the t distribution ends up being the appropriate model in certain cases where there is more
variability than would be predicted by the normal distribution.
There are actually many different t distributions. The particular form of the t distribution is determined by
its degrees of freedom.
The degrees of freedom refers to the number of independent observations in a set
of data. When estimating a mean score or a proportion from a single sample, the number of independent
observations is equal to the sample size minus one. This is important when we want to compute the p-values
for our hypothesis testing exercise: if the sample size is $n = 10$, we will compute the p-value using $t(n-1) = t(9)$.
25. In order to compare the normal distribution with the t-distribution, plot the density of a normal
distribution for 100 values in the range $[-4, 4]$. Then, to the same plot, add the density function of a
t-distribution for the following values of degrees of freedom: df = {1, 3, 8, 30}. Submit your code and
plot.
26. What happens to the t-distribution when we increase the degrees of freedom?
4.1 Problem
We are going to use the SAT problem we analyzed in the previous section but with a little modification.
Now we don't know $\sigma$. Instead we will use the sample standard deviation $S$ (which we can compute from
the sample) as an approximation for $\sigma$. This change implies that the z-test is no longer appropriate and we
need to use the t-test.
27. Can we use the t-test to do our analysis? Hint: recall the conditions we have to check (see Table 2).
28. How many degrees of freedom do we have?
29. Given $S = 100$, compute the t-statistic and explain how one should interpret the result.
30. Find the p-value of the test using R. Hint: the function is called pt. Recall that the p-value when $H_1$ is "greater
than" (right-tailed t-test) is $\Pr(T \ge t)$; pt by default computes $\Pr(T < t)$.
31. Is the p-value for the t-test larger or smaller than the p-value we computed with the z-test? Is it
surprising?
32. Suppose we reject the null hypothesis if our results are significant at the 5% level. Can we reject the null
hypothesis given the p-value we obtained?
33.
Compute the 95% one-sided confidence interval ($\alpha = 0.05$). In order to compute it you need to find
the t statistic value $t_{\alpha}$. Provide both lower and upper values of the interval. Is the confidence interval
wider than the one computed using the population standard deviation in part 21? Why? Hint: the R
function is qt and it computes the t value for a one-sided t-test. You need to use $S = 100$ in the confidence interval
formula since we do not have $\sigma$.
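For reference, the t statistic of equation (3) and its right-tailed p-value can be sketched in R on raw data and cross-checked against the built-in t.test function. The eight observations and the null value below are made up for illustration and are unrelated to the SAT problem.

```r
# Hypothetical sample; these eight values are made up for illustration.
x   <- c(5.1, 4.9, 5.6, 5.8, 5.2, 4.7, 5.9, 5.4)
mu0 <- 5          # hypothesized mean under H0

n <- length(x)
S <- sd(x)        # sample standard deviation (divides by n - 1)

# Equation (3) and the right-tailed p-value with n - 1 degrees of freedom.
t_stat  <- (mean(x) - mu0) / (S / sqrt(n))
p_right <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Cross-check against R's built-in one-sample t-test.
check <- t.test(x, mu = mu0, alternative = "greater")
print(c(t = t_stat, p = p_right))
print(check$parameter)   # degrees of freedom used by t.test
```

Computing the statistic by hand and then comparing with t.test is a cheap way to catch mistakes in the degrees of freedom or in the direction of the tail.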