Your solutions to theoretical questions should be done in Markdown/MathJax directly below the associated question. Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions. Remember that you are encouraged to discuss the problems with your instructors and classmates, but you must write all code and solutions on your own.
NOTES:
- Do NOT load or use any Python packages that are not available in Anaconda 3.6.
- Some problems with code may be autograded. If we provide a function API, do not change it. If we do not provide a function API, then you're free to structure your code however you like.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do Cell → Run All as a check before submitting your solutions. That way, if we need to run your code, you will know that it will work as expected.
- Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc.
- This should go without saying, but for any question that asks you to calculate something, you must show all work to receive credit. Sparse or nonexistent work will receive sparse or nonexistent credit.
Shortcuts: Problem 1 | Problem 2 | Problem 3 | Problem 4
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
%matplotlib inline
[30 points] Problem 1 Sea-level rise, schmee-level rise!
You have been contacted by the local government of Key West, Florida, to assess whether there is statistical evidence for sea-level rise in the area. You obtain, from the University of Hawaii Sea Level Center's gigantic repository of sea-level data, the daily mean sea levels file linked here and below.
In this problem, you will:
- practice calculating confidence intervals,
- practice wrangling a real-life data set into a form where you can actually compute these confidence intervals, because life will rarely be so kind as to simply hand you a nicely packaged and cleaned set of data, and
- save Key West from a watery fate?
In [ ]:
# Local and web paths to the data; pick which works for you.
local_path = "data/sealevel_keywest.csv"
web_path = "https://raw.githubusercontent.com/dblarremore/csci3022/master/homework/homework5/data/sealevel_keywest.csv"
file_path = local_path
dfSL = pd.read_csv(file_path, header=None)
dfSL.rename(columns={0: 'Year', 1: 'Month', 2: 'Day', 3: 'SL'}, inplace=True)
dfSL.head()
Part A: Write a function clean_data to:
- take in a single argument of a raw sea level data frame (e.g., dfSL above),
- compute the fill-value used to replace missing sea level (SL) data (not hard-coded!),
- use the Pandas DataFrame.dropna method to remove all rows with missing data,
- select only the data point on the second day of each month, and
- return a cleaned Pandas data frame.
Use your shiny new function to clean the dfSL data frame and save the results in a new data frame.
There is a very specific reason to sample only one daily data point per month. We will talk about it later.
In [ ]:
def clean_data(df):
    # your code goes here!
    dfClean = df
    return dfClean

dfClean = clean_data(dfSL)
dfClean.head()
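A minimal sketch of one possible approach is below. The sentinel heuristic (treating the extreme minimum of the SL column as the fill value) is an assumption for illustration only; verify it against the raw file rather than trusting it blindly.
In [ ]:
def clean_data_sketch(df):
    # ASSUMPTION: missing sea levels are coded with a large negative
    # sentinel, so the fill value shows up as the column minimum.
    fill_value = df['SL'].min()
    dfClean = df.copy()
    # Replace the sentinel with NaN, then drop those rows.
    dfClean.loc[dfClean['SL'] == fill_value, 'SL'] = np.nan
    dfClean = dfClean.dropna()
    # Keep only the observation from the second day of each month.
    return dfClean[dfClean['Day'] == 2]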
Part B: Plot the cleaned time series of sea levels. Be sure to label your axes, including units. The UHSLC data portal includes a link to the metadata accompanying our data set; if you are not sure about units, that would be a good place to start looking. For the $x$-axis, place the tick marks on January 2 of each year that is divisible by 10 (i.e., 1920, 1930, ...), and label with that year. You may need to do additional processing in order to grab these indices.
Bonus challenge (0 points): Why do we choose to work with the second day of each month instead of the first? You may need to look at the original data set to answer this.
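One hedged plotting sketch, assuming the cleaned frame from Part A and that the sea levels are reported in millimeters (confirm the units in the UHSLC metadata):
In [ ]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(np.arange(len(dfClean)), dfClean['SL'].values, lw=1)
# Tick marks on January 2 of each year divisible by 10; since the
# cleaned data keep only day 2, it suffices to find the January rows.
mask = (dfClean['Year'].values % 10 == 0) & (dfClean['Month'].values == 1)
ax.set_xticks(np.where(mask)[0])
ax.set_xticklabels(dfClean['Year'].values[mask])
ax.set_xlabel('Year')
ax.set_ylabel('Mean sea level (mm)')
plt.show()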
In [ ]:
Part C: Use your cleaned sea levels data frame to create two new Pandas data frames or series:
- one object to contain the sea levels between (and including) the years 1986 and 1995, and
- another object to contain the sea levels between (and including) the years 2006 and 2015.
Then, create a single-panel figure that includes density histograms of each decade of sea levels. Be sure to label everything appropriately.
Finally, based on the data in front of you, formulate and state a hypothesis about how the mean sea level in the decade 2006-2015 compares to the mean sea level in the decade 1986-1995.
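A possible sketch for the subsets and the single-panel histogram figure, again assuming the cleaned frame from Part A:
In [ ]:
# Decade subsets, inclusive of both endpoint years.
sl_8695 = dfClean.loc[(dfClean['Year'] >= 1986) & (dfClean['Year'] <= 1995), 'SL']
sl_0615 = dfClean.loc[(dfClean['Year'] >= 2006) & (dfClean['Year'] <= 2015), 'SL']
# Overlaid density histograms in one panel.
fig, ax = plt.subplots()
ax.hist(sl_8695, density=True, alpha=0.5, label='1986-1995')
ax.hist(sl_0615, density=True, alpha=0.5, label='2006-2015')
ax.set_xlabel('Mean sea level (mm)')
ax.set_ylabel('Density')
ax.legend()
plt.show()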
In [ ]:
Part D: Compute a 99.9% confidence interval for each of (1) the mean sea level in the 1986-1995 decade ($\mu_{old}$) and (2) the mean sea level in the 2006-2015 decade ($\mu_{new}$). You may use Python for arithmetic operations and executing the calculations, but the relevant steps/set-up should be displayed in Markdown/MathJax.
Based on these two confidence intervals, do you think there is sufficient evidence to conclude that there is or is not a significant difference in the mean sea level between 1986-1995 and 2006-2015? Justify your answer.
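The set-up belongs in Markdown/MathJax; for the arithmetic, a large-sample CLT interval of the form $\bar{x} \pm z_{\alpha/2}\, s/\sqrt{n}$ could be computed with a helper like this sketch:
In [ ]:
def mean_ci(data, conf=0.999):
    # Large-sample CLT interval: xbar +/- z_(alpha/2) * s / sqrt(n).
    data = np.asarray(data, dtype=float)
    n, xbar, s = len(data), np.mean(data), np.std(data, ddof=1)
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    half = z * s / np.sqrt(n)
    return xbar - half, xbar + half

# e.g., mean_ci(sl_8695) and mean_ci(sl_0615), using the Part C subsets.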
In [ ]:
Part E: Compute a 99.9% confidence interval for the difference in mean sea level between the 2006-2015 and the 1986-1995 decades ($\mu_{new} - \mu_{old}$). Based on this, make a conclusion regarding your hypothesis from Part C, and compare to what your results in Part D implied. You may use Python for arithmetic operations and executing the calculations, but the relevant steps/set-up should be displayed in Markdown/MathJax.
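A sketch of the corresponding two-sample interval, assuming the Part C subsets and independent samples:
In [ ]:
def diff_mean_ci(x, y, conf=0.999):
    # CLT interval for E[x] - E[y]:
    # (xbar - ybar) +/- z * sqrt(s_x^2/n_x + s_y^2/n_y).
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return diff - z * se, diff + z * se

# e.g., diff_mean_ci(sl_0615, sl_8695) estimates mu_new - mu_old.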
In [ ]:
Part F: The confidence intervals from Parts D and E were derived using the Central Limit Theorem. Which assumption of the Central Limit Theorem would likely be violated if we took more than one measurement per month to form our samples, and why?
[25 points] Problem 2 Quality of Red vs White Wine
Part A: Load the data in winequality-red.csv and winequality-white.csv into Pandas DataFrames. They are available under Resources on Piazza, and linked here and below. A description of this dataset can be found on UC Irvine's Machine Learning Repository. The quantity of interest for this problem is the quality of the wine.
Are we justified in using the Central Limit Theorem in our analysis of estimates of the mean and proportions of the data? Justify your response.
In [ ]:
# read either local or web file version; pick whichever works for you
local_file_white = "data/winequality-white.csv"
local_file_red = "data/winequality-red.csv"
web_file_white = "https://raw.githubusercontent.com/dblarremore/csci3022/master/homework/homework5/data/winequality-white.csv"
web_file_red = "https://raw.githubusercontent.com/dblarremore/csci3022/master/homework/homework5/data/winequality-red.csv"
dfRed = pd.read_csv(local_file_red, delimiter=';')
dfWhite = pd.read_csv(local_file_white, delimiter=';')
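A quick way to check the usual CLT rules of thumb is to count observations and successes, as in this sketch (the quality column name comes from the UCI files read above):
In [ ]:
# The rules of thumb want n large, and np and n(1-p) not tiny.
for name, df in [('red', dfRed), ('white', dfWhite)]:
    n = len(df)
    n_high = (df['quality'] >= 7).sum()
    print('{}: n = {}, count with quality >= 7: {}'.format(name, n, n_high))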
In [ ]:
Part B: Let $X$ be a random variable denoting the quality of a bottle of wine, and let $Y$ be a random variable denoting its color (either red ($Y = R$) or white ($Y = W$)). For the remainder of this problem, we are concerned with probabilities such as "If I buy a random bottle of red wine, what is the probability that its quality is at least a 7?". We could write that probability as $P(X \geq 7 \mid Y = R)$, for example, and consider it the proportion of the population of red wines that are at least a 7 in quality. Calculate and report estimates of $P(X \geq 7 \mid Y = R)$ and $P(X \geq 7 \mid Y = W)$.
Obtain 95% confidence intervals for the proportion of red and white wines that are at least a 7 in quality (obtain one CI for each color). Based on your results, if you are interested in buying many high quality bottles of wine but are buying totally at random, is one color a better bet than the other? Fully justify your answer.
Calculations may be executed in Python, but you need to set up your work (what it is you are calculating) in Markdown/MathJax.
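For the arithmetic, a large-sample (Wald) proportion interval might be sketched as:
In [ ]:
def proportion_ci(count, n, conf=0.95):
    # Wald interval: phat +/- z * sqrt(phat * (1 - phat) / n).
    phat = count / n
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    half = z * np.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

# e.g., for red wines of quality at least 7:
print(proportion_ci((dfRed['quality'] >= 7).sum(), len(dfRed)))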
In [ ]:
Part C: Now, as college students (and teachers), we might not be super concerned with buying a really high quality bottle of wine. Let's focus instead on making sure we do not buy a really disgusting bottle of wine. Calculate and report estimates of $P(X \geq 5 \mid Y = R)$ and $P(X \geq 5 \mid Y = W)$.
Obtain 95% confidence intervals for the proportion of red and white wines that are at least a 5 in quality, that is, for $P(X \geq 5 \mid Y = R)$ and $P(X \geq 5 \mid Y = W)$. Based on your results and what you saw in Part B, if you are interested in buying bottles of wine that are at least a 5 in quality, but are again buying wine totally randomly, can you conclude that you are better off buying one color over the other? Fully justify your answer.
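The same helper sketched in Part B applies with the new threshold, for example:
In [ ]:
# Reusing proportion_ci from Part B with the quality >= 5 threshold.
for name, df in [('red', dfRed), ('white', dfWhite)]:
    count = (df['quality'] >= 5).sum()
    print(name, proportion_ci(count, len(df)))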
In [ ]:
Part D: Compute a 95% confidence interval for the difference in proportions of red and white wines that are at least a 5 in quality.
Now, based on your results for this part, can you conclude that you are better off buying one color over the other? Fully justify your answer. How does your work here differ from your work in Part C?
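A sketch of the Wald interval for a difference in proportions:
In [ ]:
def diff_proportion_ci(count1, n1, count2, n2, conf=0.95):
    # Wald interval for p1 - p2:
    # (p1hat - p2hat) +/- z * sqrt(p1hat(1-p1hat)/n1 + p2hat(1-p2hat)/n2).
    p1, p2 = count1 / n1, count2 / n2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se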
In [ ]:
Part E: Now, we have many more observations of white wines than red. This certainly contributes to the width of the 95% confidence interval for the proportion of red wines that are at least a 5 in quality, which you should have found in Part C to be wider than the corresponding confidence interval for white wines.
How large would our sample size of red wines need to be in order to guarantee that this 95% confidence interval width is at most 0.01? Note that we are hypothetically adding more samples, so we do not know the precise value of $\hat{p}$.
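Since $\hat{p}$ is unknown for the hypothetical extra samples, the conservative bound $p(1-p) \leq 1/4$ yields a worst-case sample size, sketched here:
In [ ]:
# Width = 2 * z * sqrt(p(1-p)/n) <= w; the worst case p(1-p) = 1/4
# gives n >= (z / w)^2.
w = 0.01
z = stats.norm.ppf(0.975)
n_required = int(np.ceil((z / w)**2))
print(n_required)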
In [ ]:
[30 points] Problem 3 Exploring Confidence Intervals
The Gumbel distribution is one of several distributions frequently used to model environmental extremes (for example, extreme temperatures and sea levels). It is also fairly asymmetric, and thus interesting for investigating confidence intervals. It is implemented in scipy.stats as gumbel_r, where the suffix _r denotes the right-skewed version of the Gumbel distribution (as opposed to the left-skewed version).
Part A: Complete the following code cell to plot a histogram of 100 realizations from the Gumbel distribution with parameters $\mu = 8$ and $\beta = 2$. Be sure to leave this cell executed before turning in your assignment! Make your histogram grey with gold edges.
In [ ]:
mu = 8
beta = 2
n_sample = 100

# your code here
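One possible way to fill in the cell, as a sketch:
In [ ]:
# Draw the realizations and histogram them, grey with gold edges.
x = stats.gumbel_r.rvs(loc=mu, scale=beta, size=n_sample)
plt.hist(x, color='grey', edgecolor='gold', density=True)
plt.xlabel('x')
plt.ylabel('Density')
plt.show()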
Part B: Look up the analytical mean and variance of the Gumbel distribution with parameters $\mu$ and $\beta$, and calculate them here by hand. Note that the Euler-Mascheroni constant $\gamma$ can be accessed via np.euler_gamma.
Use the empirical mean from your sample in Part A and the true variance of the Gumbel distribution to compute, by hand, a 95% confidence interval for the mean.
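For reference, the Gumbel mean is $\mu + \beta\gamma$ and the variance is $\pi^2\beta^2/6$. A sketch of the interval arithmetic, assuming the sample x drawn in Part A:
In [ ]:
# True moments of Gumbel(mu, beta).
true_mean = mu + beta * np.euler_gamma      # E[X] = mu + beta * gamma
true_var = (np.pi**2) * beta**2 / 6         # Var[X] = pi^2 * beta^2 / 6
# 95% CI for the mean: sample mean +/- z * sqrt(true_var / n).
z = stats.norm.ppf(0.975)
half = z * np.sqrt(true_var / n_sample)
print(true_mean, (np.mean(x) - half, np.mean(x) + half))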
In [ ]:
Part C: A theoretical interlude. When Stella O'Flaherty (the famous octopus) ran her solution code for Part B, she obtained a 95% confidence interval of $[8.81, 9.82]$ for the mean of the distribution. For each of the following, explain why or why not the situation described is correct, given the technical definition of a 95% confidence interval we went over in class.
(i) If you had no other evidence regarding the true mean of the distribution, you could say there is a 95% chance that its true mean falls between 8.81 and 9.82.
(ii) If a class of 100 students all construct 95% confidence intervals for the mean of the distribution, then we expect about 95 of their CIs to contain the true mean, and about 5 of them to miss the true mean.
(iii) There is a 95% probability that any given random variable sampled from this Gumbel distribution will be between 8.81 and 9.82.
Part D: In this part you'll write a function to investigate the coverage properties of a confidence interval for the mean of the Gumbel distribution. Complete the following function to randomly sample $m$ sample means, each with sample size $n$, from the Gumbel distribution with parameters $\mu = 8$ and $\beta = 2$. For each random sample, compute the 66% confidence interval for the mean. Note that you actually know the variance of the true population distribution: $\pi^2 \beta^2 / 6$. Your function should do two things:
- Report the proportion of confidence intervals that successfully cover the true mean of the distribution
- Make a plot of 50 randomly selected confidence intervals. Overlay the intervals on a horizontal line at the true mean (from Part B). Color confidence intervals black if they cover the true mean, and red if they don't.
Be sure to leave this cell executed before turning in your assignment!
In [ ]:
def confidence_intervals(m=500, n=100):
    mu = 8
    beta = 2
    # Your code here
    proportion_CIs_covering_mean = 0
    print("proportion covering mean: {:.3f}".format(proportion_CIs_covering_mean))

confidence_intervals()
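A filled-in sketch of one possible implementation (for a 66% interval, $z$ sits at the 0.83 quantile, since $1 - 0.34/2 = 0.83$):
In [ ]:
def confidence_intervals_sketch(m=500, n=100):
    mu, beta = 8, 2
    true_mean = mu + beta * np.euler_gamma
    sd = np.sqrt(np.pi**2 * beta**2 / 6)    # known population sd
    z = stats.norm.ppf(0.83)                # 66% CI
    half = z * sd / np.sqrt(n)
    xbars = np.array([np.mean(stats.gumbel_r.rvs(loc=mu, scale=beta, size=n))
                      for _ in range(m)])
    lowers, uppers = xbars - half, xbars + half
    covers = (lowers <= true_mean) & (true_mean <= uppers)
    print("proportion covering mean: {:.3f}".format(covers.mean()))
    # Plot 50 randomly chosen intervals over a line at the true mean.
    idx = np.random.choice(m, 50, replace=False)
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.axhline(true_mean, color='gold')
    for j, i in enumerate(idx):
        ax.plot([j, j], [lowers[i], uppers[i]],
                color='black' if covers[i] else 'red')
    plt.show()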
Part E: Does the proportion of confidence intervals that cover the true mean of the distribution agree with the theory described in class? Justify your conclusion.
[15 points] Problem 4 Free Throws
Keep your skills sharp by answering these straightforward questions.
Part A:
You have a shuffled deck of cards. It includes the usual 52 cards AND three special additional Octopus cards. You flip over the cards one by one, without replacing them in the deck. You count how many cards you have to flip until you flip over the second Octopus. You repeat this many times. Simulate this process. Plot a histogram with a bin size of 1 of the outcomes, in light grey with white outlines. Compute the mean, median, and mode for this dataset, and indicate them on the plot too, using linestyles of green dashed, pink dotted, and black solid, respectively. Look up how to do a legend in Matplotlib, and label your histogram, mean, median, and mode.
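A simulation sketch under the stated setup (55 cards, 3 of them Octopuses):
In [ ]:
# Flip count until the second Octopus in a shuffled 55-card deck.
def second_octopus():
    deck = np.array([1] * 3 + [0] * 52)     # 1 marks an Octopus card
    np.random.shuffle(deck)
    return np.where(deck == 1)[0][1] + 1    # 1-indexed flip position

draws = np.array([second_octopus() for _ in range(10000)])
plt.hist(draws, bins=np.arange(0.5, 56.5, 1), color='lightgrey',
         edgecolor='white', label='flips to second Octopus')
plt.axvline(draws.mean(), color='green', linestyle='dashed', label='mean')
plt.axvline(np.median(draws), color='pink', linestyle='dotted', label='median')
plt.axvline(np.bincount(draws).argmax(), color='black', linestyle='solid',
            label='mode')
plt.xlabel('Number of flips')
plt.ylabel('Count')
plt.legend()
plt.show()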
In [ ]:
Part B:
In general, which is wider: a 95% confidence interval or a 99% confidence interval? How would you explain this to your younger sibling, who is not a statistician?
In [ ]:
Part C:
Luckily, his fingertips also brush against your arm. That's a foul, and everyone saw it. Back to the line. Back to CSCI 3022:
Let $X$ be a normally distributed random variable. You draw $n = 10$ values from it and get these values, stored in the numpy array durant, below. Compute a 95% confidence interval for the standard deviation.
In [ ]:
durant = np.array([3.7778,3.9459,3.8248,4.1111,4.0180,4.0898,4.0380,3.9273,3.9614,3.8387])
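A sketch of the normal-theory interval, using the pivot $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ and taking square roots of the variance endpoints:
In [ ]:
n = len(durant)
s2 = np.var(durant, ddof=1)
lo = np.sqrt((n - 1) * s2 / stats.chi2.ppf(0.975, n - 1))
hi = np.sqrt((n - 1) * s2 / stats.chi2.ppf(0.025, n - 1))
print(lo, hi)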
In [ ]:
Part D:
If you're doing quality control for the average strength of carbon fiber that will be used in airplane construction, and your alternative hypothesis is that the strength of the carbon is below tolerance, and therefore unsafe, would you rather have a low Type I error rate or a low Type II error rate? Explain.
Part E:
You measure 53 suckers from baby reef octopuses and find that they are, on average, 45.2 mm wide, with a standard deviation of 30.4 mm.
Then you measure 41 suckers from baby dumbo octopuses and find that they are, on average, 52.8 mm wide, with a standard deviation of 22.8 mm.
Is there statistical evidence at the 0.05 significance level that the true mean of baby dumbo octopus sucker width exceeds the true mean of baby reef octopus sucker width by more than 6 mm? Use a test of your choice.
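One reasonable choice is a two-sample z test of $H_0: \mu_D - \mu_R = 6$ against $H_a: \mu_D - \mu_R > 6$ (a Welch t test would also be defensible), sketched here from the summary statistics above:
In [ ]:
xbar_r, s_r, n_r = 45.2, 30.4, 53   # reef
xbar_d, s_d, n_d = 52.8, 22.8, 41   # dumbo
se = np.sqrt(s_r**2 / n_r + s_d**2 / n_d)
z = (xbar_d - xbar_r - 6) / se
p_value = 1 - stats.norm.cdf(z)
print(z, p_value)   # compare p_value to alpha = 0.05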
In [ ]: