[Solved] DATA201 Assignment 3 Probability and Statistics, Ethics and privacy


1. Calculate the probabilities of the various types of nest building. What is the probability that a male is involved in nest building?

In [2]:
# Assumes earlier cells ran `import pandas as pd` and loaded the data into euro_birds
# Count each nest-building category and convert the counts to probabilities
table_nb = euro_birds['Nest.building'].value_counts()
prob_nb = pd.DataFrame({'counts': table_nb, 'probs': table_nb/sum(table_nb)})
print('The probability of a Male being involved in nest building is {0:1.4f}({1:3.2f}%)'.format(prob_nb['probs']['M'], prob_nb['probs']['M']*100))

The probability of a Male being involved in nest building is 0.0544(5.44%)

2. How many bird species have a solely monogamous mating system?

In [3]:
table_ms = euro_birds['Mating.system'].value_counts()
print('{0:3d} bird species have a solely monogamous mating system'.format(table_ms['M']))

435 bird species have a solely monogamous mating system

3. After monogamy only, what is the next most common mating system?

In [4]:
# Map the mating-system codes to readable names
dict_ms = {'M': 'monogamous', 'M,PA': 'monogamous & polyandrous',
           'M,PG': 'monogamous & polygynous', 'M,PG,PA': 'monogamous, polygynous & polyandrous',
           'M,PG,PM': 'monogamous, polygynous & promiscuous', 'M,PM': 'monogamous & promiscuous',
           'PG': 'polygynous', 'PG,PA': 'polygynous & polyandrous',
           'PG,PM': 'polygynous & promiscuous', 'PM': 'promiscuous'}
# Sort by frequency; index 0 is monogamy only, so index 1 is the runner-up
table_ms_sorted = table_ms.sort_values(ascending=False)
print('After monogamy only, the next most common mating system is a {ms} mating system'.format(ms = dict_ms[table_ms_sorted.index.values[1]]))

After monogamy only, the next most common mating system is a monogamous & polygynous mating system

4. What is the probability that a species is Sedentary (lives in the same area in both the breeding and nonbreeding season)?

In [5]:
table_sed = euro_birds['Sedentary'].value_counts()
prob_sedent = pd.DataFrame({'counts': table_sed, 'probs': table_sed/sum(table_sed)})
print('A species has a {0:1.4f}({1:3.2f}%) chance of being sedentary'.format(prob_sedent['probs'][1.0], prob_sedent['probs'][1.0]*100))

A species has a 0.3727(37.27%) chance of being sedentary

5. What is the probability that a species is Sedentary and occupies human settlements in its breeding area?

In [6]:
# normalize='all' divides every cell by the grand total, giving joint probabilities
table_sed_hs = pd.crosstab(euro_birds['Sedentary'], euro_birds['Human.settlements'], normalize='all')
print('The probability that a species is Sedentary and occupies human settlements in its breeding area is {0:1.4f}({1:3.2f}%)'.format(table_sed_hs[1.0][1.0], table_sed_hs[1.0][1.0]*100))

The probability that a species is Sedentary and occupies human settlements in its breeding area is 0.0661(6.61%)

6. What is the probability that a Sedentary species occupies human settlements in its breeding area?

In [7]:
# normalize='index' divides each row by its row total: P(Human.settlements | Sedentary)
table_hs_given_sed = pd.crosstab(euro_birds['Sedentary'], euro_birds['Human.settlements'], normalize='index')
print('The probability that a Sedentary species occupies human settlements in its breeding area is {0:1.4f}({1:3.2f}%)'.format(table_hs_given_sed[1.0][1.0], table_hs_given_sed[1.0][1.0]*100))

The probability that a Sedentary species occupies human settlements in its breeding area is 0.1774(17.74%)

7. What is the probability that a species is Sedentary, given that it occupies human settlements in its breeding area?

In [8]:
# normalize='columns' divides each column by its column total: P(Sedentary | Human.settlements)
table_sed_given_hs = pd.crosstab(euro_birds['Sedentary'], euro_birds['Human.settlements'], normalize='columns')
print('The probability that a species is Sedentary, given that it occupies human settlements in its breeding area is {0:1.4f}({1:3.2f})%'.format(table_sed_given_hs[1.0][1.0], table_sed_given_hs[1.0][1.0]*100))

The probability that a species is Sedentary, given that it occupies human settlements in its breeding area is 0.7021(70.21)%

8. A test for Coronavirus is 70% likely to detect the infection if it is present, and 99.1% likely to return a negative test if the infection is absent. If the prevalence of the disease (the proportion of people who have the disease) is 0.1%, then what is the probability that a person who tests positive actually has the disease?

In [9]:
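The code for this cell is not shown here, so what follows is a minimal sketch of the Bayes' theorem calculation the question describes, not the original solution. The names posterior_given_positive, sensitivity, specificity and prevalence are illustrative assumptions.

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    # P(positive and disease)
    true_pos = sensitivity * prevalence
    # P(positive and no disease): the false-positive rate is 1 - specificity
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Values taken from the question: 70% detection, 99.1% true negatives, 0.1% prevalence
posterior = posterior_given_positive(sensitivity=0.70, specificity=0.991, prevalence=0.001)
print('{0:1.4f}({1:3.2f}%)'.format(posterior, posterior * 100))
```

Because the disease is rare, the false positives among the many healthy people outnumber the true positives, which is why the result is so far below the test's 70% detection rate.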

The probability that a person who tests positive actually has the disease is 0.0722(7.22%)

9. How would your answer above change if the probability of a false positive test was zero?

If the probability of a false positive were 0, every positive test would be a true positive, so the probability that a person who tests positive actually has the disease would be 100%.

In [10]:
prob_pos = 0.7        # P(positive | disease)
prob_neg = 0.991      # P(negative | no disease); unused here because false positives are ruled out
prob_disease = 0.001  # prevalence
# Bayes' theorem with the false-positive term in the denominator set to 0
prob = (prob_pos*prob_disease)/((prob_pos*prob_disease)+0)
print('The probability that a person who tests positive actually has the disease (with no possibility of a false positive) is {0:1.2f}({1:3d}%)'.format(prob, int(prob*100)))

The probability that a person who tests positive actually has the disease (with no possibility of a false positive) is 1.00(100%)

Theoretical Probability Distributions (5 Marks)

10. A Poisson random variable is often used to model counts of customer arrivals in a shop. Assume that the number of customers to arrive in a particular hour follows a Poisson(5) distribution. Compute and plot the probability distribution of a Poisson(5) distribution. (Plot the distribution over the range 0 to 15.)

In [11]:
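A dependency-free sketch of the Poisson(5) probabilities over 0 to 15, computed directly from the probability mass function P(X = k) = e^(-mu) * mu^k / k!. The helper poisson_pmf and the text bar chart are illustrative assumptions, not the notebook's own cell; the other cells use scipy.stats.poisson, which gives the same values, and a graphical bar plot would normally replace the text bars.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

mu = 5  # rate parameter from the question
pmf = [poisson_pmf(k, mu) for k in range(16)]

# Crude text plot: one '#' per 0.5% of probability
for k, p in enumerate(pmf):
    print('{0:2d} {1:6.4f} {2}'.format(k, p, '#' * round(p * 200)))
```

Summing slices of pmf gives cumulative probabilities, for example sum(pmf[:10]) for P(X < 10).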

11. Find out
(a) The mean and variance of the distribution
(b) The probability that two customers arrive in a particular hour
(c) The probability fewer than 10 arrive
(d) The probability that no more than 10 arrive
(e) The probability that more than 15 arrive

In [12]:
# In a Poisson distribution, rate parameter == mean == variance
# mu is the rate parameter (5), assumed to be defined in the previous cell
print('(a) The mean and variance of the distribution: {0:1.0f}'.format(stats.poisson.mean(mu)))
print('(b) The probability that two customers arrive in a particular hour: P(X=2)~{0:1.4f}({1:3.2f}%)'.format(stats.poisson.pmf(2, 5), stats.poisson.pmf(2, 5)*100))
print('(c) The probability fewer than 10 arrive: P(X<10)~P(X<=9)~{0:1.4f}({1:3.2f}%)'.format(stats.poisson.cdf(9, 5), stats.poisson.cdf(9, 5)*100))
print('(d) The probability that no more than 10 arrive: P(X<=10)~{0:1.4f}({1:3.2f}%)'.format(stats.poisson.cdf(10, 5), stats.poisson.cdf(10, 5)*100))
print('(e) The probability that more than 15 arrive: P(X>15)~1-P(X<=15)~{0:1.4f}({1:3.2f}%)'.format((1-stats.poisson.cdf(15, 5)), (1-stats.poisson.cdf(15, 5))*100))

(a) The mean and variance of the distribution: 5
(b) The probability that two customers arrive in a particular hour: P(X=2)~0.0842(8.42%)
(c) The probability fewer than 10 arrive: P(X<10)~P(X<=9)~0.9682(96.82%)
(d) The probability that no more than 10 arrive: P(X<=10)~0.9863(98.63%)
(e) The probability that more than 15 arrive: P(X>15)~1-P(X<=15)~0.0001(0.01%)

Model fitting (9 Marks)

Use the European Birds data set from above. Before you start, ensure that you rename the column Sexual.Dimorphism as SexualDimorphism since the . in its name causes a problem for the ols fitting command. Use the rename() command to do this.

In [13]:
euro_birds = euro_birds.rename(columns={'Sexual.dimorphism': 'SexualDimorphism'})

12. Draw a scatter plot of the log of female bill length against the log of female breeding weight. Distinguish using a plot symbol species that are or are not sexually dimorphic (with a difference between males and females in size/colour).

In [14]:

13. Fit a regression model for log female bill length as predicted by log female breeding weight.
(a) Print out a summary of the model fit
(b) Plot the fitted curve onto the data
(c) Draw a scatter plot of the residuals and comment on them

In [15]:
### 13.(a) Print out a summary of the model fit
print('(a) Print out a summary of the model fit')
fittedmodel = smf.ols(formula='logBillF ~ logWeightF', data=euro_birds).fit()
pred = fittedmodel.predict(euro_birds)
print(fittedmodel.summary())

### 13.(b) Plot the fitted curve onto the data
print('(b) Plot the fitted curve onto the data')
# Scatter plot from Q12 (x, y, cols and legend_elements come from that cell)
fig, ax = pl.subplots(1,1)
pl.scatter(x, y, marker='.', c=cols)
ax.legend(handles=legend_elements, loc='lower right')
ax.set(xlabel='log(weight in g)', ylabel='log(length in cm)', title='female breeding weight vs. female bill length between sexual dimorphism')
# Plotting the regression line across the full width of the axes
x_min, x_max = ax.get_xbound()
x_bounds = [x_min, x_max]
x_vals = pd.DataFrame({'logWeightF': x_bounds})
y_vals = fittedmodel.predict(x_vals)
pl.plot(x_vals, y_vals, '-', c='#f6b93b');
pl.show()

### 13.(c) Draw a scatter plot of the residuals and comment on them
print('(c) Draw a scatter plot of the residuals and comment on them')
# Residual = observed value minus fitted value
residuals = y - pred
fig, ax = pl.subplots(1,1)
pl.scatter(x, residuals, marker='.', c=cols)
ax.set(xlabel='log(weight in g)', ylabel='Residual', title='residuals')
ax.legend(handles=legend_elements, loc='upper right')
# Horizontal reference line at residual = 0
x_min, x_max = ax.get_xbound()
x_bounds = [x_min, x_max]
y_vals = [0, 0]
pl.plot(x_bounds, y_vals, '-', c='#f6b93b')
pl.show()

(a) Print out a summary of the model fit
                            OLS Regression Results
==============================================================================
Dep. Variable:               logBillF   R-squared:                       0.557
Model:                            OLS   Adj. R-squared:                  0.556
Method:                 Least Squares   F-statistic:                     620.5
Date:                Fri, 15 May 2020   Prob (F-statistic):           2.51e-89
Time:                        17:57:44   Log-Likelihood:                -281.26
No. Observations:                 496   AIC:                             566.5
Df Residuals:                     494   BIC:                             574.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.9272      0.054     35.584      0.000       1.821       2.034
logWeightF     0.2609      0.010     24.911      0.000       0.240       0.281
==============================================================================
Omnibus:                       22.890   Durbin-Watson:                   0.546
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.270
Skew:                           0.264   Prob(JB):                     1.48e-10
Kurtosis:                       4.383   Cond. No.                         15.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

(b) Plot the fitted curve onto the data

(c) Draw a scatter plot of the residuals and comment on them

Commenting on the residuals: There's a cluster towards the left; the rest of the points are scattered more widely rather than sitting close to zero, which is what we want (the closer a residual is to zero, the closer the prediction is to the actual value). However, there aren't any extreme outliers. The residuals range from -1.0 to 1.5, which isn't a huge spread. There aren't any clear patterns, which is good: the fitted line seems to pass through the points smoothly, and there is no funnelling.

14. Now add Sexual Dimorphism as a covariate, and see if that improves the model by inspecting the residual scatter plot.

In [16]:
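The code for this cell is not shown here. As a hedged illustration of what adding a covariate does to the residuals, the sketch below fits a least-squares model with and without a 0/1 indicator and compares the residual sums of squares. It uses plain Python and synthetic data, not the bird data; the helpers solve and ols are hypothetical names. With statsmodels, the equivalent step would presumably be a formula such as 'logBillF ~ logWeightF + SexualDimorphism', which is an assumption based on the column rename requested earlier.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least-squares coefficients (intercept first) and residuals via the normal equations."""
    X = [[1.0] + list(row) for row in zip(*columns)]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    # Residual = observed value minus fitted value
    residuals = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    return beta, residuals

# Synthetic data (not the bird data): y depends on x and on a 0/1 indicator d
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
d = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
y = [2.0 + 0.25 * xi + 0.1 * di for xi, di in zip(x, d)]

beta_reduced, res_reduced = ols([x], y)   # model without the covariate
beta_full, res_full = ols([x, d], y)      # model with the covariate
rss_reduced = sum(r * r for r in res_reduced)
rss_full = sum(r * r for r in res_full)
print('RSS without covariate: {0:.6f}, with covariate: {1:.6f}'.format(rss_reduced, rss_full))
```

If the covariate carried little information, as the comment on the bird data suggests, the two residual sums of squares would be nearly equal and the residual plots would look almost the same.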

Commenting on the residuals with Sexual Dimorphism as a covariate: There isn't any noticeable difference in the plots. Furthermore, printing the residuals without Sexual Dimorphism and comparing them to the residuals with Sexual Dimorphism shows that there's only a marginal difference between the two.

Ethical and privacy issues (5 Marks)

15. During the Coronavirus pandemic, many countries are investigating the electronic collection of contact data as a means of identifying close contacts of people who are diagnosed with Covid-19. One possibility is an app on a smartphone which broadcasts an identifier which other devices record when in close proximity. If a person is identified as a case, then the data from their app is used to identify the personal devices to which they have been close in the previous weeks.

Write a short discussion, of about 250-300 words, about the issues of privacy, security and confidentiality that you see with a plan of this sort.

There are privacy, security and confidentiality issues when it comes to contact tracing apps. The justification for apps of this sort rests solely on their potential to provide significant benefits and, where possible, to mitigate the risks to an individual's privacy, security and confidentiality.

Any application that uses a device to communicate with other devices must be careful not to breach a person's right to privacy. Collecting data about you and who you are in close proximity to on a daily basis can be a dangerous tool, especially given the potential of social network analysis and other data science tools. An application that traces a user's movements is intrusive and predatory, but then again it's what Google does anyway. The app must make end users aware of what data it is collecting, how it is collecting it and why. The app must be voluntary; this is a trade-off between participation and a forced invasion of privacy, and if the application were compulsory the need would have to be extremely well justified. The scope of the application must not extend beyond Covid-19.

It is paramount that data of this sensitivity is not viewed in any way by those without authorisation. This includes not just hackers but others who might be interacting with this data in some capacity. Data should be encrypted, salted and hashed, with proper E2E encryption, not whatever Zoom is using. The app must have limited use, such that identifiable information mustn't be viewable and the app is used only for its sole purpose.

Confidentiality is the act of obscuring data so that it's difficult (hopefully impossible) to trace data back to the individual. An app of this sort can potentially breach confidentiality by collecting personal data such as locations, the people they're with, etc. A way to mitigate this issue is to inject noise into the data so that it's harder to identify individuals.
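The noise-injection idea in the last paragraph can be sketched as follows. This is a minimal illustration that assumes Laplace noise added to an aggregate count, the mechanism commonly used in differential privacy; the function names, the scale value and the example count are hypothetical, and nothing here describes how any real contact-tracing app works.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, scale, rng):
    """Return the count with Laplace noise added, so one person's record is harder to infer."""
    return true_count + laplace_noise(scale, rng)

rng = random.Random(0)  # fixed seed only to make the illustration repeatable
true_contacts = 12      # hypothetical number of close contacts
samples = [noisy_count(true_contacts, scale=2.0, rng=rng) for _ in range(10000)]
print('one noisy release: {0:.2f}'.format(samples[0]))
print('mean of many releases: {0:.2f}'.format(sum(samples) / len(samples)))
```

A larger scale hides individuals better but makes each released figure less accurate, which mirrors the benefit-versus-risk trade-off discussed above.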
