Assignment 5
EMATM0061: Statistical Computing and Empirical Methods, TB1, 2024
Introduction
This assignment is mainly based on Lectures 12, 13 and 14. It is recommended that you watch the video lectures before starting.
Create an R Markdown for the assignment
It is a good practice to use R Markdown to organise your code and results. For example, you can start with the template called Assignment02_TEMPLATE.Rmd which can be downloaded via Blackboard. If you try to write mathematical expressions in R Markdown, examples can be found in the document “Assignment_R MarkdownMathformulasandSymbolsExamples.rmd” (under the resource list tab on blackboard course webpage).
You can optionally submit this assignment by 13:00 Monday 21st October, which will help us understand your work but will not count towards your final grade. The submission point can be found under the assignment tab at Blackboards (click the title “Assignment 05” and upload a pdf file).
Load packages
Some of the questions in this assignment require the following packages.
1. The tidyverse package:
library(tidyverse)
2. The Stat2Data package and then the dataset Hawks:
library(Stat2Data)
data(“Hawks”)
1. Exploratory data analysis
This section covers some of the concepts from Lecture 12 on Exploratory Data Analysis.
We will use the Hawks dataset that you have loaded.
head(Hawks)
## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex
Wing
## 1 9 19 1992 13:30 877-76317 RT I
385
## 2 9 22 1992 10:30 877-76318 RT I
376
## 3 9 23 1992 12:45 877-76319 RT I
381
## 4 9 23 1992 10:50 745-49508 CH I F
265
## 5 9 27 1992 11:15 1253-98801 SS I F
205
## 6 9 28 1992 11:25 1207-55910 RT I
412
## Weight Culmen Hallux Tail StandardTail Tarsus WingPitFat KeelFat
Crop
## 1 920 25.7 30.1 219 NA NA NA NA
NA
## 2 930 NA NA 221 NA NA NA NA
NA
## 3 990 26.7 31.3 235 NA NA NA NA
NA
## 4 470 18.7 23.5 220 NA NA NA NA
NA
## 5 170 12.5 14.3 157 NA NA NA NA
NA
## 6 1090 28.5 32.2 230 NA NA NA NA
NA
1.1 Location estimators
(Q1) Let’s start by computing some location estimators for Hawks’ “Tail” .
First, create a vector called “HawksTail”, the elements of which are from the “Tail” column of Hawks data frame. The first part of the vector should look like:
## [1] 219 221 235 220 157 230
Second, use the “mean” and “median” functions to compute the sample mean and sample median from the vector “HawksTail” . (note that inputs of the “mean”
function are vectors. Type ?mean for further details).
1.2 Combining location estimators with the summarise function
(Q1) Use a combination of the “summarise()”, “mean()” and “median()” to compute the sample mean, sample median and trimmed sample mean (with q = 0.5) of the Hawk’s wing length and Hawk’s weight (i.e., the “Wing” and “Weight” columns). You may need to remove the NA values. What can you say by comparing the results of the median and the trimmed mean that you obtain?
Your result should look something like this:
## Wing_mean Wing_t_mean Wing_med Weight_mean Weight_t_mean
Weight_med
## 1 315.6375 370 370 772.0802 970
970
(Q2) Combine them with the “group_by()” function to obtain a breakdown by species. Your result should look something like this:
## # A tibble: 3 × 7
## Species Wing_mean Wing_t_mean Wing_med Weight_mean Weight_t_mean
Weight_med
##
## 1 CH 244. 240 240 420. 378.
378.
## 2 RT 383. 384 384 1094. 1070
1070
## 3 SS 185. 191 191 148. 155
155
1.3 Location and dispersion estimators under linear transformations
(Q1) Suppose that a variable of interest x has values x1, ⋯ ,xn. Suppose that x1, ⋯ ,xn~has a s~ample ~mean A. Let a,~b ∈ ℝ be real numbers and define a new variable x with x~1, ⋯ ,x~n defined by xi = axi + bfor i = 1,2, ⋯ ,n. What is the sample mean of x1, ⋯ ,xn as a function of a, b and A? (please write down your answer as an expression of a, A, and b. You don’t need to use R).
Now using the vector “HawksTail” that you created in Section 1.1 as data and letting a = 2 and b = 3, verify your conclusion using R codes: Compute the mean of
“HawksTail*a+b” and then compare it with the one obtained from the mean of “HawksTail” and your conclusion.
(Q2) Suppose further that x1, ⋯ ,x~n has s~ample variance p and standard deviation q. W~hat is~the sample variance of x1, ⋯ ,xn? What is the sample standard deviation of x1, ⋯ ,xn? (Please write down your results.)
Now using the vector “HawksTail” that you created in Section 1.1 as data and letting a = 2 and b = 3, verify your result using R codes again.
1.4 Robustness of location estimators
In this exercise we shall investigate the robustness of several location estimators: The sample mean, sample median and trimmed mean.
We begin by extracting a vector called “hal” consisting of the talon lengths of all the hawks with any missing values removed.
hal<-Hawks$Hallux # Extract the vector of hallux lengths
hal<-hal[!is.na(hal)] # Remove any nans
To investigate the effect of outliers on estimates of location we generate a new vector called ““corrupted_hall”” with 10 outliers each of value 100 created as follows:
outlier_val<-100
num_outliers<-10
corrupted_hal<-c(hal,rep(outlier_val,times=num_outliers))
We can then compute the mean of the original sample and the corrupted sample as follows.
mean(hal)
## [1] 26.41086
mean(corrupted_hal)
## [1] 27.21776
Now let’s investigate what happens as the number of outliers changes from 0 to
1000. The code below generates a vector called ““means_vect”” which gives the sample means of corrupted samples with different numbers of outliers. More
precisely, “means_vect” is a vector of length 1001 with the i-th entry equal to the mean of a sample with i − 1 outliers.
num_outliers_vect <- seq(0,1000)
means_vect <- c()
for(num_outliers in num_outliers_vect){
corrupted_hal <- c(hal,rep(outlier_val,times=num_outliers))
means_vect <- c(means_vect, mean(corrupted_hal))
}
(Q1) Sample median:
Copy and modify the above code to create an additional vector called
““medians_vect”” of length 1001 with the i-th entry equal to the median of a sample ““corrupted_hal”” with i − 1 outliers.
(Q2) Sample trimmed mean:
Amend the code further to add an additional vector called “t_means_vect” of length 1001 with the i-th entry equal to the trimmed mean of a sample with i − 1 outliers, where the trimmed mean has a trim fraction q = 0.1.
(Q3) Visualisation
Now you should have the vectors
““num_outliers_vect”“,”“means_vect”“,”“medians_vect”” and ““t_means_vect”“ . Combine these vectors into a data frame. with the following code.
df_means_medians <- data.frame(num_outliers=num_outliers_vect,
mean=means_vect,
t_mean=t_means_vect,
median=medians_vect)
Now use the code below to reshape and plot the data. Recall that the function “pivot_longer()” below is used to reshape the data. Your result should look like:
df_means_medians %>%
pivot_longer(!num_outliers, names_to = “Estimator”, values_to =
“Value”) %>%
ggplot(aes(x=num_outliers,color=Estimator,
linetype=Estimator,y=Value)) +
geom_line()+xlab(“Number of outliers”)
Which quantity is the most robust when the number of outliers is small? (Note that, in this experiment, the term outliers simply means the artificial data used to corrupt the vector. It is not related to the outliers computed in the next question).
1.5 Box plots and outliers
(Q1) Use the functions “ggplot()” and “geom_boxplot()” to create a box plot which summarises the distribution of hawk weights broken down by species. Your plot should look as follows:
Note the outliers are displayed as individual dots. (Q2) quantile and boxplots
Compute the 0.25-quantile, 0.5-quantile, 0.75-quantile of the “Weight” grouped by
Species. Your results should look like this
## # A tibble: 3 × 4
## Species quantile025 quantile050 quantile075
##
## 1 CH 335 378. 505
## 2 RT 980 1070 1210
## 3 SS 100 155 178.
Now compare these values with the boxplot above. Can you explain which parts of the boxplot these numbers correspond to?
(Q3) Outliers
Suppose we have a sample x1, … , xn. Let “q25” denote the 0.25-quantile of the
sample and let “q75” denote the 0.75-quantile of the sample. We can then define the interquartile range, denoted IQR by IQR := q75-q25. In the context of boxplots, an
outlier xi is any numerical value such that the following holds if either of the following holds:
xi < q25 − 1.5 × IQR, or
xi > q75 + 1.5 × IQR.
Create a function called ““num_outliers”” which computes the number of outliers within a sample (with missing values excluded). The function should take a vector as input as output a number.
Test your ““num_outliers”” function using the code below:
num_outliers( c(0, 40,60,185))
## [1] 1
(Q4) Outliers by group
Now combine your function “num_outliers()” with the functions “group_by()” and “summarise()” to compute the number of outliers for the three samples of hawk weights broken down by species. Your result should look as follows:
## # A tibble: 3 × 2
## Species num_outliers_weight
##
## 1 CH 3
## 2 RT 13
## 3 SS 4
You may want to go back to the above box plot to check the number of dots displayed for each group.
1.6 Covariance and correlation under linear transformations
(Q1) Compute the covariance and correlation between the “Weight” and “Wing” of the “Hawks” data. You can use the cov and cor functions.
(Q2) Suppose that we have a pair of variables: x with values x1, ⋯ ,xn and Y with values Y1, ⋯ , Yn. Suppose that x1, ⋯ ,xn and Y1, ⋯ , Yn have the sample covar~iance S a~nd corr~elation R. Let~a, b ∈ ℝ be real numbers and define a new variable x with x1, ⋯ ,xn defined by xi = axi + bfo~r i = 1,~2, ⋯ ,n~. In addition, ~let c, d ∈ ℝ be real numbers and define a new variable Y with Y1, ⋯ , Yn defined by Yi = cYi + d.
What is the covariance between x(~)1, ⋯ , x(~)n and Y(~) with Y1(~), ⋯ , Y(~)n (as a funct~ion of S~, a, b, c, d~)?Assu~ming~that a ≠ 0 and c ≠ 0, what is the correlation between x1, ⋯ ,xn and Y with Y1, ⋯ , Yn? Please write down the mathematical expressions.
Let a = 2.4, b = 7.1, c = −1, d = 3, and let x be the hawk’s weight and Y be the
hawk’s Wing. Verify your conclusion above with R codes in a similar way to Section 1.3 (Q1).
2. Random variables and discrete random variables
This section covers some of the concepts from Lecture 13.
Random variables and discrete random variables Suppose we have a probability space (Ω, ℰ, ℙ). A random variable is a mapping X: Ω → ℝ, such that
for every a, b ∈ ℝ, {w ∈ Ω:X(w) ∈ [a, b]} is an event in ℰ
A discrete random variable is a random variable X: Ω → ℝ whose distribution is supported on a discrete (and hence finite or countably infinite) set A ⊆ ℝ
Expectation
The expectation E(X) of the random variable X is defined by E(X) : = ∑x∈ℝx ⋅ px (x).
Linearity of Expectation: Given random variables X1,X2, ⋯Xn and numbers α1, α2, ⋯ , αn, we have
E (Σi1 αi Xi) = Σi1 αi E(Xi)
Equivalent condition for independent random variables
Let X1, ⋯ ,Xk : Ω → ℝ be a sequence of random variables. Then X1, ⋯ ,Xk are
independent if and only if the following relationship holds for every sequence of well-behaved function f1,f2, ⋯ ,fk,
E(f1(X1) ⋯ fk (Xk)) = E(f1(X1)) ⋯ E(fk (Xk)).
2.1 Expectation and variance
(Q1) Suppose that we have random variables X and Y. Recall that the covariance between X and Y is defined by
Cov(X, Y) : = E[(X − X(‾)) ⋅ (Y − Y(‾))]
where X(‾) and Y(‾) are the expectations of X and Y, respectively.
Now, suppose X and Y are independent. Apply the linearity of expectation and the above equivalent condition for independent random variables described above, to prove that Cov(X, Y) = 0. To apply the above equivalent condition for independent random variables, you may choose proper functions f1,f2, ⋯ that you need.
2.2 Distributions
Suppose that α, β ∈ [0,1] with α + β ≤ 1 and let x be a discrete random variable
with distribution supported on {0,3,10}. Suppose that ℙ(x = 3) = α and ℙ(x = 10) = β and ℙ(x ∉ {0,3,10}) = 0.
Given the random variable x, answer the following questions (Q1), (Q2), …, Q(4).
(Q1) Expectation and variance of a discrete random variable. Please write down the formula of the following key quantities as expressions of α and β .
1. What is the probability mass function px for x?
2. What is the expectation of x?
3. What is the variance of x?
4. What is the standard deviation of x?
(Q2) Distribution and distribution function.
Recall that the distribution of a random variable maps a subsets of ℝ to a real number.
1. Write down the distribution Px of x. You can use the indicator function 1s to
“indicate” if a number is in the sets. (To give an example, the distribution of a Bernoulli random variable can be written as P(s) = (1 − q)1s(0) +
q1s (1).)
2. Write down the distribution function Fx of x
(Q3) Variance and Covariance.
Define a new random variable Y = x1 + x2 + ⋯ + xn where x1,x2, ⋯ ,xn are independent random variables, each of which has the same distribution as the random variable x.
Derive the variance of Y as a mathematical expression of α, β and n. You can use the conclusion from Question 2.1 (Q1).
(Q4) Carry out an experiment with “R” programming to explore the distribution of the sum of n independent random variables when n is large.
Let Y be defined above, and additionally, let α = 0.2 and β = 0.3. Recall that
x1,x2, ⋯ ,xn are discrete random variables. Explain why the random variable Y is discrete.
Now let’s explore the probability mass function of Y, and see its behaviour when n is large.
(Step 1).
First, write a function called “Gen X numbers” to generate a sample for x1,x2, ⋯ ,xn. The function “Gen X numbers” should take a number “n” as input and return a
vector containing n random numbers generated from the distribution of x.
To generate a single random number of x, first generate a random number from the uniform distribution of [0,1]. If the generated number is in [0,0.5), assign 0 to the
value of x. If the generated number is in [0,5,0.7), assign 3 to the value of x. If the generated number is in [0,7,1], assign 10 to the value of x.
To generate random numbers from the uniform distribution of [0,1], you could use the “runif” function (type ?runif to see the information about runif).
The result may look like this:
Gen_X_numbers(4)
## [1] 0 10 3 0
# the results you get might be different because of randomness
(step 2)
Write a function called “Gen Y samples” to generate a sample of Y. The function
“Gen Y samples” should take m andn as input and return a data frame that has a
column called Y and m rows. The Y column contain sample of Y of size m (i.e., m
random numbers for the random variable Y : = x1 + x2 + ⋯ + xn). You may want to use the “Gen X numbers” function you defined and the “map_dbl” function for
iterations.
The result may look like this:
Gen_Y_samples(5, 2)
## index Y
## 1 1 10
## 2 2 13
## 3 3 10
## 4 4 13
## 5 5 3
# the results you get might be different because of randomness
(Step 3)
Now let n = 3 and therefore Y : = X1 + X2 + X3 . Generate m : = 50000 samples of Y. Use the “ggplot” and “geom_bar()” function to create a bar plot for the samples of Y.
Of course your results might look different because of the randomness when using the “runif()” function. Notice that Y takes values from a subset of [0,30].
(Step 4)
Now, increase the values of n to 20. Generate a new sample for Y and create a new bar plot for the sample of Y. What are the maximum and minimum values of the samples? What is the sample range?
The minimum value, maximum value, and range of the sample is
(Step 5) Next, increase n to 1000 and do a similar plot for the sample of Y.
The bar plots contain information about the distribution of y : = x1 + x2 + ⋯ + xn for different n.
Notice that as we increase n, the distribution of y tends to follow a bell-like shape. While y is a discrete random variable (no matter how large n is), the distribution looks closer to the distribution of a continuous random variable, which is known as a Gaussian random variable. We will explore this behaviour in our lecture 14.
Reviews
There are no reviews yet.