[SOLVED] CS

$25

File Name: CS_.zip
File Size: 28.26 KB

5/5 - (1 vote)

title: DSA8001 Practicals & Solutions
output:
html_notebook:
toc: true
toc_float: true

`{r include=FALSE}
if(!require(tidyverse)) install.packages(tidyverse)
if(!require(car)) install.packages(car)
library(tidyverse)
library(car)
`

# Analysing Normally Distributed Data

## Question 1

Head lengths of brushtail possums follow a nearly normal distribution with mean 92.6 mm and standard deviation 3.6 mm.

1. Compute the Z-scores for possums with head lengths of 95.4 mm and 85.8 mm.
2. Use calculated Z-scores to determine how many standard deviations above or below the mean measured head lengths of these two possums fall
3. Head length of which possum is more unusual?

### Solution

1. Compute the Z-scores for possums with head lengths of 95.4 mm and 85.8 mm

`{r}

z_score <- function(x, mu, sigma) {z = (x – mu) / sigmaz}#mean 92.6 mm and standard deviation 3.6 mmmu_possums = 92.6sigma_possums = 3.6# 1. Compute the Z-scores for possums with head lengths of 95.4 mm and 85.8 mmx1 <- 95.4×2 <- 85.8z1 <- z_score(x = x1, mu = mu_possums, sigma = sigma_possums)z2 <- z_score(x = x2, mu = mu_possums, sigma = sigma_possums)paste0(“z1 = “, round(z1, 2))paste0(“z2 = “, round(z2, 2))“`2. A possum with the head length of 95.4 is 0.78 standard deviations ABOVE the meanA possum with the head length of 85.8 is 1.89 standard deviations BELOW the mean3. Because `r paste0(“|z2| = “, round(abs(z2), 2), ” >= |z1| = , round(abs(z1), 2))`, opossum with the head length of 85.8 mm is more unusual than
opossum with the head length of 95.4 mm

## Question 2

Suppose the average number of Facebook friends is approximated well by the normal model N(mu = 1500, sigma = 300). Randomly selected person Julie has 1800 friends.

1. She would like to know what percentile she falls among other Facebook users?
2. What is the percentage of people that have more friends than Julie?

### Solution

1.

`{r}
julie_pctile <- pnorm(1800, mean = 1500, sd = 300) * 100julie_pctile“`Julie is `r round(julie_pctile, 2)` percentile.2. If `r round(julie_pctile, 2)`% have less facebook friends than Julie, than the proportion of people that have more friends is `r 100 – round(julie_pctile, 2)`%.## Question 3Suppose the average number of Facebook friends is approximated well by the normal model N(mu = 1500, sigma = 300). What is the probability that a randomly selected person has **AT LEAST** 1630 friends on Facebook? NOTE: Round solution to 3 decimal places.### Solution“`{r}# solution 1prob1 <- 1 – pnorm(1630, mean = 1500, sd = 300)prob1 <- round(prob1, 3)prob1# OR solution 2prob2 <- pnorm(1630, mean = 1500, sd = 300, lower.tail = FALSE)prob2 <- round(prob2, 3)prob2 # inspecting whether results are the sameprob1 == prob2“`The probability that randomly selected person has at least 1630 friends on Facebook is `r prob1`.## Question 4Suppose the average number of Facebook friends is approximated well by the normal model N(mu = 1500, sigma = 300). A randomly selected person is at the 79.95th percentile. How many Facebook friends does this person have?### Solution“`{r}friends_count <- qnorm(0.7995, mean = 1500, sd = 300)friends_count“`Randomly selected person, which is at the 79.95th percentile, has `r round(friends_count)` friends on Facebook.## Question 5At Heinz factory the amounts which go into bottles of ketchup are supposed to be normally distributed with mean 36 oz. standard deviation 0.11 oz. Once every 30 minutes a bottle is selected from the production line, and its contents are noted precisely. If the amount of ketchup in the bottle is below 35.8 oz. or above 36.2 oz., then the bottle fails the quality control inspection. 1. What percentage of bottles have less than 35.8 ounces of ketchup?2. What percentage of bottles **PASS** the quality control inspection? NOTE: Round solutions to 2 decimal places### Solution1.“`{r}paste0(round(pnorm(35.8, mean = 36, sd = 0.11) * 100, 2), “% of bottles have less than 35.8 oz of ketchup.”)“`2.“`{r}pass_inspection <- pnorm(36.2, mean = 36, sd = 0.11) – pnorm(35.8, mean = 36, sd = 0.11)pass_inspection# converting to percentages and rounding to 2 decimal placespass_inspection <- round(pass_inspection * 100, 2)pass_inspection““r pass_inspection`% of bottles pass inspection.## Question 6Body temperatures of healthy humans are distributed nearly normally with mean 98.2F and standard deviation 0.73F. What is the cutoff for the lowest 3% of human body temperatures?NOTE: Round solution to 1 decimal place.### Solution“`{r}paste0(“The cutoff value for the lowest 3% of human body temperature is “,round(qnorm(0.03, mean = 98.2, sd = 0.73), 1), “F”)“`# Statistical Inference## Test of single mean (mu) using the Z statistic### Question 1The mean content of a sample of 120 bottles of milk from one days output of a dairy was found to be 0.9975 litres, while the standard deviation of the sample was 0.012 litres. investigate if there is evidence to suggest that the mean content of that days output is different from 1 litre.#### Solution“`{r}# H0: mu = 1# HA: mu != 1mu0 <- 1x_bar <- 0.9975stdev <- 0.012n <- 120z <- (x_bar – mu0)/sqrt(stdev^2/n)paste0(“z = “, round(z, 2))# P(|Z| > 3.142506) = ?
# because we have two sided test p_value is calculated as follows
p_value <- 2 * pnorm(z)p_value <- round(p_value, 3)paste0(“p_value = “, p_value)paste0(“Test is significant at 5% level, because 1% < p = “,p_value * 100, “% <= 5%.There is considerable evidence for rejection H0 in favour HA.”)# calculating 95% C.I.z_critical <- qnorm(0.975)ci <- c(x_bar – z_critical * sqrt(stdev^2/n1), x_bar + z_critical * sqrt(stdev^2/n1))ci <- round(ci, 2)paste0(“95% C.I. is [“, ci[1], “, “, ci[2], “]”)“`## Test of the comparison of two means using the Z statistic### Question 1Suppose we wish to determine if there is a difference in mean weight between the two sexes in a particular bird species, at a 5% significance level. The following data were obtained: * Male sample size n1 = 125, mean weight x1_bar = 92.31 g, and variaiance var1 = 56.22 g^2* Female sample size n2 = 85, mean weight x2_bar = 88.84 g, and variaiance var2 = 65.41 g^2If significant, give a 95% CI for mu1 – mu2.#### Solution“`{r}# H0: mu1 = mu2# HA: mu1 != mu2 i.e. mu1 – mu2 != 0n1 <- 125x1_bar <- 92.31var1 <- 56.22n2 <- 85x2_bar <- 88.84var2 <- 65.41z <- (x1_bar – x2_bar) / sqrt((var1/n1) + (var2/n2))paste0(“z = “, round(z, 2))# P(|Z| > 3.142506) = ?
# because we have two sided test p_value is calculated as follows
p_value <- 2 * pnorm(z, lower.tail = FALSE)p_value <- round(p_value, 3)paste0(“p_value = “, p_value)paste0(“Test is highly significant at 5% level, because 0.1% < p = “,p_value * 100, “% <= 1%.There is considerable evidence for rejection H0 in favour HA.”)# calculating 95% C.I.z_critical <- qnorm(0.975)ci <- c((x1_bar – x2_bar) – z_critical * sqrt((var1/n1) + (var2/n2)), (x1_bar – x2_bar) + z_critical * sqrt((var1/n1) + (var2/n2)))ci <- round(ci, 2)paste0(“95% C.I. is [“, ci[1], “, “, ci[2], “]”)“`## Test of single proportion### Question 1In a random sample of 120 graduates, 78 spent 3 years at university and 42 more than 3 years. Test the hypothesis that 70% of graduates obtain degrees in a length of time which is different from 3 years. Give a 95% c.i. for the population proportion.NOTE: Round solution to 3 decimal places.#### Solution“`{r}# H0: p = 0.7# HA: p != 0.7p <- 0.7# sample sizen <- 120# sample proportionp_hat <- 78/120# Test statisticz <- (p_hat – p) / sqrt(p * (1 – p) / n)paste0(“z = “, round(z, 1))# because we have two sided test p_value is calculated as followsp_value <- 2 * pnorm(z)p_value <- round(p_value, 3)paste0(“p_value = “, p_value)paste0(“The test is not significant at 5% level and H0 is not rejected in favour of HA because p_value = “,p_value * 100, “% > 5%.
)

# 95% C.I.
z_critical <- qnorm(0.975)ci <- c(p_hat – z_critical * sqrt(p_hat * (1-p_hat) / n), p_hat + z_critical * sqrt(p_hat * (1-p_hat) / n))ci <- round(ci, 3)paste0(“95% C.I. is [“, ci[1], “, “, ci[2], “]”)# or alternatively test_results <- prop.test(x = c(78), n = c(120), p = c(0.7), correct = F, alternative = “two.sided”)test_resultsp_value_alt <- round(test_results$p.value, 3)paste0(“p_value_alt = “, p_value_alt)p_value_alt == p_valuepaste0(“The test is not significant at 5% level and H0 is not rejected in favour of HA because p_value_alt = “,p_value_alt * 100, “% > 5%.
)

# Slightly different values of the 95% CI are obtained because
# Rs CI formula is based on the Yates continuity correction formula which is used in Chi-square tests
ci_alt <- test_results$conf.intci_alt <- round(ci_alt, 3)paste0(“95% C.I. is [“, ci_alt[1], “, “, ci_alt[2], “]”)“`## Test of two proportions### Question 1We wish to compare the germination rates of spinach seeds for two different methods of preparation:* __Method A:__ 80 seeds sown, 65 germinate* __Method B:__ 90 seeds sown, 80 germinateNOTE: Round solution to 3 decimal places.#### Solution“`{r}# H0: p1 = p2# HA: p1 != p2n1 <- 80×1 <- 65n2 <- 90×2 <- 80p1_hat <- x1/n1p2_hat <- x2/n2p_hat <- (x1 + x2)/(n1 + n2)# Test statisticz <- (p1_hat – p2_hat) / sqrt(p_hat * (1 – p_hat) * (1/n1 + 1/n2))paste0(“z = “, round(z, 1))# because we have two sided test p_value is calculated as followsp_value <- 2 * pnorm(z)p_value <- round(p_value, 3)paste0(“p_value = “, p_value)paste0(“The test is not significant at 5% level and H0 is not rejected in favour of HA because p_value = “,p_value * 100, “% > 5%.
)

# 95% C.I.
z_critical <- qnorm(0.975)ci <- c((p1_hat – p2_hat) – z_critical * sqrt(p1_hat * (1 – p1_hat) / n1 + p2_hat * (1 – p2_hat) / n2), (p1_hat – p2_hat) + z_critical * sqrt(p1_hat * (1 – p1_hat) / n1 + p2_hat * (1 – p2_hat) / n2))ci <- round(ci, 3)paste0(“95% C.I. is [“, ci[1], “, “, ci[2], “]”)# or alternatively test_results <- prop.test(x = c(x1, x2), n = c(n1, n2), correct = F, alternative = “two.sided”)test_resultsp_value_alt <- round(test_results$p.value, 3)paste0(“p_value_alt = “, p_value_alt)p_value_alt == p_valuepaste0(“The test is not significant at 5% level and H0 is not rejected in favour of HA because p_value_alt = “,p_value_alt * 100, “% > 5%.
)

# Slightly different values of the 95% CI are obtained because
# Rs CI formula is based on the Yates continuity correction formula which is used in Chi-square tests
ci_alt <- test_results$conf.intci_alt <- round(ci_alt, 3)paste0(“95% C.I. is [“, ci_alt[1], “, “, ci_alt[2], “]”)“`## Tests based on the t-distribution (Single Mean)### Question 1What proportion of the t-distribution with 18 degrees of freedom falls below -2.10? NOTE: Round solution to 3 decimal places.#### Solution“`{r}# plain R syntaxround(pt(-2.1, df = 18), 3)# tidyverse syntaxpt(-2.1, df = 18) %>%
round(3)
`

### Question 2

What proportion of the t-distribution with 20 degrees of freedom falls above 1.65?

NOTE: Round solution to 3 decimal places.

#### Solution

`{r}
# tidyverse syntax
pt(1.65, df = 20, lower.tail = FALSE) %>%
round(3)
`

### Question 3

What proportion of the t-distribution with 2 degrees of freedom falls 3 standard deviations from the mean (above or below)?

NOTE: Round solution to 3 decimal places.

#### Solution

`{r}
# solution 1

below_3sd <- pt(-3, df = 2)above_3sd <- pt(3, df = 2, lower.tail = FALSE)paste0(“Proportion below 3 SD: “, below_3sd)paste0(“Proportion above 3 SD: “, above_3sd)paste0(“Proportion of the t-distribution that falls 3SD above or below the mean is: “,(below_3sd + above_3sd) %>%
round(3)
)

# solution 2

paste0(
Proportion of the t-distribution that falls 3SD above or below the mean is: ,
(2 * pt(-3, df = 2)) %>%
round(3)
)

# solution 3

paste0(
Proportion of the t-distribution that falls 3SD above or below the mean is: ,
(2 * pt(3, df = 2, lower.tail = FALSE)) %>%
round(3)
)

`

### Question 4

The temperature of warm water springs in a basin is reported to have a mean of 38C. A sample of 12 springs from the west end of the basin had mean temperature 39.4 and variance 1.92.

1. Have springs at the west end a greater mean temperature?
2. Have springs at the west end a different mean temperature? Give a 95% c.i. for the mean temperature.

NOTE: Round solution to 3 decimal places.

#### Solution

`{r}

t = (39.4 38) / sqrt(1.92 / 12)

paste0(Under H0, t=, t, is an observation from t11)

`

1.

`{r}

# H0: mu = 38
# H1: mu > 38

p_value <- pt(t, df = 11, lower.tail = FALSE) %>%
round(3)

paste0(p_value = , p_value)

paste0(Test is highly significant at 5% level, because 0.1% < p = “,p_value * 100, “% <= 1%.There is considerable evidence for rejection H0 in favour HA.”)“`2.“`{r}# H0: mu = 38# H1: mu != 38p_value <- 2 * pt(t, df = 11, lower.tail = FALSE) %>%
round(3)

paste0(p_value = , p_value)

paste0(Test is highly significant at 5% level, because 0.1% < p = “,p_value * 100, “% <= 1%.There is considerable evidence for rejection H0 in favour HA.”)# Give a 95% c.i. for the mean temperature# calculate a critical valuet11_critical <- qt(0.975, df=11) %>% # upper 2.5% of t11
round(3)

ci <- c(39.4 – t11_critical * sqrt(1.92 / 12), 39.4 + t11_critical * sqrt(1.92 / 12))cipaste0(“95% C.I. is [“, ci[1], “, “, ci[2], “]”)“`### Question 5Sweets producing company was interested in the mean net weight of contents in an advertised 80-gram pack. The manufacturer has precisely weighed the contents of 24 randomly selected 80-gram packs from different stores and recorded the weights as follows:“`{r}measured_weights <- c(87.7,80.01,77.28,78.76,81.52,74.2,80.71,79.5,77.87,81.94,80.7, 82.32,75.78,80.19,83.91,79.4,77.52,77.62,81.4,74.89,82.95, 73.59,77.92,77.18)“`1. Investigate the hypothesis that the sweets content in the packages is lesser than what is claimed on the package.2. Investigate the hypothesis that the sweets content in the packages is lesser than what is claimed on the package. Give a 95% c.i. for the mean weight.NOTE: Round solution to 3 decimal places.#### Solution“`{r}mu = 80 # advertised weight contentweights_mean <- mean(measured_weights)weights_meanweights_var <- var(measured_weights) # variancen <- length(measured_weights) # or it could be manually set as n <- 24t <- ((weights_mean – mu) / sqrt(weights_var / n)) %>%
round(3)

paste0(Under H0, t=, t, is an observation from t23)
`

1.

`{r}

# H0: mu = 80
# H1: mu < 80p_value <- pt(t, df = n – 1, lower.tail = TRUE) %>% # n = 24 which implies that df = n 1 = 23
round(3)

paste0(p_value = , p_value)

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA, because p = ,
p_value * 100,
% > 5%.
)

# alternative solution with the same result
test_results <- t.test(measured_weights, mu = 80, alternative = “less”)p_value_alt <- test_results$p.value %>% round(3)

# check whether the alternative solution matches with the one that we created manually
p_value_alt == p_value

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA, because p = ,
p_value_alt * 100,
% > 5%.
)
`

2.

`{r}
# H0: mu = 80
# H1: mu != 80

p_value <- 2 * pt(t, df = n-1) %>%
round(3)

paste0(p_value = , p_value)

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA, because p = ,
p_value * 100,
% > 5%.
)

# Give a 95% c.i. for the mean temperature

# calculate a critical value
t23_critical <- qt(0.975, df=23) %>% # upper 2.5% of t11
round(3)

ci <- c(weights_mean – t23_critical * sqrt(weights_var / n), weights_mean + t23_critical * sqrt(weights_var / n)) %>%
round(3)
ci
paste0(95% C.I. is [, ci[1], , , ci[2], ])

# alternative solution with the same result
test_results <- t.test(measured_weights, mu = 80, alternative = “two.sided”)p_value_alt <- test_results$p.value %>% round(3)

# check whether the alternative solution matches with the one that we created manually
p_value_alt == p_value

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA, because p = ,
p_value_alt * 100,
% > 5%.
)

ci_alt <- test_results$conf.int %>%
round(3)

paste0(95% C.I. is [, ci_alt[1], , , ci_alt[2], ])
`

## Tests based on the t-distribution (Paired Coparison)

### Question 1

Consider an experiment to compare the effects of two sleeping drugs A and B. There are 10 subjects and each subject receives treatment with each of the two drugs (the order of treatment being randomised). The number of hours slept by each subject is recorded. Is there any difference between the effects of the two drugs? Give a 95% c.i. for the unknown mean difference.

NOTE: Round solution to 3 decimal places.

`{r}
df <- data.frame(Subject_ID = 1:10,Hours_Sleeping_A = c(9.9, 8.8, 9.1, 8.1, 7.9, 12.4, 13.5, 9.6, 12.6, 11.4),Hours_Sleeping_B = c(8.7, 6.4, 7.8, 6.8, 7.9, 11.4, 11.7, 8.8, 8.0, 10.0))“`#### Solution“`{r}# mu_A – mu_B = 0# mu_A – mu_B != 0mu_diff = mean(df$Hours_Sleeping_A – df$Hours_Sleeping_B) # or alternatively: mu_diff = mean(df$Hours_Sleeping_A) – mean(df$Hours_Sleeping_B)mu_diff # mean of the differences var_diff = var(df$Hours_Sleeping_A – df$Hours_Sleeping_B)n = nrow(df)t = ((mu_diff – 0)/sqrt(var_diff / n)) %>%
round(3)

paste0(Under H0, t=, t, is an observation from t9)

p_value <- (2 * pt(t, df = n – 1, lower.tail = FALSE)) %>% round(3)
paste0(p_value = , p_value)

paste0(Test is highly significant at 5% level, because 0.1% < p = “,p_value * 100, “% <= 1%.There is considerable evidence for rejection H0 in favour HA.”)# calculate a critical valuet23_critical <- qt(0.975, df= n – 1) %>% # upper 2.5% of t11
round(3)

ci <- c(mu_diff – t23_critical * sqrt(var_diff / n), mu_diff + t23_critical * sqrt(var_diff / n)) %>%
round(3)
ci
paste0(95% C.I. is [, ci[1], , , ci[2], ])

# or alternative solution

# alternative solution with the same result
test_results <- t.test(df$Hours_Sleeping_A, df$Hours_Sleeping_B, paired = T, alternative = “two.sided”)p_value_alt <- test_results$p.value %>% round(3)

# check whether the alternative solution matches with the one that we created manually
p_value_alt == p_value

paste0(Test is highly significant at 5% level, because 0.1% < p_alt = “,p_value_alt * 100, “% <= 1%.There is considerable evidence for rejection H0 in favour HA.”)ci_alt <- test_results$conf.int %>%
round(3)

paste0(95% C.I. is [, ci_alt[1], , , ci_alt[2], ])
`

## Tests based on the t-distribution (Independent Samples)

### Question 1

Two methods of oxidation care are used in an industrial process. Repeated measurements of the oxidation time are made to test the hypothesis that the oxidation time of method 1 is different than that of method 2 on average.

* __Method 1:__ Sample size = 9, Sample mean = 41.3, Sample Variance = 20.7
* __Method 2:__ Sample size = 8, Sample mean = 48.9, Sample Variance = 34.2

Assuming that the unknown variances are equal, investigate if there is there any difference between the oxidation times of the two methods? Give a 95% c.i. for the unknown mean difference.

#### Solution

`{r}
# H0: mu1 mu2 = 0
# H1: mu1 mu2 != 0

n1 <- 9mean1 <- 41.3var1 <- 20.7n2 <- 8mean2 <- 48.9var2 <- 34.2df <- n1 + n2 – 2var_pooled <- ((n1 – 1) * var1 + (n2 – 1) * var2) / (n1 + n2 – 2)t <- ((mean1 – mean2) / sqrt(var_pooled * (1 / n1 + 1 / n2))) %>%
round(3)

paste0(Under H0, t=, t, is an observation from t, df)

p_value <- (2 * pt(t, df = df, lower.tail = TRUE)) %>% round(3)

paste0(p_value = , p_value)

paste0(Test is highly significant at 5% level, because 0.1% < p = “,p_value * 100, “% <= 1%.There is considerable evidence for rejection H0 in favour HA.”)“““{r}# 95% C.I.# calculate a critical valuet15_critical <- qt(0.975, df= df) %>% # upper 2.5% of t15
round(3)

ci <- c((mean1 – mean2) – t15_critical * sqrt(var_pooled * (1/n1 + 1/n2)), (mean1 – mean2) + t15_critical * sqrt(var_pooled * (1/n1 + 1/n2))) %>%
round(3)
ci
paste0(95% C.I. is [, ci[1], , , ci[2], ])
`

### Question 2

Two methods of oxidation care are used in an industrial process. Repeated measurements of the oxidation time are made to test the hypothesis that the oxidation time of method 1 is different than that of method 2 on average. The following measurements were recorded:

* __Method 1:__ `c(29.915269, 8.920123, 36.647273, 54.038639, 37.583526, 19.860171, 13.470132, 43.139612, 39.825299)`
* __Method 2:__ `c(28.970122, 43.563546, 4.161069, 39.774523, 5.705720, 93.562336,3.801087, 79.906087)`

Assuming that the unknown variances are equal, investigate if there is there any difference between the oxidation times of the two methods? Give a 95% c.i. for the unknown mean difference.

#### Solution

`{r}

method1 <- c(29.915269, 8.920123, 36.647273, 54.038639, 37.583526, 19.860171, 13.470132, 43.139612, 39.825299)method2 <- c(28.970122, 43.563546, 4.161069, 39.774523, 5.705720, 93.562336,3.801087, 79.906087)# regular R syntaxn1 <- length(method1)mean1 <- mean(method1)var1 <- var(method1)# dplyr syntaxn2 <- method2 %>% length()
mean2 <- method2 %>% mean()
var2 <- method2 %>% var()

df <- n1 + n2 – 2var_pooled <- ((n1 – 1) * var1 + (n2 – 1) * var2) / (n1 + n2 – 2)t <- ((mean1 – mean2) / sqrt(var_pooled * (1 / n1 + 1 / n2))) %>%
round(3)

paste0(Under H0, t=, t, is an observation from t, df)

p_value <- (2 * pt(t, df = df, lower.tail = TRUE)) %>% round(3)

paste0(p_value = , p_value)

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA because p = ,
p_value * 100,
% > 5%.
)
# 95% C.I.

# calculate a critical value
t15_critical <- qt(0.975, df= df) %>% # upper 2.5% of t15
round(3)

ci <- c((mean1 – mean2) – t15_critical * sqrt(var_pooled * (1/n1 + 1/n2)), (mean1 – mean2) + t15_critical * sqrt(var_pooled * (1/n1 + 1/n2))) %>%
round(3)
ci
paste0(95% C.I. is [, ci[1], , , ci[2], ])
`

Or Alternatively

`{r}
test_results <- t.test(method1, method2, var.equal = T, alternative = “two.sided”)p_value_alt <- test_results$p.value %>% round(3)

# check whether the alternative solution matches with the one that we created manually
p_value_alt == p_value

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA because p = ,
p_value_alt * 100,
% > 5%.
)

ci_alt <- test_results$conf.int %>%
round(3)

paste0(95% C.I. is [, ci_alt[1], , , ci_alt[2], ])
`

## Tests based on the Chi-Square distribution

### Question 1

What proportion of the chi-square distribution with 9 degrees of freedom falls above 17?

#### Question 1

`{r}
pchisq(17, df = 9, lower.tail = FALSE) %>% round(3)

# or alternatively
round(1 pchisq(17, df = 9), 3)
`

### Question 2

The geneticist Mendel evolved the theory that for a certain type of pea, the characteristics Round and Yellow, R and Green, Angular and Y, A and G occurred in the ratio 9:3:3:1. He classified 556 seeds and the observed frequencies were 315, 108, 101 and 32. Test Mendels theory on the basis of these data.

#### Solution

`{r}
# H0: p_ry = 9/16, p_rg = 3/16, p_ay = 3/16, p_ag = 1/16
# H1: probabilities not as in H0

# observations
o_ry <- 315o_rg <- 108o_ay <- 101o_ag <- 32observations <- c(o_ry, o_rg, o_ay, o_ag)df <- length(observations) – 1 # we have 4 groups so: df = 4 – 1 = 3o_total <- sum(observations) # or alternatively: o_total <- o_ry + o_rg + o_ay + o_ag # probabilitiesp_ry <- 9/16p_rg <- 3/16p_ay <- 3/16p_ag <- 1/16probabilities <- c(p_ry, p_rg, p_ay, p_ag)# expected frequenciesexpected_freqs <- o_total * probabilitiesphi_squared <- sum((observations – expected_freqs)^2/expected_freqs) %>% round(3)

paste0(Under H0, phi_squared=, phi_squared, is an observation from Chi-square_, df)

p_value <- pchisq(phi_squared, df = 3, lower.tail = FALSE) %>% round(3) # P(Chi-squared > 0.47)

paste0(p_value = , p_value)

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA because p = ,
p_value * 100,
% > 5%.
)
`

Or Alternatively

`{r}
chisq_test <- chisq.test(observations, p = probabilities)p_value_alt <- chisq_test$p.value %>% round(3)

p_value == p_value_alt

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA because p = ,
p_value_alt * 100,
% > 5%.
)
`

### Question 3

In a random sample of 120 graduates, 78 spent 3 years at University and 42 more than 3 years. Test hypothesis that 70% obtain degree in 3 years.

#### Solution

`{r}

# H0: P(degree in 3 years) = p = 0.7
# H1: p != 0.7

observations <- c(78, 42) # (degree in 3 years, degree in more than 3 years)df<- length(observations) – 1 # df = 2 – 1 = 1expectations <- c(0.7 * 120, 0.3 * 120) # (expected degree in 3 years, expected degree in more than 3 years)# because df = 1, we are using Yate’s correctionphi_squared <- sum(((abs(observations – expectations) – 1/2) ^ 2) / expectations ) %>% round(3)
phi_squared

paste0(Under H0, phi_squared=, phi_squared, is an observation from Chi-square_, df)

p_value <- pchisq(phi_squared, df = df, lower.tail = FALSE) %>% round(3) # P(Chi-squared > 1.2)

paste0(p_value = , p_value)

paste0(The test is not significant at 5% level and H0 is not rejected in favour of HA because p = ,
p_value * 100,
% > 5%.
)
`

## Analysis of Variance (ANOVA)

### Question1

**Iris** data set that comes preloaded with R gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

1. Create a copy of the Iris dataset that contains relevant variables, and:

1.1. Perform exploratory data analysis which will reveal how many flowers are in each category, how many sepal width values are missing, as well as what is the sepal width mean and standard deviation per each category.

1.2 Perform graphical analysis which will reveal how collected sepal width values compare across the species (histograms and boxplots)

1.3. Investigate if there is a difference between the means of the sepal width variable among this three species, and if there is, perform further investigation on which species have statistically significant difference between the means?

2. Similarly, repeat the investigation process for the other variables?

#### Solution

1. create a copy of the iris dataset that contains relevant variables

`{r}

head(iris)
glimpse(iris)

iris_copy <- iris %>%
select(
Sepal.Width, Species
)
`

1.1 Perform exploratory data analysis which will reveal how many flowers are in each category, how many sepal width values are missing, as well as what is the sepal width mean and standard deviation per each category.
`{r}
iris_copy %>%
group_by(Species) %>%
summarise(
n_samples = n(),
n_missing = sum(is.na(Sepal.Width)), # how many Sepal.Widths is missing in each category
mean_sepal_width = mean(Sepal.Width, na.rm = TRUE), # this dataset does not have missing values. If it did we would need na.rm=T
sd_sepal_width = sd(Sepal.Width, na.rm = TRUE)
)
`

1.2 Perform graphical analysis which will reveal how collected sepal width values compare across the species (histograms and boxplots)
`{r }
hist_setosa <- iris_copy %>%
filter(
Species == setosa
) %>%
pull(Sepal.Width) %>% # extract data.frame column as vector
hist(breaks = 10)

hist_versicolor <- iris_copy %>%
filter(
Species == versicolor
) %>%
pull(Sepal.Width) %>% # extract data.frame column as vector
hist(breaks = 10)

hist_virginica <- iris_copy %>%
filter(
Species == virginica
) %>%
pull(Sepal.Width) %>% # extract data.frame column as vector
hist(breaks = 10)
`

`{r}
plot(hist_setosa, col = scales::alpha(red,.3), main = Distribution of Sepal Widths, xlab = Sepal Width)
plot(hist_versicolor, col = scales::alpha(green,.3), add = T)
plot(hist_virginica, col = scales::alpha(blue,.3), add = T)
legend(topright, c(setosa, virginica, versicolor), col=c(scales::alpha(red,.3), scales::alpha(green,.3), scales::alpha(blue,.3)), lwd=10)

boxplot(iris_copy$Sepal.Width ~ iris_copy$Species)
`

1.3. Investigate if there is a difference between the means of the sepal width variable among this three species, and if there is, perform further investigation on which species have statistically significant difference between the means?

H0: The mean sepal width is the same across all species
HA: At least one mean is different than others

`{r}
# perform anova test
# NOTE: in our case if the variable Species was not of the factor type
# we would need to convert it to factor before conducting ANOVA test using the following approach
# iris_copy$Species <- factor(iris_copy$Species)anova_test <- aov(Sepal.Width ~ Species, data = iris_copy)# summary of the testsummary(anova_test)p_value <- summary(anova_test)[[1]][[1,’Pr(>F)]] %>%
round(3)

paste0(p_value = , p_value)

paste0(The test is very significant at 5% level because p = ,
p_value * 100,
% < 5%. We are very confidendt that HA is to be preferred to H0, i.e. at least one mean is different from the others”)“`**Check ANOVA conditions**Assuming the independence of data, we should check whether the other conditions are valid as well.**Condition 1:** The variability across the groups should be about equal.“`{r}plot(anova_test, 1)“`In the plot above, there is no evident relationship between residuals and fitted values which implies equal variances across the groups (homogeniety of variances).Alternatively, we can also use Levene’s test to test equality of variances:“`{r}# The variability across the groups should be about equal. leveneTest(Sepal.Width ~ Species, data = iris_copy)“`Because the p-value obtained from Levene’s test is p = 55.5% > 5%, the thest is not significant and there is no evidence to suggest that the variance across species is statistically significantly different (i.e. we can assume the equalirty of variance).

**Condition 2:** The observations within each group should be nearly normal.

This can be very difficult to determine in many real time situations, and to achieve this we need to mean-center each sepal width by its respective group mean. These group-wise, mean-centered values are also known as **residuals**, and by using them we can assess the normality of all observations as a whole.

`{r}
# Graphical method for assessing normality of residuals QQ Plot
plot(anova_test, 2)
`

Alternatively, for testing normality of residuals we can use **Shapiro-Wilk test**:

`{r}
anova_test_residuals <- residuals(anova_test)shapiro.test(anova_test_residuals)“`Because the p-value obtained from Shapiro-Wilk test is p = 32.3% > 5%, the thest is not significant and there is no evidence to suggest that the normality assumption is violated (i.e. we can assume normality of the residuals).

As we concluded that at least one pair of means differ, and because we do not know which one, we need to use **t-tests** with **Bonferroni correction** to compare each pair of means to each other (i.e. multiple comparisons).

`{r}
pairwise.t.test(iris_copy$Sepal.Width, iris_copy$Species, p.adjust.method = bonferroni)
`

**Conclusions:**

1. In the case of the mean difference in sepal widths between species **versicolor** and **virginica**, the test is highly significant at 5% level because 0.1% < p = 0.94% < 1%. There is considerable evidence for rejection H0 in favour of HA, i.e. considerable evidence of a difference between the average sepal widths of the species **versicolor** and **virginica**.2. In other two comparisons (**setosa-versicolor** and **setosa-virginica**) the test is very highly significant at 5% level because p = 0% < 1%. Therefore, we are very confident that HA is to be preferred to H0, i.e very confident that there is a difference between the average sepal widths of the species **setosa-versicolor** and **setosa-virginica**.## Simple Linear Regression### Question1**Iris** data set that comes preloaded with R gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Create a copy of the Iris dataset that contains variables Petal.Width and Petal.Length, and:1. Fit a simple linear regression create model for predicting petal widths based on the petal lengths.2. Plot the line of the best fit against the input dataset3. State the estimated simple linear regression equation4. State whether there is a significan relationship between the predictor and response variable.5. Assuming that both petal width and length values are given in milimetres (mm), state the interpretation of the estimated slope parameter6. If the petal lengths are 1.5, 1.6 and 1.7 mm, what are their estimated petal widths?#### Solution“`{r}# create a copy of the iris datasetiris_copy <- iris %>%
select(
Petal.Width,
Petal.Length
)
`

1.

`{r}
iris_model <- lm(Petal.Width ~ Petal.Length, data = iris_copy)“`2.“`{r}plot(iris$Petal.Length, iris$Petal.Width)abline(iris_model, col=”red”)“`3.“`{r}summary(iris_model)“`The summary output shows the following components:* __Call:__ Displays the formula that has been used to fit the regression model.* __Residuals:__ Displays summary statistics of residuals, which by definition should have a mean equal to zero. Therefore, as an indication of the normally distributed residuals, median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value. __NOTE:__ Similar to ANOVA, normality of the residuals can be inspected using Shapiro-Wilk test.* __Coefficients:__ Displays the values of the intercept and slope parameters,and their statistical significance. ANSWER:The estimated simple regression equation is: __Petal.Width_Estimated = -0.363076 + 0.415755 * Petal.Length__4.By looking at the summary output above, we can see that the p-value < 0.01%, so we can reject the null hypothesis that _1 = 0. Because the test is very highly significant at 5% level, we are very confident that there is a significant relationship between the predictor and the response variable in the linear regression model.5. For each additional mm increase in length, we would expect petal width to increase for 0.415755 mm.6.“`{r}new_data <-data.frame(Petal.Length = c(1.5, 1.6, 1.7)# predictor column name needs to be the same as the one in the original dataset)predict(iris_model, new_data)“`

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] CS
$25