SCHOOL OF DESIGN, COMMUNICATION AND IT
INFT6201 BIG DATA TUTORIAL PROJECT 2
This tutorial project is based on a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, which is available from the UCI Machine Learning Repository (Lichman, 2013):
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
EXERCISE 1 (1 MARK) [R-CODE]
Use ggplot() to create a box plot that shows BMI on the y-axis, separately for women who have and have not been diagnosed with diabetes. Note: Only include observations with a BMI value greater than 0.
EXERCISE 2 (2 MARKS) [R-CODE]
Use ggplot() to create a violin plot that shows the TSFT value on the y-axis, separately for women who have and have not been diagnosed with diabetes. Use the Paired colour palette from the RColorBrewer library to fill the violin plots. Add a box plot on top of the violin plot and add a point that indicates the mean value. Note: Only include observations with a TSFT value greater than 0.
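The layering described above can be sketched roughly as follows. This is only an illustration on simulated values: the data frame toy and its contents are made up, the layer widths and sizes are arbitrary choices, and the fun argument of stat_summary() assumes ggplot2 version 3.3.0 or later.

```r
library(ggplot2)

# Toy stand-in for the tutorial data (values are simulated, not real).
set.seed(1)
toy <- data.frame(
  TSFT     = c(rep(0, 10), round(runif(90, 7, 60))),
  diabetes = rep(c(0, 1), times = 50)
)

# Keep only observations with a recorded (non-zero) TSFT value.
toy_pos <- subset(toy, TSFT > 0)

p <- ggplot(toy_pos, aes(x = factor(diabetes), y = TSFT, fill = factor(diabetes))) +
  geom_violin() +                                        # violin per diagnosis group
  geom_boxplot(width = 0.15, fill = "white") +           # box plot drawn on top
  stat_summary(fun = mean, geom = "point", size = 3) +   # point marking the group mean
  scale_fill_brewer(palette = "Paired") +                # ColorBrewer "Paired" palette
  labs(x = "Diabetes (0 = negative, 1 = positive)",
       y = "TSFT (mm)", fill = "Diabetes")

print(p)
```

Layers are drawn in the order they are added, so the box plot and the mean point appear on top of the violin.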
EXERCISE 3 (1 MARK) [R-CODE]
Use the subset() command to create a subset of the data frame that only includes observations with BMI > 0 and TSFT > 0. Name this data frame pimadatasub. Then, using the newly created data frame pimadatasub, use the custom winsor() function discussed in the lecture slides in week 3 to create a new variable BMIwinsor based on the variable BMI. Use a multiplier of 1.5.
To make sure that the winsorising worked, compare the two variables by creating simplified box plots using the following commands.
with(pimadatasub, boxplot(BMI))
with(pimadatasub, boxplot(BMIwinsor))
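The lecture's winsor() function is not reproduced in this document, so the sketch below uses a hypothetical stand-in. It assumes the multiplier scales the interquartile range, so that values beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR are pulled back to those fences; the version in the slides may be defined differently. The data frame names with the _toy suffix and all values are made up for illustration.

```r
# Hypothetical stand-in for the lecture's winsor() helper (assumption):
# values beyond Q1 - multiplier * IQR or Q3 + multiplier * IQR are
# replaced by the nearest fence.
winsor <- function(x, multiplier = 1.5) {
  q     <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr   <- q[2] - q[1]
  lower <- q[1] - multiplier * iqr
  upper <- q[2] + multiplier * iqr
  pmin(pmax(x, lower), upper)
}

# Toy data (made-up values) standing in for the course data frame.
pimadata_toy <- data.frame(
  BMI  = c(0, 22.1, 25.3, 27.8, 30.2, 31.5, 33.0, 35.4, 67.1),
  TSFT = c(20, 0, 18, 25, 30, 32, 35, 40, 45)
)

# Drop rows where either measurement is recorded as zero.
pimadatasub_toy <- subset(pimadata_toy, BMI > 0 & TSFT > 0)

# Add the winsorised copy of BMI as a new column.
pimadatasub_toy$BMIwinsor <- winsor(pimadatasub_toy$BMI, multiplier = 1.5)

# Compare the ranges before and after winsorising.
range(pimadatasub_toy$BMI)
range(pimadatasub_toy$BMIwinsor)
```

If the winsorising worked, the box plot of the winsorised variable should show no points beyond the whiskers that were present in the original.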
EXERCISE 4 (2 MARKS) [R-CODE]
Based on the dataset pimadatasub, create a new column agecat in the data frame that describes the age category of a person. Distinguish between the following categories: 21 to 30, 31 to 45, 46 to 60, and 61 to 85. Convert the column into a factor variable using the as.factor() command.
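One possible way to build such a category column is cut(), sketched here on made-up ages. The sketch assumes non-overlapping bands (21 to 30, 31 to 45, 46 to 60, 61 to 85), with the second band ending at 45; the data frame toy is hypothetical.

```r
# Toy ages (made up); the real column would be pimadatasub$age.
toy <- data.frame(age = c(21, 30, 31, 45, 46, 60, 61, 81))

# Assumed non-overlapping bands: 21-30, 31-45, 46-60, 61-85.
toy$agecat <- cut(
  toy$age,
  breaks = c(20, 30, 45, 60, 85),   # right-closed intervals: (20,30], (30,45], ...
  labels = c("21 to 30", "31 to 45", "46 to 60", "61 to 85")
)

# cut() already returns a factor; as.factor() is kept to mirror the task wording.
toy$agecat <- as.factor(toy$agecat)

table(toy$agecat)
```

Because the intervals are closed on the right, an age of exactly 30 falls into the first band and an age of 31 into the second.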
Use ggplot() to create a scatterplot for BMI over TSFT. Indicate the different age categories by colouring the points in the scatterplot with the GrandBudapest palette of the wesanderson package.
EXERCISE 5 (1 MARK) [R-CODE]
Based on the dataset pimadatasub, use the ddply() function of the package plyr to create a data frame with the means and standard deviations of BMI, TSFT, and BMI for the three different age categories (variable: agecat, cf. Exercise 4) and for the two different results of the diabetes test (positive / negative). The output should look like this:
EXERCISE 6 (2 MARKS) [R-CODE]
Based on the dataset pimadatasub, use Bartlett's test to test for variance homogeneity in the variable DBP across the three different age categories (variable: agecat, cf. Exercise 4). Interpret the results of the test and decide whether we should assume that the variances are homogeneous.
Then, use a one-way Analysis of Variance (ANOVA) to test whether there is a difference in mean DBP across the three different age categories and interpret the result. Conduct a post hoc analysis to determine which groups are significantly different from each other. How does the result of the test of variance homogeneity affect the post hoc analysis?
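The sequence of tests might look roughly like the sketch below, which uses only functions from base R's stats package. The data are simulated and the data frame toy is hypothetical, so the numbers it produces say nothing about the real dataset. Choosing Tukey's HSD and Welch-type pairwise t-tests as the two post hoc options is an assumption made for illustration; the lecture may recommend other procedures.

```r
# Simulated stand-in data: three age groups with different mean DBP.
set.seed(42)
toy <- data.frame(
  agecat = factor(rep(c("21 to 30", "31 to 45", "46 to 60"), each = 30)),
  DBP    = c(rnorm(30, 68, 10), rnorm(30, 74, 10), rnorm(30, 80, 10))
)

# Bartlett's test: H0 = the group variances are equal.
bt <- bartlett.test(DBP ~ agecat, data = toy)
bt

# One-way ANOVA: H0 = the group means are equal.
fit <- aov(DBP ~ agecat, data = toy)
summary(fit)

# Post hoc option 1: Tukey's HSD, which assumes homogeneous variances.
tukey <- TukeyHSD(fit)
tukey

# Post hoc option 2: pairwise t-tests without a pooled standard deviation
# (Welch-type), which do not rely on equal variances.
pw <- pairwise.t.test(toy$DBP, toy$agecat,
                      pool.sd = FALSE, p.adjust.method = "holm")
pw

# Welch's one-way test is the matching omnibus test for unequal variances.
welch <- oneway.test(DBP ~ agecat, data = toy, var.equal = FALSE)
welch
```

A small p-value from Bartlett's test would argue against pooling the variances, which is where the second post hoc option becomes relevant.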
EXERCISE 7 (1 MARK) [R-CODE]
Based on the dataset pimadatasub, compare the number of times a woman was pregnant across the three different age groups (variable: agecat, cf. Exercise 4). Which test should we use to test whether there is a significant difference and why? Conduct the test in R and interpret the result.
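One candidate worth considering is the Kruskal-Wallis rank-sum test, since pregnancy counts are discrete and often right-skewed, which makes the normality assumption of ANOVA doubtful. The sketch below shows how such a test could be run on simulated counts; the data frame toy is hypothetical, and whether this is the test intended by the exercise is an assumption.

```r
# Simulated stand-in data: Poisson-distributed counts for three age groups.
set.seed(7)
toy <- data.frame(
  agecat        = factor(rep(c("21 to 30", "31 to 45", "46 to 60"), each = 25)),
  timesPregnant = c(rpois(25, 2), rpois(25, 5), rpois(25, 7))
)

# Group medians give a first impression of the differences.
tapply(toy$timesPregnant, toy$agecat, median)

# Kruskal-Wallis: H0 = all groups come from the same distribution.
kw <- kruskal.test(timesPregnant ~ agecat, data = toy)
kw

# Possible follow-up: pairwise Wilcoxon rank-sum tests with Holm correction.
# exact = FALSE requests the normal approximation, since the counts contain ties.
pairwise.wilcox.test(toy$timesPregnant, toy$agecat,
                     p.adjust.method = "holm", exact = FALSE)
```

Checking the distribution of the real variable first, for example with a histogram per group, would help justify the choice between a parametric and a rank-based test.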
REFERENCES
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
DATASET
Pima Indians Diabetes Database
Description
A diabetes dataset. All patients here are females at least 21 years old of Pima Indian heritage. Note: Even though the dataset donors made no such statement, it seems very likely that zero values encode missing data for several variables.
Usage
Pimadata
Format
A data frame with 768 observations on the following 9 variables.
timesPregnant Number of times pregnant
PCG Plasma glucose concentration at 2 hours in an oral glucose tolerance test
DBP Diastolic blood pressure (mm Hg)
TSFT Triceps skin fold thickness (mm)
insulin 2-Hour serum insulin (mu U/ml)
BMI Body mass index (weight in kg/(height in m)^2)
DPF Diabetes pedigree function. It provides some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gives an idea of the hereditary risk one might have with the onset of diabetes mellitus.
age Age (Years)
diabetes Diabetes test result: 1 = tested positive for diabetes, 0 = tested negative for diabetes
Source
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.