#First, do all of the coding in an R script.
#When it is all working, you will use it to
#create an html file, using R Markdown
#Instructions for this part are at the bottom
#of this script.
#Hint: A parametric test assumes a distributional
#shape (normal, for example), as opposed to
#the randomization and bootstrap methods that
#we have used recently in class.
#################################################
#I. Setup tasks:
#Load ggplot2 and dplyr
#Download the genetic counseling data, which is
#the same as the data from Test 1.
#Both data sets are from a sample of patients who came to
#the UVM Medical Center for genetic counseling.
#Read in the Test2_gc.csv as gc, using the
#stringsAsFactors = FALSE argument.
#Read in Test2_gc_payment.csv as gcp; don't use
#the above argument for this file.
#Use dplyr to merge gc and gcp by matching cases.
#Call the result gcall, and make sure all of the
#cases in gc are in gcall.
#For the tasks below, use gcall.
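#A minimal sketch of the merge step, using toy data frames in place of
#the two CSV files. The key column `id` and the column `payment` are
#hypothetical; the real files may use different names. left_join()
#keeps every row of gc, which is one way to satisfy the requirement
#that all cases in gc appear in gcall.

```r
library(dplyr)

# Toy stand-ins for:
#   gc  <- read.csv("Test2_gc.csv", stringsAsFactors = FALSE)
#   gcp <- read.csv("Test2_gc_payment.csv")
# The key column `id` is an assumption, not taken from the real data.
gc <- data.frame(id = 1:4,
                 Sex = c("M", "F", "F", "M"),
                 stringsAsFactors = FALSE)
gcp <- data.frame(id = c(1L, 2L, 4L),
                  payment = c("Medicare", "Private", "Self-pay"))

# Keep every case in gc; cases with no match in gcp get NA for payment
gcall <- left_join(gc, gcp, by = "id")

# Check that no cases from gc were lost
nrow(gcall) == nrow(gc)
```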
#################################################
#II. Data cleaning tasks:
#Recode the responses to ResState so they say
#VT, NH, NY or Other.
#Make the age vector into a factor, and apply
#meaningful value labels: Under 1, 1 to 17,
#18 to 24, 25 to 29, 30 to 34, ..., 70 to 74,
#and 75 and up.
#Set values of Charges that are negative
#or greater than 5000 to missing values (NA).
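#A hedged sketch of the three cleaning steps on toy data. It assumes
#ResState already holds state abbreviations and that age is stored as
#integer codes in a column named `age`; both are assumptions, and
#only the first four age labels are shown.

```r
# Toy data; the column name `age` and its 0-3 coding are hypothetical
gcall <- data.frame(
  ResState = c("VT", "NH", "NY", "MA", "ME"),
  age      = c(0, 1, 2, 3, 1),
  Charges  = c(120, -5, 6200, 450, 5000),
  stringsAsFactors = FALSE
)

# Collapse every state other than VT, NH, NY into "Other"
gcall$ResState <- ifelse(gcall$ResState %in% c("VT", "NH", "NY"),
                         gcall$ResState, "Other")

# Turn the coded age variable into a labelled factor
gcall$age <- factor(gcall$age,
                    levels = 0:3,
                    labels = c("Under 1", "1 to 17", "18 to 24", "25 to 29"))

# Negative charges and charges above 5000 become NA;
# which() skips rows where Charges is already NA
bad <- which(gcall$Charges < 0 | gcall$Charges > 5000)
gcall$Charges[bad] <- NA

table(gcall$ResState)
summary(gcall$Charges)
```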
#################################################
#III. Descriptive Stats
#Use dplyr to create a list of the 10 diagnoses
#(ccsdx) that appear most often in the data, in
#descending order of frequency. Include
#the mean charges for each diagnosis, and the
#number of patients with that diagnosis.
#Use dplyr to create a list of the 10 diagnoses
#(ccsdx) that incur the highest charges, on average,
#in descending order of mean charges. Include
#the mean charges for each diagnosis, and the
#number of patients with that diagnosis.
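#One possible shape for the two diagnosis summaries, shown on toy
#data. The summary column names n_patients and mean_charges are
#illustrative choices, not required names.

```r
library(dplyr)

# Toy data standing in for gcall
gcall <- data.frame(
  ccsdx   = c("A", "A", "A", "B", "B", "C"),
  Charges = c(100, 200, NA, 1000, 2000, 50)
)

# One row per diagnosis: patient count and mean charges
dx_summary <- gcall %>%
  group_by(ccsdx) %>%
  summarise(n_patients   = n(),
            mean_charges = mean(Charges, na.rm = TRUE))

# Ten most frequent diagnoses
most_common <- dx_summary %>%
  arrange(desc(n_patients)) %>%
  head(10)

# Ten diagnoses with the highest mean charges
highest_cost <- dx_summary %>%
  arrange(desc(mean_charges)) %>%
  head(10)

most_common
highest_cost
```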
#Using dplyr, print the mean Charges for
#males and females in the data set.
#Using base package, create a vector, called
#mcharges, that contains only the charges for males.
#Also, create a vector, called fcharges, that
#contains only the charges for females.
#Have R print the mean for males, then the mean for females
#on the console, using mean(). Also print the difference
#between the two. The values should match the ones
#dplyr gives above.
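#A small base R sketch of the two vectors and their means, on toy
#data. It assumes Sex is coded "M" and "F", which may not match the
#real file.

```r
# Toy data; the "M"/"F" coding of Sex is an assumption
gcall <- data.frame(
  Sex     = c("M", "F", "M", "F", "F"),
  Charges = c(100, 300, 200, NA, 500)
)

# Charges for each group, using base subsetting only
mcharges <- gcall$Charges[gcall$Sex == "M"]
fcharges <- gcall$Charges[gcall$Sex == "F"]

# Means and their difference; na.rm = TRUE skips missing charges
mean_m <- mean(mcharges, na.rm = TRUE)
mean_f <- mean(fcharges, na.rm = TRUE)
print(mean_m)
print(mean_f)
print(mean_m - mean_f)
```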
#Use ggplot to create a series of boxplots showing charges by
#age groups, so that each boxplot is a different color.
#Give your graph a title, label the y axis
#"Medical Charges", and label the legend
#"Age Groups".
#Briefly describe the trend you observe, in terms
#of center, spread, and skewness.
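#A sketch of the boxplot request on toy data. The column name `age`
#and the plot title are placeholders. Mapping fill to the age factor
#gives each box its own color, and labs(fill = ...) sets the legend
#title.

```r
library(ggplot2)

# Toy data; the column name `age` is hypothetical
age_levels <- c("Under 1", "1 to 17", "18 to 24")
gcall <- data.frame(
  age     = factor(rep(age_levels, each = 4), levels = age_levels),
  Charges = c(100, 150, 200, 900,
              200, 250, 300, NA,
              400, 500, 650, 2000)
)

p <- ggplot(gcall, aes(x = age, y = Charges, fill = age)) +
  geom_boxplot(na.rm = TRUE) +   # na.rm = TRUE drops NAs without a warning
  labs(title = "Medical Charges by Age Group",
       x = "Age Group",
       y = "Medical Charges",
       fill = "Age Groups")

print(p)
```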
#################################################
#IV. Inference, Test 1
#Create an ANOVA object for comparing the mean
#charges for different methods of payment.
#Summarize the model object, so that you can
#see the p-value.
#To go with the analysis, create a plot of several
#boxplots showing charges by payment method groups,
#so that each boxplot is a different color.
#Add a title, and change the y axis and legend labels.
#Summarize your results, including the p-value,
#commenting on whether this suggests a
#difference in the mean charges for different
#payment methods.
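#A minimal one-way ANOVA sketch with simulated charges. The grouping
#column `payment` and its categories are hypothetical stand-ins for
#whatever the payment file actually contains.

```r
set.seed(1)

# Simulated data; `payment` is a hypothetical column name
gcall <- data.frame(
  payment = rep(c("Medicaid", "Private", "Self-pay"), each = 10),
  Charges = c(rnorm(10, mean = 500, sd = 100),
              rnorm(10, mean = 800, sd = 100),
              rnorm(10, mean = 650, sd = 100))
)

# Fit the one-way ANOVA and print the table with the F test
fit <- aov(Charges ~ payment, data = gcall)
summary(fit)

# Pull the p-value out of the summary table for use in a write-up
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```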
#################################################
#V. Inference, Test 2
#Use ggplot to create a density plot of charges
#by Sex, making use of facets and color.
#Describe the distribution shapes,
#and suggest a reason for the difference:
#Why might charges for females look this way,
#and charges for males look this way?
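#A sketch of a faceted density plot on simulated, right-skewed data.
#The "F"/"M" coding of Sex and the title are assumptions; the shapes
#here say nothing about the real charges.

```r
library(ggplot2)
set.seed(3)

# Simulated right-skewed charges; the real distributions may differ
gcall <- data.frame(
  Sex     = rep(c("F", "M"), each = 50),
  Charges = c(rexp(50, rate = 1 / 600), rexp(50, rate = 1 / 400))
)

p <- ggplot(gcall, aes(x = Charges, fill = Sex)) +
  geom_density(alpha = 0.5, na.rm = TRUE) +
  facet_wrap(~ Sex) +            # one panel per sex
  labs(title = "Distribution of Charges by Sex",
       x = "Medical Charges",
       y = "Density")

print(p)
```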
#Run an appropriate parametric statistical test
#(see hint at beginning of script)
#to determine if the mean charges are different
#for males versus females. Also find the corresponding
#(parametric) 95% CI for the difference between mean charges.
#State your CI in a sentence in terms of the problem:
#"I'm 95% sure that ..."
#Comment on the results of your statistical test.
#Is the difference statistically significant?
#What can you conclude about the true difference?
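#One candidate for the parametric test is a two-sample t-test; this
#is a sketch on simulated vectors, not a statement of which test the
#assignment expects. t.test() runs the Welch version by default,
#drops NA values, and reports a 95% CI for mean(x) - mean(y).

```r
set.seed(2)

# Simulated stand-ins for mcharges and fcharges, each with one NA
mcharges <- c(rnorm(30, mean = 600, sd = 150), NA)
fcharges <- c(rnorm(40, mean = 700, sd = 200), NA)

# Welch two-sample t-test (var.equal = FALSE is the default)
tt <- t.test(mcharges, fcharges)

tt$p.value    # p-value for H0: the two means are equal
tt$conf.int   # 95% CI for the male mean minus the female mean
```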
#################################################
#VI. Writing Functions
#Put your code here for a function making a 95% Bootstrap
#Confidence interval for one mean. Run your function,
#using the data on Charges. Be sure that the function
#prints the point estimate (the observed mean), and the
#CI with a description: "I'm 95% sure that ..."
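#A hedged sketch of a one-mean bootstrap CI function. It uses the
#percentile method, which is an assumption about the approach used in
#class. The function name boot_mean_ci and the argument n_boot are
#illustrative. It expects at least two non-missing values.

```r
boot_mean_ci <- function(x, n_boot = 10000) {
  x <- x[!is.na(x)]                      # drop missing values
  obs_mean <- mean(x)                    # point estimate

  # Resample with replacement and record each bootstrap mean
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

  # Percentile interval: middle 95% of the bootstrap means
  ci <- unname(quantile(boot_means, c(0.025, 0.975)))

  cat("Observed mean:", round(obs_mean, 2), "\n")
  cat("I'm 95% sure that the true mean is between",
      round(ci[1], 2), "and", round(ci[2], 2), "\n")

  invisible(list(estimate = obs_mean, ci = ci))
}

# Example call on a toy vector with one missing value
set.seed(4)
res <- boot_mean_ci(c(1, 2, 3, 4, 5, NA), n_boot = 2000)
```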
#Using your CI function above as a start, create
#a NEW function that will find a 95% Bootstrap CI
#for the *Difference Between Two Means*.
#The user will provide two vectors with quantitative
#data (which may have missing values), and
#the function will print the observed means,
#the observed difference, and a CI for the difference
#along with descriptive text: "I'm 95% confident ..."
#The procedure:
# First, remove missing values from each vector.
# Next, calculate the means, and the observed difference
# Take a bootstrap sample from each vector, separately,
# and find the difference.
# Repeat many times, and accumulate the differences in
# a vector. Calculate the CI, as we have in other functions.
#Once you have your function, apply it to
#the vectors mcharges and fcharges, that you
#created in part III above.
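#The procedure above can be sketched as follows. As with the one-mean
#version, the percentile interval is an assumed choice, and the names
#boot_diff_ci and n_boot are illustrative.

```r
boot_diff_ci <- function(x, y, n_boot = 10000) {
  # Step 1: remove missing values from each vector
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]

  # Step 2: observed means and observed difference
  mean_x <- mean(x)
  mean_y <- mean(y)
  obs_diff <- mean_x - mean_y

  # Steps 3-4: resample each vector separately, many times,
  # and accumulate the differences in means
  boot_diffs <- replicate(
    n_boot,
    mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE))
  )

  # Step 5: percentile interval from the bootstrap differences
  ci <- unname(quantile(boot_diffs, c(0.025, 0.975)))

  cat("Observed means:", round(mean_x, 2), "and", round(mean_y, 2), "\n")
  cat("Observed difference:", round(obs_diff, 2), "\n")
  cat("I'm 95% confident that the true difference in means is between",
      round(ci[1], 2), "and", round(ci[2], 2), "\n")

  invisible(list(difference = obs_diff, ci = ci))
}

# Example call on toy vectors; in the script this would be
# boot_diff_ci(mcharges, fcharges)
set.seed(5)
res <- boot_diff_ci(c(10, 12, 14, NA), c(1, 2, 3), n_boot = 2000)
```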
#################################################
#Finally, put your code in an Rmd script.
#The script should Knit successfully, to
#produce a good-looking html file,
#that shows all of the requested R code,
#results, and text.
#Your html file should begin with text briefly
#describing the data set.
#Your Rmd script should have six code chunks, each
#named as noted above. Each code chunk should have
#text: a title, and description of results,
#where requested above.
#Do include all setup code, and make sure all code chunks
#are printed in the final html document.
#Prevent all of the notifications after loading ggplot2
#and dplyr.
#Prevent the following from showing for plots:
#*Removed 6 rows containing non-finite values (stat_boxplot).*
#(It is ok if this message stays on the ANOVA output:
#*6 observations deleted due to missingness*)
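#One way to quiet the output, sketched as R code for a setup chunk.
#It assumes the knitr package is available, which Knit already
#requires. message = FALSE hides package start-up notes and
#warning = FALSE hides plot warnings, while echo = TRUE keeps the
#code visible. The ANOVA missingness note is part of the printed
#summary, so it should be unaffected.

```r
# Global chunk defaults: show code, hide messages and warnings
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Confirm the options were registered
knitr::opts_chunk$get("message")
knitr::opts_chunk$get("warning")
```

#The same options can also be set on a single chunk instead of
#globally, and geom_boxplot(na.rm = TRUE) avoids the plot warning
#at its source.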