Instructions:
- Fill in your name and UWNetID above.
- Put answers to the questions on this document, using the 00Answers Word style so your answers are clearly distinguished from the questions.
- Create a PDF file from this document.
- Create a single zip file including this document as a PDF file, along with the RDS file and R code file.
- Upload the single zip file to Canvas.
Explanation:
For this assignment, you will be perusing some of the documentation for the Add Health Wave 1 data set. You will use the documentation to make some updates to a data frame containing some of the Add Health data, and then save the data frame as an RDS file. You will update a metadata table that partially describes the data set and changes you made to the variable names and variable labels.
To open a Stata version 13 file in R there are two main options:
- Use haven::read_dta(). To access variable labels in R, use labelled::foreign_to_labelled(). To update variable labels, use labelled::var_label().
- Use readstata13::read.dta13(). Variable labels for this format are available, e.g., for a data frame named dat, as attributes(dat)$var.labels. This is a vector of text strings that can be updated by assigning a new value to the specified element, e.g., attributes(dat)$var.labels[1] <- "foo".
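The second option can be tried without the Stata file. The following is a minimal sketch, assuming a toy data frame named dat whose var.labels attribute is attached by hand to mimic the structure described above; the identifiers and placeholder labels are invented for illustration.

```r
# Toy stand-in for a data frame returned by readstata13::read.dta13().
# The var.labels attribute is attached by hand here (an assumption for
# illustration); values and labels are invented.
dat <- data.frame(aid = c("00000001", "00000002"), h1gi1m = c(10L, 11L),
                  stringsAsFactors = FALSE)
attr(dat, "var.labels") <- c("placeholder label for aid",
                             "placeholder label for h1gi1m")

# Look up the position by name rather than hard-coding an index,
# so the label stays matched to the right column.
pos <- match("h1gi1m", colnames(dat))
attr(dat, "var.labels")[pos] <- "birth month"

# Show variable names next to their labels.
data.frame(variable = colnames(dat), label = attr(dat, "var.labels"))
```

Finding the position with match() avoids relabeling the wrong variable if the column order differs from what you expect.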
To save the RDS file, use the base function saveRDS().
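A short sketch of the save step, assuming a toy data frame and a temporary file path (both are stand-ins, not the assignment's data or file name). Reading the file back with readRDS() is a quick way to check that the variable labels survived.

```r
# Toy data frame standing in for the edited Add Health data (values invented).
dat <- data.frame(aid = c("00000001", "00000002"), bmonth = c(10L, 11L),
                  stringsAsFactors = FALSE)
attr(dat, "var.labels") <- c("unique case (student) identifier", "birth month")

# Write to a temporary file; in the assignment this would be a path you choose.
rds_file <- tempfile(fileext = ".rds")
saveRDS(dat, file = rds_file)

# Read the file back and compare the labels with the originals.
dat2 <- readRDS(rds_file)
identical(attr(dat, "var.labels"), attr(dat2, "var.labels"))
```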
Here is a base R code snippet that will rename a single variable:
colnames(data_frame)[grep("^original_variable_name$", colnames(data_frame))] <- "new_variable_name"
The grep() function finds the position of the named variable in the list of variables in the data frame. The characters ^ and $ are regular expression anchors that mark the start and end of the string to be matched, ensuring that the pattern does not also match similar, longer variable names.
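To see why the anchors matter, here is a small sketch on a toy data frame. The column h1gi6a_flag is hypothetical and exists only to create a second, similar name.

```r
# Toy data frame with two similarly named columns (h1gi6a_flag is hypothetical).
data_frame <- data.frame(h1gi6a = c(0, 1), h1gi6a_flag = c(1, 0))

# Without anchors the pattern matches both columns.
grep("h1gi6a", colnames(data_frame))    # positions 1 and 2
# With ^ and $ it matches only the exact name.
grep("^h1gi6a$", colnames(data_frame))  # position 1 only

# Rename only the exact match.
colnames(data_frame)[grep("^h1gi6a$", colnames(data_frame))] <- "white"
colnames(data_frame)
```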
It is much simpler with tidyverse and magrittr:
data_frame %<>% rename(new_variable_name = old_variable_name)
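When several variables need new names, a lookup vector keeps the old-to-new mapping in one place. This is a sketch in base R on a toy data frame; the object name lookup and the data values are illustrative assumptions.

```r
# Toy data frame using a few original names from Table 1 (values invented).
data_frame <- data.frame(aid = "00000001", h1gi1m = 10L, h1gi4 = 0L,
                         stringsAsFactors = FALSE)

# Named vector: names are the original variable names, values the new ones.
lookup <- c(h1gi1m = "bmonth", h1gi4 = "hispanic")

# match() compares whole strings, so no regular expression anchors are needed.
pos <- match(names(lookup), colnames(data_frame))
stopifnot(!anyNA(pos))  # fail early if an original name is not in the data

colnames(data_frame)[pos] <- unname(lookup)
colnames(data_frame)
```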
Additional hint for dealing with PDF documentation:
- Use pdfgrep (should be available in a Linux or Mac package manager; for Windows, search for a version or use Cygwin).
- Use the R pdftools package. This could be used in a loop over each PDF file to create a data frame with the name of the PDF file, the page number, and the text of each page. The str_match() function could then be used to identify the file name and page number where specific text strings occur. As a minimal example, the following shows that the string h1gi1m is found on page 1 of INH01PUB.PDF. Converting the PDF text to lowercase simplifies the matching:
> x <- pdftools::pdf_text(pdf = "INH01PUB.PDF")
> str_match(string = x %>% str_to_lower(), pattern = "h1gi1m")
[,1]
[1,] "h1gi1m"
[2,] NA
[3,] NA
[4,] NA
[5,] NA
[6,] NA
[7,] NA
[8,] NA
[9,] NA
[10,] NA
[11,] NA
[12,] NA
[13,] NA
[14,] NA
[15,] NA
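The loop over PDF files can be sketched as follows. To keep the example runnable without the documentation files, the list pages_by_file stands in for the output of pdftools::pdf_text() (one character string per page); its page text is invented, and the object names are illustrative.

```r
# Stand-in for pdftools::pdf_text() output: one character string per page.
# File names come from Table 1; the page text is invented for illustration.
pages_by_file <- list(
  "INH01PUB.PDF" = c("invented text mentioning H1GI1M", "invented text mentioning H1GI4"),
  "SECTAPUB.PDF" = c("invented text mentioning AID", "invented text mentioning IMONTH")
)

# One row per page: file name, page number, and lowercased page text.
pdf_pages <- do.call(rbind, lapply(names(pages_by_file), function(f) {
  txt <- pages_by_file[[f]]
  data.frame(file = f, page = seq_along(txt), text = tolower(txt),
             stringsAsFactors = FALSE)
}))

# fixed = TRUE treats the search string literally, not as a regular expression.
hits <- pdf_pages[grepl("h1gi1m", pdf_pages$text, fixed = TRUE), c("file", "page")]
hits
```

With the real documentation, the list could be built by calling pdftools::pdf_text() on each file path returned by list.files(), assuming the PDF files sit in one folder.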
Questions:
- Explore the Add Health website (http://www.cpc.unc.edu/projects/addhealth) and answer the following questions (making sure to cite as necessary):
- What was the sampling frame for this study?
The sampling frame for the Add Health study was all high schools included in the Quality Education Database (QED). A high school was defined as a school with an 11th grade and more than 30 students.
- What were the three kinds of respondents at Wave I?
- What was the instrument with the largest sample size?
- Is it possible for a respondent to be in Wave III without being in Wave II?
- What is the time span of the Add Health data collection (all waves)?
- What is the difference between the public and the restricted-use Add Health data?
- Describe a research question that you might be able to answer using the Add Health dataset.
- Download the public-use Add Health documentation at https://canvas.uw.edu/courses/1434040/files. Answer the following questions:
- In what pdf document is the documentation for the race items for the Wave I In-Home questionnaire?
- How many respondents were of Hispanic/Latino origin?
- What is the Knowledge Quiz in the Wave I In-Home questionnaire?
- What is the unique identifier for the In-home data?
- Download the Stata 13 format file AHwave1_v1.dta (http://staff.washington.edu/phurvitz/csde502_winter_2021/data/AHwave1_v1.dta).
- Fill in the grey missing cells in Table 1 below based on the data and/or documentation. Optimally, use the documentation to familiarize yourself with the structure of the code books.
- Using questions 6 and 8 in INH01PUB.PDF, create a new variable named race that uses recoded values (white = 1; black/African American = 2; American Indian = 3; Asian/Pacific Islander = 4; other = 5; unknown/missing = 9).
- Rename the variables and update the variable labels using Table 1 as a guide, then save the data frame as an RDS file. Use a single R code file for your edits to the data file.
- Update the status in Table 1 as needed.
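One possible way to approach the race recoding step, shown on invented data. The rule used here (take the single marked category from question 6; when more than one is marked, fall back to question 8; everything else becomes 9) and the assumption that question 8 uses codes 1 to 5 for the five target categories should both be checked against INH01PUB.PDF. This is a sketch, not the required solution.

```r
# Invented responses; column names follow the new names in Table 1.
dat <- data.frame(
  white = c(1, 0, 1, 6, 0), black = c(0, 1, 1, 6, 0), AI = c(0, 0, 0, 6, 0),
  asian = c(0, 0, 0, 6, 0), raceother = c(0, 0, 0, 6, 0),
  onerace = c(7, 7, 2, 6, 7)  # assumed: 1-5 = category, other codes = no answer
)

race_cols <- c("white", "black", "AI", "asian", "raceother")
marked <- (dat[race_cols] == 1) * 1   # numeric matrix: 1 where a box was marked
n_marked <- rowSums(marked)

race <- rep(9, nrow(dat))             # default: unknown/missing
# Exactly one box marked: the weighted sum equals that box's position (1 to 5).
one <- n_marked == 1
race[one] <- drop(marked %*% seq_along(race_cols))[one]
# More than one box marked: use question 8 when it holds a valid category.
multi <- n_marked > 1 & dat$onerace %in% 1:5
race[multi] <- dat$onerace[multi]

dat$race <- as.integer(race)
dat$race  # 1 2 2 9 9 for the invented rows above
```

The toy data contain no NA values; real data may need explicit handling of missing codes before the comparison step.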
Table 1: Codebook for variables from Add Health Wave 1 data
newvariablename | originalvariablename | status* | datatype | values | newvariablelabel | codebookfilename |
aid | aid | unchanged | text | 8 digit string | unique case (student) identifier | SECTAPUB.PDF |
imonth | imonth | unchanged | integer | 14 to 12 | month interview completed | SECTAPUB.PDF |
iday | iday | unchanged | | | day interview completed | SECTAPUB.PDF |
iyear | iyear | unchanged | | 94, 95 | | SECTAPUB.PDF |
bio_sex | bio_sex | | | | interviewer confirmed sex | |
bmonth | h1gi1m | | | | birth month | INH01PUB.PDF |
byear | h1gi1y | | | | birth year | |
hispanic | h1gi4 | renamed | | | Hispanic/Latino | INH01PUB.PDF |
white | h1gi6a | renamed | | 0 = not marked; 1 = marked; 6 = refused; 8 = don't know | race white | INH01PUB.PDF |
black | | | | | race black or African American | INH01PUB.PDF |
AI | h1gi6c | | | | race American Indian or Native American | INH01PUB.PDF |
asian | h1gi6d | | | | race Asian or Pacific Islander | INH01PUB.PDF |
raceother | h1gi6e | | | | race other | INH01PUB.PDF |
onerace | | | | | one category best describes racial background | INH01PUB.PDF |
observedrace | h1gi9 | | | | interviewer observed race | INH01PUB.PDF |
health | h1gh1 | | | | how is your health | |
race | not applicable | | | | race recoded as white; black/African American; American Indian; Asian/Pacific Islander; other; unknown/missing | |
*status categories: unchanged, renamed, missing defined, derived