In this assignment, you'll begin exploring relationships in data by computing some basic statistical measures on one of three datasets. This is a good time to learn or refresh your Python coding skills.
Step 1: Select one of the datasets to complete this assignment:
- [mental-health-in-tech-survey.csv] Mental Health in Tech Survey: Survey on Mental Health in the Tech Workplace in 2014 https://osmihelp.org/research/
Dependent Variables:
- treatment: Have you sought treatment for a mental health condition? (Yes/No)
- mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences? (Yes/Maybe/No)
- phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences? (Yes/Maybe/No)
- [diabetic_data.csv] Diabetes 130 US hospitals for years 1999-2008: Diabetes readmission https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Dependent Variables:
- time_in_hospital: a numeric value representing number of days between admission and discharge
- readmitted: Days to inpatient readmission: <30 if the patient was readmitted in less than 30 days, >30 if the patient was readmitted in more than 30 days, and No for no record of readmission.
- [compas-scores-two-years.csv] COMPAS Recidivism Racial Bias: Racial Bias in inmate COMPAS reoffense risk scores for Florida (ProPublica) https://github.com/propublica/compasanalysis
Dependent Variables:
- decile_score: a numeric value between 1 and 10 corresponding to the recidivism risk score generated by COMPAS software (a small number corresponds to a low risk, a larger number corresponds to a high risk).
- two_year_recid: a numeric indicator of whether the defendant recidivated two years after previous charge (0: no, did not recidivate, 1: yes, did recidivate)
Step 2: Explore the data by answering the following questions:
- Which dataset did you select?
- How many observations are in the dataset?
- How many variables are in the dataset?
- Does this dataset seem to belong to a regulated domain in law as discussed in the lectures? If yes, which one?
- How many variables in the dataset are associated with a legally recognized protected class? In a table, list the variables associated with a protected class, and for each one identify the protected class and the associated legal precedent/law as discussed in the lectures.
Example Output (associated with a different dataset)
Dataset: Housing Decisions in Metro-Atlanta
Number of Observations: 1,400
Number of Variables: 16
Regulated Domain in Law: Housing (Fair Housing Act)
Number of Protected Class Variables: 2
Variable | Protected Class | Law |
nationality | National origin | Civil Rights Act of 1964, 1991 |
pregnant (y/n) | Pregnancy | Pregnancy Discrimination Act |
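The observation and variable counts asked for in Step 2 can be read straight from the CSV file. The following is a minimal sketch using only the Python standard library; the helper name `summarize` and the three-row sample are assumptions made for illustration and are not part of any of the three datasets.

```python
import csv
import io

def summarize(handle):
    """Return (number of observations, number of variables, column names)."""
    reader = csv.DictReader(handle)
    n_rows = sum(1 for _ in reader)      # each data row is one observation
    columns = reader.fieldnames or []    # header names; empty if the file is empty
    return n_rows, len(columns), columns

# Hypothetical three-row sample standing in for a real dataset file.
sample = io.StringIO("age,pregnant,decision\n34,N,Y\n29,Y,N\n41,N,Y\n")
n_obs, n_vars, columns = summarize(sample)
print(f"Number of Observations: {n_obs:,}")
print(f"Number of Variables: {n_vars}")
```

To run it on a real file, pass an open file handle instead of the sample, for example `with open("compas-scores-two-years.csv", newline="") as f: summarize(f)`.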
Step 3: Determine the relationships between dependent and independent variables
The frequency of a value represents the number of times a value occurs in a data set. Compute the frequency of each value associated with each dependent variable (listed in Step 1) as a function of all of the protected class variables (independent variables) identified in Step 2. Create histogram(s) comparing the frequency values of the dependent variable as a function of the independent variable. Hint: For variables that are continuous, you might consider creating intervals that represent the data. For categorical/ordinal/nominal values, you might consider converting to numerical values.
Example Output for One Dependent-Independent Variable Combination:
Independent Variable (Protected Class Variable) | Dependent Variable (Housing Decision, Y/N) |
Pregnant: Y | Frequency of Y: 50; Frequency of N: 120 |
Pregnant: N | Frequency of Y: 130; Frequency of N: 20 |
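A frequency table like the example above can be sketched with `collections.Counter`. The rows below are rebuilt from the example counts; the column names `pregnant` and `decision` and the helper `frequency_table` are assumptions for illustration. The text bars are only a rough stand-in for a histogram, not a substitute for a proper plot.

```python
from collections import Counter

def frequency_table(rows, independent, dependent):
    """Count dependent-variable values within each independent-variable group."""
    counts = Counter((row[independent], row[dependent]) for row in rows)
    table = {}
    for (group, outcome), n in counts.items():
        table.setdefault(group, {})[outcome] = n
    return table

# Rows rebuilt from the example counts (50, 120, 130, 20).
rows = (
    [{"pregnant": "Y", "decision": "Y"}] * 50
    + [{"pregnant": "Y", "decision": "N"}] * 120
    + [{"pregnant": "N", "decision": "Y"}] * 130
    + [{"pregnant": "N", "decision": "N"}] * 20
)

table = frequency_table(rows, "pregnant", "decision")
for group, outcomes in sorted(table.items()):
    for outcome, n in sorted(outcomes.items()):
        # One '#' per 10 observations gives a quick text histogram.
        print(f"pregnant={group} decision={outcome} {n:4d} {'#' * (n // 10)}")
```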
Step 4: Show how data can be manipulated
Select one protected class variable (independent variable) and one dependent variable. 1) Create a graph to support the fairness hypothesis: The system is fair. There is no difference in the outcomes. 2) Create a graph to support the bias hypothesis: The system is biased. There is a difference in the outcomes. For each, provide a brief description of your manipulations.
Example Output:
- Fair Hypothesis: As seen from this graph, housing decisions are not dependent on the pregnancy status of women. [Manipulations: Used line graph; Increased Scale to ±50; Mapped the ratio of positive Y decisions (i.e., 50/180 versus 130/180); No label on the Y-Axis].
Difference in Housing Decisions Based on Pregnancy
- Bias Hypothesis: As seen from this graph, housing decisions are significantly dependent on the pregnancy status of women. [This hypothesis was easily supported by the data, so it did not require much manipulation: Used stacked bar graph; Reduced Scale; Reworded labels].
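The ratios quoted in the fair-hypothesis example (50/180 versus 130/180) divide by the total number of positive decisions. The sketch below, which reuses the example counts from Step 3, shows how the choice of denominator changes the numbers a graph would display. The variable names are illustrative assumptions.

```python
# Example counts from the housing table: counts[pregnant][decision].
counts = {"Y": {"Y": 50, "N": 120}, "N": {"Y": 130, "N": 20}}

# Denominator 1: all positive decisions (50 + 130 = 180).
total_positive = sum(group["Y"] for group in counts.values())
share_of_positives = {g: c["Y"] / total_positive for g, c in counts.items()}

# Denominator 2: the size of each group (170 and 150).
rate_within_group = {g: c["Y"] / (c["Y"] + c["N"]) for g, c in counts.items()}

for g in counts:
    print(
        f"pregnant={g}: share of positives {share_of_positives[g]:.3f}, "
        f"within-group rate {rate_within_group[g]:.3f}"
    )
```

Whichever quantity is plotted, stating the denominator in your description of the manipulation makes the graph easier to evaluate.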
Step 5: Given your selected protected class variable (independent variable), calculate the average (mean, median, and mode) values of the protected class group (Hint: Variables might need to be converted to numerical values as needed). Run the random sampling method using 50% of the data to create a reduced dataset. Calculate the average (mean, median, and mode) values of the protected class group. Indicate if there is a difference (or not) between the original dataset and the reduced dataset for any of the averages. Provide all results.
Protected Class Variable (Pregnant) | Mean | Median | Mode |
Original Data Set | 0 (NO) | 0 (NO) | 0 (NO) |
Reduced Data Set | 0 (NO) | 1 (YES) | 0 (NO) |
Difference | No Difference | Difference | No Difference |
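The Step 5 comparison can be sketched with the standard `statistics` and `random` modules. The 0/1 values below are hypothetical (made up for illustration, not taken from any dataset), and the seed is an arbitrary choice so that the 50% sample is repeatable.

```python
import random
import statistics

def averages(values):
    """Mean, median, and mode of a numerically encoded variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # With ties, Python 3.8+ returns the first mode encountered.
        "mode": statistics.mode(values),
    }

# Hypothetical encoding: 1 = YES, 0 = NO (12 NO and 8 YES, illustration only).
values = [0] * 12 + [1] * 8

# Draw 50% of the observations without replacement.
rng = random.Random(0)
reduced = rng.sample(values, len(values) // 2)

original_avg = averages(values)
reduced_avg = averages(reduced)
for name in original_avg:
    verdict = "Difference" if original_avg[name] != reduced_avg[name] else "No Difference"
    print(f"{name}: original={original_avg[name]}, reduced={reduced_avg[name]} -> {verdict}")
```

Because the sample is random, a different seed can produce a different reduced dataset and therefore different verdicts.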
Step 6: Given your reduced dataset from Step 5, repeat Step 3 (frequency and histogram) using your selected independent variable as a function of your selected dependent variable (from Step 4). Explain any differences (in no more than 2 sentences). If the random sampling method were used, would members associated with the protected class variable benefit or be harmed? Explain your reasoning (in no more than 2 sentences).
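One way to compare the original and reduced datasets is to put the share of positive outcomes per group side by side. This is a minimal sketch under the same assumptions as the housing example: rows are rebuilt from the example counts, and the column names, the helper `positive_rate`, and the seed are all illustrative.

```python
import random
from collections import Counter

def positive_rate(rows):
    """Share of decision == 'Y' within each pregnant group."""
    totals = Counter(row["pregnant"] for row in rows)
    positives = Counter(row["pregnant"] for row in rows if row["decision"] == "Y")
    return {group: positives[group] / totals[group] for group in totals}

# Rows rebuilt from the housing example counts (50, 120, 130, 20).
rows = (
    [{"pregnant": "Y", "decision": "Y"}] * 50
    + [{"pregnant": "Y", "decision": "N"}] * 120
    + [{"pregnant": "N", "decision": "Y"}] * 130
    + [{"pregnant": "N", "decision": "N"}] * 20
)

# 50% random sample without replacement; the seed is arbitrary.
rng = random.Random(42)
reduced = rng.sample(rows, len(rows) // 2)

original_rates = positive_rate(rows)
reduced_rates = positive_rate(reduced)
for group in sorted(original_rates):
    # A group could be missing from a very small sample, hence .get().
    print(f"pregnant={group}: original={original_rates[group]:.3f}, reduced={reduced_rates.get(group)}")
```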
Step 7: Turn in a report documenting your outputs.