For this assignment, you will answer the questions below using the dataset provided and submit a 2-page report presenting the results of your analysis. Present your results visually. The choice of format (tables, plots, etc.) is up to you, but it should make the results of your analysis clear and obvious. In your report, explain the methods you used to answer each research question and why they were appropriate for the data and the question. You must interpret your final results in the context of the dataset for your problem.
Dataset:
Kaggle hosted an open data science competition in 2020 titled Kaggle ML & DS Survey Challenge. The purpose of this challenge was to tell a data story about a subset of the data science community represented in this survey, through a combination of narrative text and data exploration. More information on the competition, data, and prizes can be found at: https://www.kaggle.com/c/kaggle-survey-2020/data
The dataset provided (kaggle_survey_2020_responses.csv) contains the survey results provided by Kaggle. The survey results from 20036 participants are shown in 355 columns, representing survey questions. Not all questions are answered by each participant, and responses contain various data types.
In the dataset for Assignment 1, column Q24 ("What is your current yearly compensation (approximate $USD)?") contains a numerical target variable. Rows with null salaries have been dropped (please refer to clean_kaggle_data.csv). You should work with the clean dataset for this assignment.
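As a starting point, the clean file could be loaded along these lines. This is only a sketch: the helper name `load_clean_survey` is hypothetical, it assumes pandas is available, and it assumes the clean file has a single header row with columns named `Q2`, `Q4`, and `Q24`. The inline CSV is a made-up stand-in, not survey data.

```python
import io

import pandas as pd


def load_clean_survey(source):
    """Read the cleaned survey and return it with Q24 as a numeric column.

    `source` can be a file path (e.g. "clean_kaggle_data.csv") or a
    file-like object. The column names are assumptions about the file.
    """
    df = pd.read_csv(source, low_memory=False)
    # Coerce salaries to numbers; anything unparsable becomes NaN.
    df["Q24"] = pd.to_numeric(df["Q24"], errors="coerce")
    # The clean file should have no null salaries, so this is a safety net.
    return df.dropna(subset=["Q24"]).reset_index(drop=True)


# Tiny made-up stand-in for clean_kaggle_data.csv (NOT real survey rows).
demo_csv = io.StringIO(
    "Q2,Q4,Q24\n"
    "Man,Master's degree,50000\n"
    "Woman,Doctoral degree,\n"
    "Woman,Bachelor's degree,30000\n"
)
demo = load_clean_survey(demo_csv)

# Per-group descriptive statistics for the salary column.
print(demo.groupby("Q2")["Q24"].describe())
```

With the real file, the call would be `load_clean_survey("clean_kaggle_data.csv")`, provided the column names match.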
Questions:
The objectives of this assignment are to explore the survey data to understand (1) the nature of women's representation in Data Science and Machine Learning and (2) the effects of education on income level. The following tasks should be completed:
- Perform exploratory data analysis to analyze the survey dataset and to summarize its main characteristics. Present 3 graphical figures that represent different trends in the data. For your exploratory data analysis, you can consider Country, Age, Education, Professional Experience, and Salary.
- Estimate the difference between the average salary (Q24) of men and women (Q2):
  a. Compute and report descriptive statistics for each group (remove missing data, if necessary).
  b. If suitable, perform a two-sample t-test with a 0.05 threshold. Explain your rationale.
  c. Bootstrap your data to compare the mean salary (Q24) of the two groups. Note that the number of instances you sample from each group should be relative to its size. Use 1000 replications. Plot the two bootstrapped distributions (for men and women) and the distribution of the difference in means.
  d. If suitable, perform a two-sample t-test with a 0.05 threshold on the bootstrapped data. Explain your rationale.
  e. Comment on your findings.
- Select the highest level of formal education (Q4) from the dataset and repeat steps a to e, this time using analysis of variance (ANOVA) instead of the t-test for hypothesis testing to compare the mean salary across three groups (Bachelor's degree, Doctoral degree, and Master's degree) [75pts for a; 0.5 pts for b; 2pts for c; 0.75 pts for d; 1pt for e].
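
The bootstrap and hypothesis-testing steps in the tasks above can be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed solution. The helper `bootstrap_means` and its `frac` parameter are hypothetical. "Relative to its size" is read here as drawing a fixed fraction of each group with replacement. The salaries are synthetic log-normal draws, not survey data. The t-test uses Welch's unequal-variance variant as one possible choice. NumPy and SciPy are assumed to be installed.

```python
import numpy as np
from scipy import stats


def bootstrap_means(values, n_reps=1000, frac=1.0, rng=None):
    """Return n_reps bootstrap means of `values`.

    Each replication draws round(frac * len(values)) observations with
    replacement, so the sample size scales with the group size.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    n_draw = max(1, int(round(frac * len(values))))
    return np.array([
        rng.choice(values, size=n_draw, replace=True).mean()
        for _ in range(n_reps)
    ])


# Synthetic, right-skewed stand-ins for Q24 salaries (NOT survey data).
rng = np.random.default_rng(0)
men = rng.lognormal(mean=10.8, sigma=1.0, size=800)
women = rng.lognormal(mean=10.6, sigma=1.0, size=200)

# Two-sample t-test on the raw groups (Welch's variant: unequal variances).
t_raw = stats.ttest_ind(men, women, equal_var=False)

# 1000 bootstrap means per group, then the distribution of their difference.
boot_men = bootstrap_means(men, rng=rng)
boot_women = bootstrap_means(women, rng=rng)
boot_diff = boot_men - boot_women

# Percentile interval for the difference in means.
ci_low, ci_high = np.percentile(boot_diff, [2.5, 97.5])

# Two-sample t-test on the bootstrapped means.
t_boot = stats.ttest_ind(boot_men, boot_women, equal_var=False)

# Education comparison: one-way ANOVA across three synthetic groups.
bachelors = rng.lognormal(10.5, 1.0, 400)
masters = rng.lognormal(10.7, 1.0, 500)
doctoral = rng.lognormal(10.9, 1.0, 150)
anova_raw = stats.f_oneway(bachelors, masters, doctoral)
anova_boot = stats.f_oneway(
    *(bootstrap_means(g, rng=rng) for g in (bachelors, masters, doctoral))
)

print(f"raw t-test p-value:        {t_raw.pvalue:.4g}")
print(f"bootstrap t-test p-value:  {t_boot.pvalue:.4g}")
print(f"95% interval for the diff: [{ci_low:.0f}, {ci_high:.0f}]")
print(f"raw ANOVA p-value:         {anova_raw.pvalue:.4g}")
print(f"bootstrap ANOVA p-value:   {anova_boot.pvalue:.4g}")
```

The arrays `boot_men`, `boot_women`, and `boot_diff` can be passed to a histogram function such as matplotlib's `plt.hist` to produce the requested plots. Whether a t-test or ANOVA is suitable for the raw salaries still has to be argued from the data, for example from the skew visible in the descriptive statistics.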