1. Introduction
In this homework you need to practice some basic usages of MySQL, including creating databases, creating tables, loading csv files, loading SQL files, and using MySQL command to find the answer of tasks. After this homework, you will be capable of querying and analyzing your data by MySQL from zero to one.
There will be two datasets for this homework. The first one is COVID-19 data in South Korea, and the second one is European Soccer Dataset, both of these are downloaded from Kaggle Dataset (feel free to google them).
For the first dataset, you need to create a database based on our setting, and load the csv files into your created database. There are 6 easier questions you need to solve by SQL. For the second dataset, you need to load our sql file directly. There are 6 advanced problems you will meet. Read the following content for more details.
2. Tasks Part 1 Data Science for COVID-19 in South Korea
In this part, you need to create tables based on the provided DB schema, and load csv files into the database.
A. Create Tables
First, download the COVID-19 data from here.
You can refer to this page to see the meaning of the columns.
Then you should create tables based on the following setting. Notice that you must make the detail of your tables the same as our description, including
table name, attribute name, attribute type, primary key, foreign key, null.
Please paste the screenshot of your tables by using the `describe` command to your report, it will take 5% of your grades in this homework.
Table Name | Attribute Name | Type | Primary Key | Foreign Key | NULL |
patient_info | patient_id | varchar(10) | YES | NO | |
sex | varchar(10) | ||||
age | int |
province | varchar(20) | ||||
city | varchar(20) | ||||
infection_case | varchar(100) | ||||
search_trend | date | date | YES | NO | |
cold | float | ||||
flu | float | ||||
pneumonia | float | ||||
coronavirus | float | ||||
time | date | date | YES | NO | |
test | int | ||||
negative | int | ||||
confirmed | int | ||||
released | int | ||||
deceased | int | ||||
time_age | date | date | YES | NO | |
age | int | YES | NO | ||
confirmed | int | ||||
deceased | int | ||||
time_gender | date | date | YES | NO | |
sex | varchar(10) | YES | NO | ||
confirmed | int | ||||
deceased | int | ||||
time_province | date | date | YES | NO | |
province | varchar(20) | YES | NO | ||
confirmed | int | ||||
released | int | ||||
deceased | int | ||||
region | code | int | YES | NO | |
province | varchar(20) | ||||
city | varchar(20) | ||||
elementary_school_co unt | int | ||||
kindergarten_count | int | ||||
university_count | int | ||||
elderly_population_rat io | float | ||||
elderly_alone_ratio | float | ||||
nursing_home_count | int | ||||
weather | code | int | YES | region (code) | NO |
date | date | YES | NO | ||
avg_temp | float | ||||
most_wind_direction | int | ||||
avg_relative_humidity | float |
Please answer the following question in your report, it will take 10% of this homework.
- (3%) What is the difference between type char and type varchar?
- (3%) How many bytes it should take for tinyint, smallint, mediumint,
int? (e.g. 8 bytes for bigint)
And whats the range they can express? (e.g. from -1000 to 1000) tinyint, smallint, mediumint, int
(e.g. 8 bytes for bigint)
(e.g. from -1000 to 1000)
- (4%) What do you think about this DB schema? If you can change this table architecture, how would you modify it and why?
B. Load CSV Data
After creating the database, you need to load the downloaded csv files into your database.
Here we dont restrict the method you use, but you have to check the data is loaded successfully by yourself. The following number is the data records for each table.
Table Name | # of Data Records |
patient_info | 5164 |
search_trend | 1642 |
time | 163 |
time_age | 1089 |
time_gender | 242 |
time_province | 2771 |
region | 243 |
weather | 26271 |
C. Query Tasks
In this part, here are 6 query tasks you need to write. Please read the following rules carefully.
You are only allowed to use one query (one delimiter) to find the answer, and you dont have to explain your SQL. Noted that the column names of your query answers should be the same as our examples.
For homework submission, please write every query task into a single `sql` file, named as 1.sql, 2.sql, etc.
- (5%) How many days have the word cold been searched over 2000 times in one day? cold
cnt |
5566 |
- (5%) How many ways are there for those men under 30, living in Seoul Gangnam-gu being infected? List out in the alphabetical order as the following example.
infection_case |
eating |
talking |
sleeping |
- (5%) Find out the province, city, and the elementary school count in which the elementary school count is the top three most and the name of the province is different from the city. List out in decreasing order of the count.
province | city | cnt |
Apple | Taipei | 5566 |
Banana | Hsinchu | 3344 |
Cherry | Taichung | 1122 |
- (5%) Find out the provinces whose days count of the average relative humidity larger than 70 are the top three most in May of 2016, and list their days count in decreasing order.
- (5%) Find out the province where the elderly population ratio is larger than the average and the date with maximum confirmed in one day. List out in the order of date increasingly. Notice that the average of the elderly population ratio is the average of those provinces which have the same name as the city.
province | date |
Apple | 2020-01-01 |
Banana | 2020-02-02 |
- (5%) How many accumulated-confirmed, added-confirmed,
accumulated-dead, added-dead are there while the search number of the word coronavirus is larger than two standard deviations? List your answer in ascending order by date and round coronavirus to second decimal place. The standard deviation should be calculated by the period from 2019-12-15 to 2020-06-29.
The added-confirmed and the added-dead should be calculated by the accumulated count of that day minus the accumulated count of the previous day. coronavirus
date | coronavirus | confirmed_accumulate | confirmed_add | dead_accumulate | dead_add |
2020-01-01 | 66.00 | 3 | 0 | 12 | 0 |
2020-01-02 | 55.12 | 23 | 20 | 34 | 22 |
Part 2 European Soccer Database
In this Part, instead of creating tables and loading csv files, you need to load the provided DB file directly. Please download the file from here.
D. Load SQL File
Here we provide simple steps for the Linux environment. You can also load the SQL file by other methods, like execute the sql file using the source command.
- Firstly, create a database
- Then, back to your shell and enter command mysql -u {user_name} -p {DB Name} < hw1_part2.sql
(You can google IO Redirection for more detail of the above mechanism)
E. Query Tasks
In this part, you are also only allowed to use one query (one delimiter) to find the answer, and you dont have to explain your SQL, except task 11 and task 12. Noted that the column names of your query answers should be the same as our examples.
For task 11 and 12, take screenshots of your queries, and write your analysis into the report. Try to explain whatre your queries doing, why you write these queries, whats the meaning of the result, whats your conclusion, etc.
For submission, also write every query task into a single `.sql` file, named as 7.sql,
8.sql , 11.sql, 12.sql
For the meaning of each table and column in part2, please refer to this page.
- (10%) List the average long_shots score(round to the second decimal place) of the players who had participated in the Italy Serie A league during 2015/2016 season with respect to the preferred foot. You should calculate the long_shots score by the newest data of each player.
preferred_foot | avg_long_shots |
left | 30.87 |
right | 20.87 |
- (10%) During the 2015/2016 season, for each of the leagues, if we have known that the average height of members in one team is over 180, what is the probability that the team can win?
The numerator is the winning count of those over-180-teams, and the denominator is the count of those over-180-teams. Round the probability to fourth decimal place. List out in the alphabetical order.
(e.g. In two matches, A and B, one team of A is over-180-teams and wins the game. Both team of B are over-180-teams and one of the team win the game, then the win probability is )
(e.g. In two matches, A and B, one team of A is over-180-teams and ties the game. Both team of B are not over-180-teams, then the win probability is 0) (You need to calculate the record of the same team in different matches. For example, team a1 and team a2 take match A, and team a1 and team b1 take match
B, then you need to consider a1 multiple times)
name | prob |
AppleLeague | 0.5566 |
BananaLeague | 0.3344 |
- (10%) The win point can be calculated by the following rule: For each match, the winning team will get two points. The loser will get zero point. If the match is a draw, both of the teams will get one point. The win points of each team is the point divided by the match count the team participating in during the whole season. The top five teams with the highest win points are called the greats of the season. Find out the average winning score (round to the second decimal place) and the teams long name of the greats of the season during the 2015/2016 season. List out in the decreasing order of their win points.
team_long_name | avg_win_score |
AppleTeam | 5.55 |
BananaTeam | 4.44 |
CherryTeam | 3.33 |
JellyTeam | 2.22 |
OreoTeam | 1.11 |
- (15%) We call it a landslide victory if there is a larger than or equal to five score gap between two teams in a match. And we call it an upset if any one of the sports betting companies has a higher betting odds on a team with landslide victory. Find out that for an upset, what is the average age of the player at that time and the average rating of the players of each team from the previous six months? Round the score to second decimal place and list out in the order of
match id increasingly.
(If a player has scores multiple times, please average all of them)
(Here we want to see the average data of players on the home side and away side in the upset matches. You dont need to consider which team the team belongs to)
id | home_player_avg_age | away_player_avg_age | home_player_avg_rating | away_player_avg_rating |
1 | 22.22 | 23.23 | 60.60 | 58.58 |
2 | 23.23 | 24.24 | 62.62 | 75.75 |
3 | 24.24 | 21.21 | 55.55 | 99.99 |
- (10%) Do the home team with home advantage has much more opportunity to win the game, or the team with a higher average score(which can be one of the overall_rating, dribbling, strength, interceptions, or the average of the four scores, calculated by the latest attribute before the player participating in the match) of the whole team players? Answer by your own view with one SQL query.
- (10%) You are a gambler of sport lottery. Analyzing this dataset with SQL and finding out the better way to place a bet. You can answer by your own view with multiple SQL queries. Focus on observation to the dataset and explain your analysis.
4. Discussion
TAs had opened a channel HW1 on New E3 forum of the course, you can post questions about the homework on the forum. TAs will answer questions as soon as possible.
Discussion rules:
- Do not ask for the answer to the homework.
- Check if someone has asked the same question before asking.
- We encourage you to answer other students questions, but again, do not give the answer of the homework. Reply the messages to answer questions.
- Since we have this discussion forum, do not send email to ask questions about the homework unless the questions are personal and you do not want to ask publicly.
Reviews
There are no reviews yet.