You are required to design a prototype movie recommendation program in Python as detailed in the
assignment description. You will work individually.
We will NOT be using Autolab in this class for submitting assignments. This means your program will
only be graded once, after the submission deadline has passed.
Make sure you abide by the DCS Academic Integrity Policy for Programming Assignments.

What to submit
Write all the required functions described below in the given template file named hw1.py
Fill in the function code for all the functions in this template. Note: Every function has a pass statement
because Python does not allow empty functions. The pass statement does nothing, it’s simply a filler.
You can delete it when you write in your code. You may implement other helper functions as needed.
Make sure to test your programs on files other than the samples we have provided, to cover the
various paths of logic in your code.
Make sure you write ALL your test calls in the main() function. Do NOT write ANY code outside of any of
the functions. When you are done writing and testing, submit ONLY the filled-in hw1.py file to Canvas.
Do NOT submit a Jupyter notebook, it will not be accepted for grading.
You are allowed up to 5 submissions; only the last submission will be graded.

How to test your code
You can test your program by calling your functions in the hw1.py file from the main() function.
All test code must be in the main() function ONLY. If you write ANY code outside of any of the functions,
you will lose credit.

In Terminal, execute your program like this:

> python hw1.py

This was explained in class: see jan22_notes.ipynb/jan22_notes.html between cells #23 and #24 –
Writing and executing standalone (outside Jupyter notebook) Python programs.

Make sure the test files are in the same folder as the program. You may develop and test your code in a
Jupyter notebook, but for submission you will need to move your code over to hw1.py and execute it as
above to make sure it works correctly. The process of moving code from a Jupyter notebook to a Python
file has been explained in class.

You can run your tests on the given ratings and movies files, but testing on only these files may not be
sufficient. You should make your own test files as well, to make sure that you cover the various paths of
logic in your functions. You are not required to submit any of your test files.

You may assume that all parameter values to your functions will be legitimate, so you are not required to
check whether the parameter values are valid.
In any function that requires the returned values to be sorted or ranked, ties may be broken arbitrarily
between equal values.

You may retain the main() function when submitting, but we will IGNORE it.

For this assignment only, grading will be done by an AUTOGRADER program that will call the functions
in your hw1.py, and check the returned value against the expected correct value. It will NOT call your
main() function. The AUTOGRADER does not look at printed output, so anything you print in your program will be
ignored. There will not be any manual inspection of code; credit is based solely on whether your
functions return correct results.

Partial Credit
There is no partial credit for code structure, etc. Credit is given only when correct values are returned
from your functions. However, each function will be tested on several cases. So for instance, if a function
runs correctly on 2 out of 3 test cases, you will get full points for the 2 cases and zero for the third. (In
this sense, there is partial credit for each function.)

Data Input
• Ratings file: A text file that contains movie ratings. Each line has the name (with year) of a movie,
its rating (range 0-5 inclusive), and the id of the user who rated the movie. A movie can have
multiple ratings from different users. A user can rate a particular movie only once. A user can
however rate multiple movies. Here’s a sample ratings file.
• Movies file: A text file that contains the genres of movies. Each line has a genre, a movie id, and
the name (with year) of the movie. To keep it simple, each movie belongs to a single genre. Here’s
a sample movies file.

Note: A movie name includes the year, since it’s possible different movies have the same title, but
were made in different years. However, no two movies will have the same name in the same year.
You may assume that input files will be correctly formatted, and data types will be as expected. So you
don’t need to write code to catch any formatting or data typing errors.
For all computation of rating, do not round up (or otherwise modify) the rating unless otherwise specified.

Implementation

1. [10 pts] Write a function read_ratings_data(f) that takes in a ratings file name, and returns a
dictionary. (Note: the parameter is a file name string such as “myratings.txt”, NOT a file pointer.)
The dictionary should have movie as key, and the list of all ratings for it as value.
For example: movie_ratings_dict = { “The Lion King (2019)” : [6.0, 7.5, 5.1],
“Titanic (1997)”: [7] }

2. [10 pts] Write a function read_movie_genre(f) that takes in a movies file name and returns a
dictionary. The dictionary should have a one-to-one mapping from movie to genre.
For example { “Toy Story (1995)” : “Adventure”, “Golden Eye (1995)” : “Action”
}
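For instance, the two readers could be sketched as below. This is a minimal sketch, not the required solution: it assumes each line's fields are separated by "|" (e.g. Toy Story (1995)|4.0|12) — check the sample files for the actual delimiter and adjust the split accordingly.

```python
def read_ratings_data(f):
    """Map each movie name to the list of all its ratings (as floats)."""
    movie_ratings = {}
    with open(f) as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            movie, rating, _user = (fld.strip() for fld in line.split("|"))
            movie_ratings.setdefault(movie, []).append(float(rating))
    return movie_ratings

def read_movie_genre(f):
    """Map each movie name to its (single) genre."""
    movie_genre = {}
    with open(f) as fh:
        for line in fh:
            if not line.strip():
                continue
            genre, _movie_id, movie = (fld.strip() for fld in line.split("|"))
            movie_genre[movie] = genre
    return movie_genre
```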
Watch out for leading and trailing whitespaces in movie name and genre name, and remove them before
storing in the dictionary.

1. [8 pts] Genre dictionary
Write a function create_genre_dict that takes as a parameter a movie-to-genre dictionary, of
the kind created in Task 1.2. The function should return another dictionary in which a genre is
mapped to all the movies in that genre.
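A minimal sketch of this inversion using dict.setdefault:

```python
def create_genre_dict(movie_genre):
    """Invert a movie-to-genre mapping into genre -> list of movies."""
    genre_movies = {}
    for movie, genre in movie_genre.items():
        genre_movies.setdefault(genre, []).append(movie)
    return genre_movies
```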
For example: { genre1: [ m1, m2, m3], genre2: [m6, m7] }

2. [8 pts] Average Rating
Write a function calculate_average_rating that takes as a parameter a ratings dictionary, of
the kind created in Task 1.1. It should return a dictionary where the movie is mapped to its
average rating computed from the ratings list.
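A one-line sketch of the averaging (no rounding, per the computation note earlier):

```python
def calculate_average_rating(movie_ratings):
    """Map each movie to the plain (unrounded) mean of its ratings."""
    return {movie: sum(rs) / len(rs) for movie, rs in movie_ratings.items()}
```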
For example: {“Spider-Man (2002)”: [3,2,4,5]} ==> {“Spider-Man (2002)”: 3.5}

1. [10 pts] Popularity based
In services such as Netflix and Spotify, you often see recommendations with the heading “Popular
movies” or “Trending top 10”.

Write a function get_popular_movies that takes as parameters a dictionary of movie-to-average
rating (as created in Task 2.2), and an integer n (default should be 10). The function should return
a dictionary (movie:average rating, same structure as the input dictionary) of the top n movies based on
the average ratings. If there are fewer than n movies, it should return all movies in ranked order of
average ratings from highest to lowest.

2. [8 pts] Threshold Rating
Write a function filter_movies that takes as parameters a dictionary of movie-to-average rating
(same as for the popularity-based function above), and a threshold rating with a default value of 3.

The function should filter movies based on the threshold rating, and return a dictionary with the same
structure as the input. For example, if the threshold rating is 3.5, the returned dictionary should
have only those movies from the input whose average rating is equal to or greater than 3.5.

3. [12 pts] Popularity + Genre based
In most recommendation systems, the genre of the movie/song/book plays an important role. Often,
features like popularity, genre, and artist are combined to present recommendations to a user.
Write a function get_popular_in_genre that, given a genre, a genre-to-movies dictionary (as
created in Task 2.1), a dictionary of movie:average rating (as created in Task 2.2), and an integer
n (default 5), returns the top n most popular movies in that genre based on the average ratings.
The return value should be a dictionary of movie-to-average rating of movies that make the cut. If
there are fewer than n movies, it should return all movies in ranked order of average ratings from
highest to lowest.

Genres will be from those in the movie:genre dictionary created in Task 1.2. The genre name will
exactly match one of the genres in the dictionary, so you do not need to do any upper or lower
case conversion.

One important analysis for content platforms is to determine ratings by genre.
Write a function get_genre_rating that takes the same parameters as get_popular_in_genre
above, except for n, and returns the average rating of the movies in the given genre.

Write a function genre_popularity that takes as parameters a genre-to-movies dictionary (as
created in Task 2.1), a movie-to-average rating dictionary (as created in Task 2.2), and n (default
5), and returns the top-n rated genres as a dictionary of genre:average rating. If there are fewer
than n genres, it should return all genres in ranked order of average ratings from highest to
lowest. Hint: Use the above get_genre_rating function as a helper.

1. [10 pts] Read the ratings file to return a user-to-movies dictionary that maps user ID to a list of the
movies they rated, along with the rating they gave. Write a function named read_user_ratings
for this, with the ratings file as the parameter.

For example: { u1: [ (m1, r1), (m2, r2) ], u2: [ (m3, r3), (m8, r8) ] }
where ui is user ID, mi is movie, ri is corresponding rating. You can handle user ID as int or
string type, but make sure you consistently use it as the same type everywhere in your code.

2. [12 pts] Write a function get_user_genre that takes as parameters a user id, the user-to-movies
dictionary (as created in Task 4.1 above), and the movie-to-genre dictionary (as created in Task
1.2), and returns the top genre that the user likes based on the user’s ratings. Here, the top genre
for the user will be determined by taking the average rating of the movies genre-wise that the user
has rated. If multiple genres have the same highest rating for the user, return any one of the genres
(arbitrarily) as the top genre.

3. [12 pts] Recommend the 3 most popular (highest average rating) movies from the user’s top genre
that the user has not yet rated. Write a function recommend_movies for this, that takes as
parameters a user id, the user-to-movies dictionary (as created in Task 4.1 above), the movie-to-genre dictionary (as created in Task 1.2), and the movie-to-average rating dictionary (as created in
Task 2.2). The function should return a dictionary of movie-to-average rating. If fewer than 3
movies make the cut, then return all the movies that make the cut in ranked order of average
ratings from highest to lowest.

Given a CSV data file as represented by the sample file pokemonTrain.csv, perform the following operations on it.
1. [7 pts] Find out what percentage of “fire” type pokemons are at or above the “level” 40.
(This is the percentage over fire pokemons only, not all pokemons.)

Your program should print the value as follows (replace … with the value):
Percentage of fire type Pokemons at or above level 40 = …
The value should be rounded off (not ceiling) using the round() function. So, for instance, if the value is 12.3 (less than or
equal to 12.5) you would print 12, but if it was 12.615 (more than 12.5), you would print 13, as in:
Percentage of fire type Pokemons at or above level 40 = 13
Do NOT add % after the value (such as 13%); only print the number.

Print the value to a file named “pokemon1.txt”.
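A hedged sketch of this computation with the csv module. The column names "type" and "level" are assumptions taken from the task description — verify them against the header row of the sample file.

```python
import csv

def fire_pct_at_or_above(filename, level=40):
    """Percentage of 'fire'-type rows at or above `level`,
    computed over fire rows only (not all rows)."""
    fire = at_or_above = 0
    with open(filename, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["type"] == "fire":
                fire += 1
                if float(row["level"]) >= level:
                    at_or_above += 1
    return round(100 * at_or_above / fire)

def write_answer(filename="pokemonTrain.csv"):
    # The output file name must be exactly "pokemon1.txt".
    with open("pokemon1.txt", "w") as out:
        out.write("Percentage of fire type Pokemons at or above level 40 = "
                  + str(fire_pct_at_or_above(filename)) + "\n")
```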
If you do not print to a file, or your output file name is not exactly as required, you will get 0 points.

2. [10 pts] Fill in the missing “type” column values (given by NaN) by mapping them from the corresponding “weakness”
values.

You will see that typically a given pokemon weakness has a fixed “type”, but there are some exceptions. So, fill in
the “type” column with the most common “type” corresponding to the pokemon’s “weakness” value.

For example, most of the pokemons having the weakness “electric” are “water” type pokemons, but there are other types
too that have “electric” as their weakness (exceptions in that “type”). But since “water” is the most common type for
weakness “electric”, it should be filled in.
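The majority mapping could be built as follows. This is a sketch under assumptions: rows are dicts with "type" and "weakness" keys (e.g. from csv.DictReader), and a missing type appears as the literal string "NaN". Ties between equally common types are broken alphabetically.

```python
from collections import Counter

def most_common_type_by_weakness(rows):
    """rows: dicts with 'type' and 'weakness' keys (missing type == 'NaN').
    Returns weakness -> most common type, ties broken alphabetically."""
    counts = {}
    for row in rows:
        if row["type"] != "NaN":
            counts.setdefault(row["weakness"], Counter())[row["type"]] += 1
    # highest count wins; among equal counts, the alphabetically first type
    return {w: min(c.items(), key=lambda kv: (-kv[1], kv[0]))[0]
            for w, c in counts.items()}
```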
In case of a tie, use the type that appears first in alphabetical order.

3. [13 pts] Fill in the missing (NaN) values in the Attack (“atk”), Defense (“def”) and Hit Points (“hp”) columns as follows:
a. Set the pokemon level threshold to 40.
b. For a Pokemon having level above the threshold (i.e. > 40), fill in the missing value for atk/def/hp with the average
values of atk/def/hp of Pokemons with level > 40. So, for instance, you would substitute the missing “atk” value for
Magmar (level 44), with the average “atk” value for Pokemons with level > 40. Round the average to one decimal
place.
c. For a Pokemon having level equal to or below the threshold (i.e. <= 40), fill in the missing value for atk/def/hp with
the average values of atk/def/hp of Pokemons with level <= 40. Round the average to one decimal place.

After performing #2 and #3, write the modified data to another csv file named “pokemonResult.csv”.
This result file should have all of the rows from the input file – rows that were modified as well as rows that were
not modified.

If you do not write the modified data to another CSV file, or your output file name is not exactly as required, you
will get 0 points.

The following tasks (#4 and #5) should be performed on the pokemonResult.csv file that resulted above.

4. [10 pts] Create a dictionary that maps pokemon types to their personalities. This dictionary would map a string to a list of
strings. For example:
{“fire”: [“docile”, “modest”, …], “normal”: [“mild”, “relaxed”, …], …}

Your dictionary should have the keys ordered alphabetically, and also items ordered alphabetically in the values list, as
shown in the example above.
Print the dictionary in the following format:
Pokemon type to personality mapping:
fire: docile, modest, …
normal: mild, relaxed, …
…
Print the dictionary to a file named “pokemon4.txt”
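A sketch of building and printing the mapping in the required alphabetical order. It assumes rows are dicts with "type" and "personality" keys (e.g. from csv.DictReader over pokemonResult.csv); both names are assumptions to verify against the file header.

```python
def write_type_personality(rows, outname="pokemon4.txt"):
    """Write the type -> personalities mapping, keys and values sorted."""
    mapping = {}
    for row in rows:
        # a set removes duplicate personalities within a type
        mapping.setdefault(row["type"], set()).add(row["personality"])
    with open(outname, "w") as out:
        out.write("Pokemon type to personality mapping:\n")
        for ptype in sorted(mapping):
            out.write(ptype + ": " + ", ".join(sorted(mapping[ptype])) + "\n")
```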
If you do not print to a file, or your output file name is not exactly as required, you will get 0 points.

5. [5 pts] Find out the average Hit Points (“hp”) for pokemons of stage 3.0.
Your program should print the value as follows (replace … with value):
Average hit point for Pokemons of stage 3.0 = …
You should round off the value, like in #1 above.
Print the value to a file named “pokemon5.txt”.

If you do not print to a file, or your output file name is not exactly as required, you will get 0 points.

Testing
We will be testing your code with other similarly formatted files with different values. These test files will be renamed as
pokemonTrain.csv when running your code, so your code doesn’t need to accommodate different file names when testing.
(Likewise, you should test your code with similarly formatted test files, renaming them as pokemonTrain.csv before testing.)

It is OK to have other intermediate files generated during the creation of the result files; we will just ignore all files other than the
required result files.

Given a Covid-19 data CSV file with 12 feature columns, perform the tasks given below. Use the sample file covidTrain.csv to test
your code.

1. [5 pts] In the age column, wherever there is a range of values, replace it by the rounded-off average value. E.g., for 10-14
substitute 12. (Rounding should be done like in 1.1). You might want to use regular expressions here, but it is not required.

2. [6 pts] Change the date format for the date columns – date_onset_symptoms, date_admission_hospital and
date_confirmation – from dd.mm.yyyy to mm.dd.yyyy. Again, you can use regexps here, but it is not required.

3. [7 pts] Fill in the missing (NaN) “latitude” and “longitude” values by the average of the latitude and longitude values for the
province where the case was recorded. Round the average to 2 decimal places.
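One possible sketch of this fill, under assumptions: rows are dicts with "province", "latitude", and "longitude" keys, and missing values appear as the literal string "NaN" (verify both against the sample file).

```python
def fill_lat_long(rows):
    """Replace 'NaN' latitude/longitude with the province average (2 dp)."""
    sums = {}  # province -> [lat_sum, lon_sum, count]
    for row in rows:
        if row["latitude"] != "NaN" and row["longitude"] != "NaN":
            s = sums.setdefault(row["province"], [0.0, 0.0, 0])
            s[0] += float(row["latitude"])
            s[1] += float(row["longitude"])
            s[2] += 1
    for row in rows:
        s = sums.get(row["province"])
        if s:
            if row["latitude"] == "NaN":
                row["latitude"] = str(round(s[0] / s[2], 2))
            if row["longitude"] == "NaN":
                row["longitude"] = str(round(s[1] / s[2], 2))
    return rows
```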
4. [7 pts] Fill in the missing “city” values by the most occurring city value in that province. In case of a tie, use the city that
appears first in alphabetical order.

5. [10 pts] Fill in the missing “symptom” values by the single most frequent symptom in the province where the case was
recorded. In case of a tie, use the symptom that appears first in alphabetical order.

Note: While iterating through records, if you come across multiple symptoms for a single record, you need to consider them
individually for frequency counts.
Watch out!: Some symptoms could be separated by a ‘; ‘ , i.e., semicolon plus space and some by ‘;’ , i.e., just a semicolon,
even within the same record. For example:
“fever; sore throat;cough;weak; expectoration;muscular soreness”

Also, the symptoms column has values such as “fever 37.7 C” and “fever (38-39 C)”. For these values, you shouldn’t do any
special processing; the symptoms should be extracted as “fever 37.7 C” and “fever (38-39 C)”, as presented in the data.

After performing all these tasks, write the whole data back to another CSV file named “covidResult.csv”.

This result file should have all of the rows from the input file – rows that were modified as well as rows that were not
modified.
If you do not write data back to another CSV file, or your output file name is not exactly as required, you will get 0
points.

Testing
We will be testing your code with other similarly formatted files with different values. These test files will be renamed as
covidTrain.csv when running your code, so your code doesn’t need to accommodate different file names when testing.

(Likewise, you should test your code with similarly formatted test files, renaming them as covidTrain.csv before testing.)
It is ok to have other intermediate files generated during the creation of the result file, we will just ignore all files other than the
required result files.

For this problem, you are given a set of documents (text files) on which you will perform some preprocessing tasks, and then
compute what is called the TF-IDF score for each word. The TF-IDF score for a word is a measure of its importance within the
entire set of documents: the higher the score, the more important the word.

The input set of documents must be read from a file named “tfidf_docs.txt”. This file will list all the documents (one per line) you
will need to work with. For instance, if you need to work with the set “doc1.txt”, “doc2.txt”, and “doc3.txt”, the input file
“tfidf_docs.txt” contents will look like this:
doc1.txt
doc2.txt
doc3.txt

• Part 1: Preprocessing (30 pts)
For each document in the input set, clean and preprocess it as follows:
1. [15 pts] Clean.
▪ Remove all characters that are not words or whitespaces. Words are sequences of letters (upper and lower
case), digits, and underscores.
▪ Remove extra whitespaces between words, e.g., “Hello World!  Let’s learn Python!”, so that there is exactly
one whitespace between any pair of words.
▪ Remove all website links. A website link is a sequence of non-whitespace characters that starts with either
“http://” or “https://”.
▪ Convert all the words to lowercase.
The resulting document should only contain lowercase words separated by a single whitespace.

2. [7 pts] Remove stopwords.
From the document that results after #1 above, remove “stopwords”. These are the non-essential (or “noise”) words
listed in the file stopwords.txt.

3. [8 pts] Stemming and Lemmatization.

This is a process of reducing words to their root forms. For example, look at the following reductions: run, running,
runs → run. All three words capture the same idea ‘run’ and hence their suffixes are not as important.
(If you would like to get a better idea, you may want to read this article. This is completely optional; you can do the
assignment without reading the article.)

Use the following rules to reduce the words to their root form:
a. Words ending with “ing”: “flying” becomes “fly”
b. Words ending with “ly”: “successfully” becomes “successful”
c. Words ending with “ment”: “punishment” becomes “punish”
These rules are not expected to capture all the edge cases of Stemming in the English language but are intended to
give you a general idea of the preprocessing steps in NLP (Natural Language Processing) tasks.
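Taken together, steps #1–#3 might be sketched as below. This is a hedged sketch, not a reference solution: the regex details (treating words as \w sequences and links as non-whitespace runs after http:// or https://) are one reading of the rules above, and stopwords is assumed to be a set loaded from stopwords.txt.

```python
import re

def preprocess(text, stopwords):
    """Clean, remove stopwords, and apply the three suffix rules."""
    text = re.sub(r"https?://\S+", " ", text)   # drop website links first
    text = re.sub(r"[^\w\s]", " ", text)        # keep only words/whitespace
    words = text.lower().split()                # split() collapses whitespace
    words = [w for w in words if w not in stopwords]
    out = []
    for w in words:                             # crude suffix "stemming"
        for suf in ("ing", "ly", "ment"):
            if w.endswith(suf):
                w = w[: -len(suf)]
                break
        out.append(w)
    return " ".join(out)
```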
After performing #1, #2, and #3 above for each input document, write the modified data to another text file with the
prefix “preproc_”. For instance, if the input document is “doc1.txt”, the output should be “preproc_doc1.txt”.

If you do not print to a file, or your output file name is not exactly as required, you will get 0 points.

• Part 2: Computing TF-IDF Scores (30 pts)
Once preprocessing is performed on all the documents, you need to compute the Term Frequency (TF) –
Inverse Document Frequency (IDF) score for each word.
What is TF-IDF?
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse
document frequency, is a numerical statistic that is intended to reflect how
important a word is to a document in a collection or corpus.
Resources:
◦ TFIDF Python Example
◦ tf-idf Wikipedia Page
◦ TF-IDF/Term Frequency Technique
Steps:
a. For each preprocessed document that results from the preprocessing in Part 1, compute frequencies of all the
distinct words in that document only. So if you had 3 documents in the input set, you will compute 3 sets of word
frequencies, one per document.
b. Compute the Term Frequency (TF) of each distinct word (also called term) for each of the preprocessed documents:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

Note: The denominator, the total number of terms, is the sum total of all the words, not just unique instances. So if a
word occurs 5 times, and the total number of words in a document is 100, then TF for that word is 5/100.
c. Compute the Inverse Document Frequency (IDF) of each distinct word for each of the preprocessed documents.
IDF is a measure of how common or rare a word is in a document set (a set of preprocessed text files in this case). It
is calculated by taking the logarithm of the following term:
IDF(t) = log((Total number of documents) / (Number of documents the word is found in)) + 1
Note: The log here uses base e, and 1 is added after the log is taken, so that the IDF score is guaranteed to be non-zero.
d. Calculate the TF-IDF score: TF * IDF for each distinct word in each preprocessed document. Round the score to 2
decimal places.
e. Print the top 5 most important words in each preprocessed document according to their TF-IDF scores. The higher
the TF-IDF score, the more important the word. In case of ties in score, pick words in alphabetical order. You should
print the result as a list of (word,TF-IDF score) tuples sorted in descending TF-IDF scores. See the Testing section
below, in files tfidf_test1.txt and tfidf_test2.txt, for the exact output format.
Print to a file prefixed with “tfidf_”. So if the initial input document was “doc1.txt”, you should print the TF-IDF results to “tfidf_doc1.txt”.
If you do not print to a file, or your output file name is not exactly as required, you will get 0
points.
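Steps a–e might be sketched as follows. This is a hedged sketch over in-memory word lists (the per-document word lists produced by preprocessing); reading the documents and writing the “tfidf_” files is omitted.

```python
import math
from collections import Counter

def tfidf_top5(doc_words):
    """doc_words: {doc_name: list of preprocessed words}.
    Returns {doc_name: top-5 [(word, score), ...]}, ties broken alphabetically."""
    n_docs = len(doc_words)
    # number of documents each distinct word appears in
    doc_freq = Counter(w for words in doc_words.values() for w in set(words))
    result = {}
    for name, words in doc_words.items():
        counts = Counter(words)
        scores = {w: round((c / len(words)) *
                           (math.log(n_docs / doc_freq[w]) + 1), 2)
                  for w, c in counts.items()}
        # highest score first; equal scores fall back to alphabetical order
        result[name] = sorted(scores.items(),
                              key=lambda kv: (-kv[1], kv[0]))[:5]
    return result
```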
Testing:1. You can begin with the following three sentences as separate documents against which to test your code:
◦ #d1 = “It is going to rain today.”
◦ #d2 = “Today I am not going outside.”
◦ #d3 = “I am going to watch the season premiere.”

You can match the values computed by your code with this same example in the TF-IDF/Term Frequency Technique page
referenced above. Look for it under “Let’s cover an example of 3 documents” on this page. (Note: We are adding 1 to the
log for our IDF computation.)

2. Next, you can test your code against test1.txt and test2.txt. Compare your resulting preprocessed documents with our
results in preproc_test1.txt and preproc_test2.txt, and your TF-IDF results with our results in tfidf_test1.txt and
tfidf_test2.txt.

3. Finally, you can try your code on these files: covid_doc1.txt, covid_doc2.txt, and covid_doc3.txt. Results for these are
not provided, however the files are small enough that you can identify the words that make the cut and manually compute
TF-IDF.

Note: When we test your submission, the input test file will be named tfidf_docs.txt, as stated at the start of Part 3.
However, the file names contained in this file may be different from the samples given above (i.e. “doc1.txt”, “doc2.txt”, ..), so
make sure that your code can read whatever file names appear in the tfidf_docs.txt file and work on them.

Given a CSV data file as represented by the sample file EuCitiesTemperatures.csv (213 records), load it
into a Pandas DataFrame and perform the following tasks on it.

Preprocessing/Analysis (28 pts)
1. [9 pts] Fill in the missing latitude and longitude values by calculating the average for that country.
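A hedged sketch using groupby/transform. The column names 'country', 'latitude', and 'longitude' are assumptions — check the actual headers in EuCitiesTemperatures.csv.

```python
import pandas as pd

def fill_geo(df):
    """Fill missing latitude/longitude with the country mean, rounded to 2 dp."""
    for col in ["latitude", "longitude"]:
        df[col] = df[col].fillna(
            df.groupby("country")[col].transform("mean").round(2))
    return df

# usage sketch: df = fill_geo(pd.read_csv("EuCitiesTemperatures.csv"))
```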
Round the average to 2 decimal places.

2. [9 pts] Find out the subset of cities that lie between latitudes 40 to 60 (both inclusive) and
longitudes 15 to 30 (both inclusive). Find out which countries have the maximum number of cities
in this geographical band. (More than one country could have the maximum number of values.)

3. [10 pts] Fill in the missing temperature values by the average temperature value of the similar
region type. A region type would be a combination of whether it is in the EU (yes/no) and whether it
has a coastline (yes/no).

For example, if we have a missing temperature value for Bergen, Norway, which is not in the EU
but lies on the coast, we will fill it with the average temperature of cities with EU=’no’ and
coastline=’yes’.

Visualization (27 pts)
For all plots, make sure to label the axes, and set appropriate tick labels.
1. [6 pts] Plot a bar chart for the number of cities belonging to each of the regions described in
Preprocessing/Analysis #3 above.

2. [7 pts] Plot a scatter plot of latitude (y-axis) vs. longitude (x-axis) values to get a map-like visual of
the cities under consideration. All the cities in the same country should have the same color.

3. [6 pts] The population column contains values unique to each country. So two cities of the same
country will show the same population value. Plot a histogram of the number of countries
belonging to each population group: split the population values into 5 bins (groups).

4. [8 pts] Plot subplots (2, 2), with proper titles, one each for the region types described in
Preprocessing/Analysis #3 above.
Each subplot should be a scatter plot of Latitude (y-axis) vs. City (x-axis), where the color of the
plot points should be based on the temperature values: ‘red’ for temperatures above 10, ‘blue’ for
temperatures below 6, and ‘orange’ for temperatures between 6 and 10 (both inclusive). For each
subplot, set xticks to an array of numbers from 0 to n-1 (both inclusive), where n is the total
number of cities in each region type. This represents each city as a number between 0 and n-1.

Given a CSV data file as represented by the sample file GermanCredit.csv (1000 records), load it into a
Pandas DataFrame, and perform the following tasks on it.

Preprocessing (31 pts)
1. [8 pts] Drop the 3 columns that contribute the least to the dataset. These would be the columns
with the highest number of non-zero ‘none’ values. Break ties by going left to right in columns.
(Your code should be generalizable to drop n columns, but for the rest of the analysis, you can call
your code for n=3.)

2. [4 pts] Certain values in some of the columns contain unnecessary apostrophes (‘). Remove the
apostrophes.

3. [5 pts] The checking_status column has values in 4 categories: ‘no checking’, ‘<0’, ‘0<=X<200’,
and ‘>=200’. Change these to ‘No Checking’, ‘Low’, ‘Medium’, and ‘High’ respectively.
4. [5 pts] The savings_status column has values in 5 categories: ‘no known savings’, ‘<100’,
‘100<=X<500’, ‘500<=X<1000’, and ‘>=1000’. Change these to ‘No Savings’, ‘Low’, ‘Medium’,
‘High’, and ‘High’ respectively. (Yes, the last two are both ‘High’.)

5. [4 pts] Change class column values from ‘good’ to ‘1’ and ‘bad’ to ‘0’.
6. [5 pts] Change the employment column value ‘unemployed’ to ‘Unemployed’, and for the others,
change to ‘Amateur’, ‘Professional’, ‘Experienced’ and ‘Expert’, depending on the year range.

Analysis (17 pts)
For the following tasks, do preprocessing or changing of data types in the data frame as required.
1. [5 pts] Often we need to find correlations between categorical attributes, i.e. attributes that have
values that fall in one of several categories, such as “yes”/”no” for attr1, or “low”,”medium”,”high”
for attr2.

One such correlation is to find counts in combinations of categorical values across attributes, as in
how many instances are “yes” for attr1 and “low” for attr2. A good way to find such counts is to
use the Pandas crosstab function. Do this for the following two counts.

a. [3 pts] Get the count of each category of foreign workers (yes and no) for each class of
credit (good and bad).
b. [2 pts] Similarly, get the count of each category of employment for each category of
savings_status.

2. [4 pts] Find the average credit_amount of single males that have 4<=X<7 years of employment.
You can leave the raw result as is, no need for rounding.
3. [4 pts] Find the average credit duration for each of the job types. You can leave the raw result as
is, no need for rounding.

4. [4 pts] For the purpose ‘education’, what is the most common checking_status and
savings_status? Your code should print:
Most common checking status: …
Most common savings status: …

Visualization (24 pts)
1. [9 pts] Plot subplots of two bar charts: one for savings_status (x-axis) and the other for
checking_status (x-axis). In each chart, the y-axis represents number of people. Moreover, for
each category of savings_status (checking_status), we need you to display four bars, each
corresponding to one of the “personal_status” categories. Each personal status category bar
should be of a different color.

2. [9 pts] For people having credit_amount more than 4000, plot a bar graph which maps
property_magnitude (x-axis) to the average customer age for that magnitude (y-axis).
3. [6 pts] For people with a “High” savings_status and age above 40, use subplots to plot the
following pie charts:
a. Personal status
b. Credit history
c. Job

Given an Excel data file as represented by the sample file GooglePlaystore.xlsx (10K records), load it
into a Pandas DataFrame (use the Pandas read_excel method), and perform the following tasks on it.

Preprocessing (28 pts)
1. [3 pts] Often there are outliers which do not match the overall data type. There is one record in
this data where the “Reviews” has value “3.0M” which does not match the rest of the data.
Remove that record.

2. [4 pts] Remove rows where any of the columns has the value “Varies with device”.
3. [5 pts] The values in the Android version column should be floats. Strip the trailing non-numeric
characters from all values (i.e. the words ” and up”), so the result is a number. If there are multiple
decimal places (eg. “x.y.z”), keep only the first two parts (eg “x.y”). For example, the value “4.1
and up” should be changed to “4.1”. The value “4.5.6 and up” should be changed to “4.5”. The
value “5.6.7” should be changed to “5.6”.If there is a range (eg. 5.0 – 8.0), only consider the first number. For example, the value “5.0 – 8.0”
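The trimming described for this task can be sketched with a regular expression that keeps the first two numeric parts of the first number in the value (including the range case covered in this task). This is a sketch, not the only way to do it; the helper name is hypothetical.

```python
import re

def clean_android_version(value):
    """'4.1 and up' -> '4.1'; '4.5.6 and up' -> '4.5'; '4.0.3 - 7.1.1' -> '4.0'."""
    m = re.match(r"(\d+)\.(\d+)", str(value))  # first two parts of first number
    return m.group(1) + "." + m.group(2) if m else value
```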
should be changed to “5.0”. The value “4.0.3 – 7.1.1” should be changed to “4.0”.

4. [5 pts] The “Installs” column must have integer values. For values that have commas, remove the
commas. For values that have a ‘+’ at the end, remove the ‘+’. Keep only those rows that have an
integer value after these edits.

5. [5 pts] For missing rating values, if the number of reviews is less than 100 and installations is less
than 50000, remove the row. Else, fill the missing value with the average value (rounded to 2
decimal places) for the Category of that row.

6. [6 pts] Preprocess the Size column to convert the “M” (millions) and “K” (thousands) values into
integers. For instance, 8.7M should be converted to 8700000 and 2.4K should be converted to
2400.

Analysis (19 pts)
For the following tasks, do preprocessing or changing of data types in the data frame as required.
1. [4 pts] Describe (use DataFrame describe method) the category wise rating statistics. In other
words, for each category, describe the statistics (count, mean, etc.) for ratings in that category.

2. [11 pts] Extract all “Free” apps from the master data frame. Then write a function that, given a
numeric column e.g ‘Rating’), will create and return a dataframe for the top 3 free applications in
each category based on that column. Call the function on each of these columns:
a. Rating (gives top 3 most highly rated applications in each category)b. Installs (gives top 3 most installed applications in each category)
c. Reviews (gives top 3 most reviewed applications in each category)
You don’t need to do anything explicit to break ties.
Each of the returned dataframes have Category and App for the first two columns, and one of
Rating (for a.), Installs (for b.), and Reviews (for c.) as the third column, as for instance:3. [4 pts] Find the average, maximum and minimum price of the paid applications.Visualization (16 pts)
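One way the three analysis tasks above might be sketched, assuming the data frame has columns App, Category, Rating, Type, and Price (a “Type” column holding “Free”/“Paid” is an assumption about this dataset):

```python
import pandas as pd

# Analysis task 1: per-category rating statistics (count, mean, std, ...).
def category_rating_stats(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("Category")["Rating"].describe()

# Analysis task 2: top 3 apps in each category by a given numeric column.
def top3_per_category(free_df: pd.DataFrame, col: str) -> pd.DataFrame:
    return (free_df.sort_values(col, ascending=False)
                   .groupby("Category", group_keys=False).head(3)
                   .sort_values(["Category", col], ascending=[True, False])
                   [["Category", "App", col]]
                   .reset_index(drop=True))

# Analysis task 3: average, maximum, and minimum price of paid apps.
def paid_price_stats(df: pd.DataFrame) -> pd.Series:
    return df.loc[df["Type"] == "Paid", "Price"].agg(["mean", "max", "min"])
```

Typical usage: first extract the free apps with something like `free = df[df["Type"] == "Free"]`, then call `top3_per_category(free, "Rating")`, `top3_per_category(free, "Installs")`, and `top3_per_category(free, "Reviews")`.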
1. [9 pts] In the genre column, break the string of genres into a list. For example, ‘Art & Design; Creativity’ should become [‘Art & Design’, ‘Creativity’].
Count the number of applications per genre and display it using a pie chart.
Hint: Read about DataFrame.explode()

2. [7 pts] Display a box plot of ratings for the “Business” and “Education” categories. The box plots should be in the same plot.

You are given the following description of the entities that need to be stored in the database. Your task is
to design a database schema (a set of tables) to store these entities.

Your schema must be minimally redundant in storing data. In other words, you should build a set of tables that minimizes the repetition of data by using foreign keys – credit will be awarded accordingly.
• Artist: An individual or a group/band, uniquely identified by their name. An artist might release
albums, as well as songs that are not in albums (singles).

• Song: A song has a title and is performed by an artist, either as a part of an album, or as a single
that’s not part of an album. Every song in an album has the release date of the album, but a single
song has its own release date. A song title is unique to an artist (the same artist records a song exactly once), but the title may be shared by multiple artists (i.e., covers).
A song belongs to one or more genres. For example, a song could be in a single genre, such as R
& B, or could be in multiple genres such as Pop and Rock. Genres are pre-defined, and every
song must be in at least one genre. Also, songs in an album need not all be in the same genre.

• Album: An album is a collection of songs released by an artist, on a certain date. For example,
the album Achtung Baby was made by the artist (band) U2, released on November 19, 1991. An
album name is not unique, but the combination of album name and artist name is unique.

• User: A user is uniquely identified by their username. A user can optionally have one or more
playlists, and optionally have ratings for songs, albums, or playlists. In other words, it’s possible
that a user has no playlists and hasn’t given any ratings.

• Playlist: A user can make any number of playlists of songs. Note: A playlist may not include an
entire album, only individual songs. Each song is either from some album, or a single that’s not in
any album.
Every playlist has a title, and a date+time when it was created. A playlist may be modified any
number of times after creation by adding or removing songs, but the title and date+time will not
change.
The title of a playlist is not unique, since different users might create playlists with the same title.
However, a user’s playlists will have unique titles.

• Rating: A user could rate an album, a song (even if it’s in an album), or a playlist. A rating is
limited to 1, 2, 3, 4, or 5 (numeric), and is made on a specific date.

Note: The items listed above do NOT necessarily correspond 1-1 with tables, although some of them
might. They simply detail all the data you will need to store, whatever table structure you adopt. This also
means you can create as many tables as you need to reduce redundancy.

Your database structure should have the most appropriate data type and size for each column in each
table.
For size of data, think of a realistic online music service and imagine how many songs/artists/albums/
playlists/users/ratings it might have to support. The idea is to use the least amount of storage space for
each column that will be able to store the entire range of foreseeable values.

Make sure you define and specify all primary keys, foreign keys, unique-valued columns or unique-valued combinations of columns, and null/non-null properties for columns.
In the document you will submit, type in the create table statement for each of the tables you create in
the database. If you don’t have the full create statement for a table, you will not get credit for it.

Note: When you test your design in MySQL, you might use alter table statements after the initial
create. However, for the submission, you are required to rewrite the whole sequence as a single create
table statement per table.

Every query must be written in a single SQL statement, meaning that if you were to write it in a MySQL
client session on a terminal, there would be a single terminating semicolon. So, for example, you can
have nested or multiple SQLs for a query, provided you can write it all up with a single terminating
semicolon in a MySQL client session. No Python code!

For any of the queries:
• If the result might require breaking ties, then unless otherwise specified in the query, let the
MySQL engine deal with it (you need not do anything explicit)
• If the result has fewer than the required number of entities, report all of them.
• For all queries that ask for ‘top n’ or ‘most’, the result must appear from highest ranked to lowest
ranked.

Type the SQL queries in the document you will submit, and make sure to write the query number against
each query. (If you want to play it extra safe, copy the query statement from this list, then write your
answer SQL query.)

Write queries for the following.
1. Which 3 genres are most represented in terms of number of songs in that genre?
The result must have two columns, named genre and number_of_songs.
2. Find names of artists who have songs that are in albums as well as outside of albums (singles).
The result must have one column, named artist_name.

3. What were the top 10 most highly rated albums (highest average user rating) in the period
1990-1999? Break ties using alphabetical order of album names. (Period refers to the rating date,
NOT the date of release).
The result must have two columns, named album_name and average_user_rating.

4. Which were the top 3 most rated genres (this is the number of ratings of songs in genres, not the
actual rating scores) in the years 1991-1995? (Years refers to the rating date, NOT the date of
release).
The result must have two columns, named genre_name and number_of_song_ratings.

5. Which users have a playlist that has an average song rating of 4.0 or more? (This is the average
of the average song rating for each song in the playlist.) A user may appear multiple times in the
result if more than one of their playlists make the cut.
The result must have 3 columns, named username, playlist_title, and average_song_rating.

6. Who are the top 5 most engaged users in terms of the number of ratings that they have given to
songs or albums? (In other words, they have given the most number of ratings to songs or albums
combined.)
The result must have 2 columns, named username and number_of_ratings.

7. Find the top 10 most prolific artists (greatest number of songs) in the years 1990-2010. Count each
song in an album individually.
The result must have 2 columns, named artist_name and number_of_songs.

8. Find the top 10 songs that appear in the greatest number of playlists. Break ties in alphabetical order of song
titles.
The result must have 2 columns, named song_title and number_of_playlists.

9. Find the top 20 most rated singles (songs that are not part of an album). “Most rated” means the
number of ratings, not actual rating scores. The result must have 3 columns, named song_title,
artist_name, and number_of_ratings.

10. Find all artists who discontinued making music after 1993.
The result should be a single column named artist_name.