CIS 5450 Homework 1: Data Wrangling and Cleaning (Fall 2024)
Hello future data scientists and welcome to CIS 5450! In this homework, you will familiarize yourself with Pandas and Polars! Both are cute animals and essential libraries for Data Science. This homework focuses on one of the most important tasks in Data Science: preparing datasets so that they can be analyzed, plotted, used for machine learning models, and so on.
This homework is broken into four sections, each analyzing a dataset:
1. Working with Amazon Prime Video Data to understand the details behind its movies
2. Working on merged/joined versions of the datasets (more on this later though).
3. Regex
4. Working with a Used Cars dataset and Polars to compare the performance of Pandas, eager execution in Polars, and lazy execution in Polars.
IMPORTANT NOTE: Before starting, you must click on the “Copy To Drive” option in the top bar. This is the master notebook, so you will not be able to save your changes without copying it! Once you click on that, make sure you are working on that version of the notebook so that your work is saved.
Run the following 4 cells to set up the notebook.
%set_env HW_ID=cis5450_fall24_HW1
%%capture
!pip install penngrader-client
from penngrader.grader import *
import pandas as pd
import numpy as np
import seaborn as sns
from string import ascii_letters
import matplotlib.pyplot as plt
import datetime as dt
import requests
from lxml import html
import math
import re
import json
import os
!wget -nc https://storage.googleapis.com/penn-cis5450/credits.csv
!wget -nc https://storage.googleapis.com/penn-cis5450/titles.csv
What is Pandas?
Apart from animals, Pandas is a Python library to aid with data manipulation/analysis. It is built with support from NumPy, another Python library that provides efficient calculations for matrices and other math problems.
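As a tiny, hypothetical illustration of that relationship (the names below are made up for the example), a DataFrame can wrap a NumPy array and add labeled columns on top of it:
# Toy sketch only; relies on the pandas/numpy imports from the setup cells above
arr = np.arange(6).reshape(3, 2)                 # a plain NumPy array
toy_df = pd.DataFrame(arr, columns=['x', 'y'])   # label the columns
toy_df['x'].mean()                               # column math is backed by fast NumPy operations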
Let’s also get familiar with the PennGrader. It was developed specifically for 545 by a previous TA, Leonardo Murri.
PennGrader was developed to provide students with instant feedback on their answer. You can submit your answer and know whether it’s right or wrong instantly. We then record your most recent answer in our backend database. Let’s try it out! Fill in the cell below with your 8-digit Penn ID and then run the following cell to initialize the grader.
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY.
# IF NOT, THE AUTOGRADER WON'T KNOW WHOM TO ASSIGN POINTS TO IN OUR BACKEND
# YOUR PENN-ID GOES HERE AS AN INTEGER
STUDENT_ID = 99999998
# You should also update this to a unique “secret” just for this homework, to
# authenticate this is YOUR submission
SECRET = STUDENT_ID
Leave this cell as-is…
%%writefile notebook-config.yaml
grader_api_url: 'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23'
grader_api_key: 'flfkE736fA6Z8GxMDJe2q8Kfk8UDqjsG3GVqOFOa'
grader = PennGrader('notebook-config.yaml', 'cis5450_fall24_HW1', STUDENT_ID, SECRET)
We will use scores from PennGrader to determine your grade. You will still need to submit your notebook so we can check for cheating and plagiarism. Do not cheat.
Note: If you run PennGrader after the due date for any question, your assignment will be marked late, even if you already had full points for the question before the deadline. To remedy this, if you’re going to run your notebook after the deadline, either do not run the grading cells, or reinitialize the grader with an empty or clearly fake ID such as 999999999999 (please use 10+ digits so that it is clearly a fake STUDENT_ID).
Adding our data so that our code can find it
We can’t be data scientists without data! We provided code for you to download the data (the “wget” cell from earlier). If you open the Files view in the left sidebar, you should see the two downloaded CSV files (credits.csv and titles.csv).
Part 1: Working with Amazon Prime Video Data [38 points]
In this part of the homework we will be working with a dataset focused on Amazon Prime Video Movie Data!
1.0 Loading in Titles data (2 points)
Let’s first load our datasets into Pandas DataFrames. Use Pandas’s read_csv functionality, which you can find documentation for here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
While reading documentation is hard at first, we strongly encourage you to get into the habit of doing this, since many times your questions will be answered directly by the documentation (e.g., “why isn’t my dataframe dropping duplicates” or “why didn’t this dataframe update”).
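If you are unsure where to start, a minimal sketch of a read_csv call looks like this (assuming the CSVs were downloaded to the working directory by the wget cell above; this shows only the basic pattern):
# Minimal sketch; assumes titles.csv is in the current working directory
example_df = pd.read_csv('titles.csv')   # parse the CSV into a DataFrame with default options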
TODO
Save the Credits dataframe to a variable named: credits_df
Save the Titles dataframe to a variable named: titles_df
#TODO: Import your two files to pandas dataframes — make sure the dataframes are named correctly!
Let’s focus on titles_df for now and see what the dataframe looks like. Display the first 10 rows of the dataframe in the cell below (take a look at the documentation to find how to do this!)
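As a reminder of the pattern, on a throwaway toy frame (not the graded answer):
toy_df = pd.DataFrame({'a': range(20)})
toy_df.head(10)   # first 10 rows; head() with no argument returns 5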
#TODO: Display the first 10 rows of `titles_df`
Another thing that is often helpful to do is inspect the types of each column in a dataframe. Output the types of titles_df in the cell below.
# TODO: Display the datatypes in `titles_df`
Save the types of the type, release_year, runtime, seasons, imdb_id, and tmdb_score columns to a series called titles_df_types (retaining the index names) and pass them into the autograder cell below.
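As a hint, dtypes returns a Series indexed by column name, and indexing it with a list of labels yields a sub-Series that keeps those index names; a toy sketch with made-up columns:
toy_df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [2.5, 3.5]})
toy_types = toy_df.dtypes[['a', 'c']]   # Series indexed by ['a', 'c'] holding the corresponding dtypes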
# View the output here!
titles_df_types =
# TEST CASE: titles_df_types (2pt)
# [CIS 545 PennGrader Cell] – 2 points
grader.grade(test_case_id = ‘titles_df_types’, answer = titles_df_types)
1.1 Cleaning up Titles data (4 points)
When you work with data, you’ll have NaNs, duplicates, or columns that don’t give much insight into the data. There are different ways to deal with missing values (i.e., imputation, which you can read into on your own), but for now, let’s drop some of these rows in titles_df to clean up our data. Note that there might be multiple ways to do each step. Also note that a lot of the columns in titles_df are entirely null, so be sure to drop the unnecessary columns before filtering out rows with nulls.
Refer to the documentation if you get stuck — it’s your best friend!
TODO: 1.1
Make a new dataframe titles_cleaned_df.
Keep only the following columns: id, title, type, release_year, runtime, genres, production_countries, imdb_score, imdb_votes, tmdb_popularity, tmdb_score.
Drop rows that have NaNs in them. Use the info function to see the number of nulls in this DataFrame before and after, to sanity-check that your operation is correct.
Reset the index and drop the index column which stores the original index prior to resetting. We recommend you print out the intermediate dataframe prior to this to see that the indices are not consecutive!
Cast title and type to type string, and imdb_votes to type int.
Save the result to titles_cleaned_df. (A toy sketch of this chain appears after the note below.)
Note: The affected string columns should appear as string datatype and not object (you can check using df.dtypes). If they do not, we recommend checking the documentation to see how to successfully convert object columns into strings (Hint: cast as ‘string’ and not str).
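Here is the overall shape of the cleaning chain on a toy frame; the column names below are made up, and your answer should use the columns listed above:
toy_df = pd.DataFrame({'title': ['A', None, 'C'],
                       'votes': [10.0, 20.0, 30.0],
                       'all_null': [None, None, None]})
toy_clean = (
    toy_df[['title', 'votes']]                        # keep only the needed columns first
          .dropna()                                   # then drop rows that still contain NaNs
          .reset_index(drop=True)                     # reset the index and discard the old one
          .astype({'title': 'string', 'votes': int})  # cast to pandas 'string' and int
)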
#TODO: Keep only the necessary columns
#TODO: Drop nulls
#TODO: Reset and drop the index
#TODO: Cast type
# TEST CASE: titles_cleaned_df (4pt)
# [CIS 545 PennGrader Cell] – 4 points
grader.grade(test_case_id = ‘titles_cleaned_df’, answer = titles_cleaned_df)
1.2 Data Wrangling with Titles Data (8 points)
Now, let’s process the data in an appropriate format so that we can answer some queries more easily. Make sure to use titles_cleaned_df for this part.
TODO: 1.2
Create a column called is_movie that contains a value of 1 if the type of content is MOVIE and a value of 0 if not.
Create a genres_expanded column in titles_cleaned_df so that there are individual rows for each genre of each title. Hint: Make sure it is the correct type before doing this!
Similar to before, create a production_countries_expanded column so that there are individual rows for each country where the movie was produced.
Drop the redundant columns type, genres, and production_countries, as well as all null values, saving the result as titles_final_df. Make sure to reset and drop the index as well! (8 points)
Hint: See apply, explode, json.loads, and lambda in the Pandas and Python documentation; a toy illustration of this pattern appears after the notes below.
Note: Feel free to reference this Geeks4Geeks link to better understand how to use the json.loads() function. You may not import ast or eval to do this.
Note: We recommend printing out the intermediate steps and testing your logic on singular values and getting the correct answer, before applying it to the entire dataframe!
Note: Include rows with the type SHOW, too.
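A toy illustration of these patterns with made-up columns (the same idea applies to genres and production_countries):
toy = pd.DataFrame({'type': ['MOVIE', 'SHOW'],
                    'genres': ['["drama", "comedy"]', '["action"]']})
toy['is_movie'] = (toy['type'] == 'MOVIE').astype(int)                 # 1 for MOVIE, 0 otherwise
toy['genres_expanded'] = toy['genres'].apply(lambda s: json.loads(s))  # JSON string -> Python list
toy = toy.explode('genres_expanded')                                   # one row per list element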
# TODO
# TEST CASE: titles_final_df (8pt)
# [CIS 545 PennGrader Cell] – 8 points
grader.grade(test_case_id = ‘titles_final_df’, answer = titles_final_df)
1.3 Compute the Top Performing Genres
1.3.1 Compute the Best Genres By IMDb and TMDb Score (6 points)
In this section we will compute the top performing genres, and will use both data from the Internet Movie Database (IMDb) and The Movie Database (TMDb) to do so. We will use titles_final_df in this section.
TODO: 1.3.1
Use groupby() function
Create a dataframe genres from titles_final_df with only the columns genres_expanded, tmdb_popularity, imdb_score, and tmdb_score.
Filter genres to only keep those movies with tmdb_popularity greater than 2.0.
Create a dataframe genres_imdb_df that contains the average imdb_score for each genres_expanded. Make sure to keep the resultant genres_expanded and imdb_score columns.
Sort this in descending order, keeping only the top 10 values
Create a column called score that is the average score rounded to two decimal places
Reset the index and drop the index column
Have only score and genres_expanded as part of genres_imdb_df
Do the same steps used for creating genres_imdb_df to create genres_tmdb_df, with tmdb_score instead! (A toy sketch of this pattern appears below.)
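A toy sketch of the filter/groupby/sort pattern with made-up column names; adapt the columns and threshold to the steps above:
toy = pd.DataFrame({'genre': ['a', 'a', 'b', 'b', 'c'],
                    'popularity': [5.0, 1.0, 3.0, 4.0, 6.0],
                    'score': [7.0, 9.0, 8.0, 6.0, 5.0]})
toy = toy[toy['popularity'] > 2.0]                              # keep only popular rows
toy_avg = toy.groupby('genre', as_index=False)['score'].mean()  # average score per genre
toy_avg['score'] = toy_avg['score'].round(2)                    # round to two decimal places
toy_avg = (toy_avg.sort_values('score', ascending=False)
                  .head(10)
                  .reset_index(drop=True))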
#TODO: Create genres
#TODO: Create genres_imdb_df
#TODO: Create genres_tmdb_df
# TEST CASE: genres_df (6pt)
# [CIS 545 PennGrader Cell] – 6 points
grader.grade(test_case_id = ‘genres_df’, answer = (genres_imdb_df, genres_tmdb_df))
1.3.2 Compute the Percentage Difference Between Genres (6 points)
In this section we will compute the differences in results between genres_imdb_df and genres_tmdb_df .
TODO: 1.3.2
Merge genres_imdb_df and genres_tmdb_df on genres_expanded to create merged_df.
Rename the score columns to score_imdb and score_tmdb respectively.
Create a column difference in merged_df that is defined as the absolute value of the percentage difference between score_imdb and score_tmdb. Hint: Check out the abs function for help with this!
Make sure you do not use Python iteration (e.g., a for loop or while loop).
Sort merged_df in descending order by difference
Reset the index and drop the index column
merged_df should have score_imdb, genres_expanded, score_tmdb, and difference columns. (A toy merge sketch appears below.)
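A toy merge sketch with made-up values; note that the percentage-difference line below is only an illustrative assumption (relative to the first score), so substitute the exact formula given in the assignment:
left  = pd.DataFrame({'genre': ['a', 'b'], 'score': [8.0, 6.0]})
right = pd.DataFrame({'genre': ['a', 'b'], 'score': [7.5, 6.5]})
toy_merged = left.merge(right, on='genre')   # overlapping column names get _x / _y suffixes
toy_merged = toy_merged.rename(columns={'score_x': 'score_imdb', 'score_y': 'score_tmdb'})
# Vectorized column arithmetic, no Python loops; ASSUMED formula, for illustration only:
toy_merged['difference'] = ((toy_merged['score_imdb'] - toy_merged['score_tmdb'])
                            / toy_merged['score_imdb']).abs() * 100
toy_merged = toy_merged.sort_values('difference', ascending=False).reset_index(drop=True)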
# TODO
# TEST CASE: merged_df (6pt)
# [CIS 545 PennGrader Cell] – 6 points
grader.grade(test_case_id = ‘merged_df’, answer = merged_df)