CSE 6242 / CX 4242: Data and Visual Analytics HW 1: End-to-end analysis of TMDb data, SQLite, D3 Warmup, OpenRefine, FlaskVast amounts of digital data are generated each day, but raw data is often not immediately “usable”. Instead, we are interested in the information content of the data such as what patterns are captured? This assignment covers useful tools for acquiring, cleaning, storing, and visualizing datasets. In questions 1 & 2, we’ll perform a simple end-to-end analysis using data from The Movie Database (TMDb).We will collect movie data via API, store the data in csv files, and analyze data using SQL queries. For Q3, we will complete a D3 warmup to prepare our students for visualization questions in HW2. Q4 & 5 will provide an opportunity to explore other industry tools used to acquire, store, and clean datasets. The maximum possible score for this homework is 100 points. Contents Download the HW1 Skeleton before you begin. ………………………………………………………………………… 1Homework Overview…………………………………………………………………………………………………………………. 1 Important Notes ………………………………………………………………………………………………………………………. 2 Submission Notes…………………………………………………………………………………………………………………….. 2 Do I need to use the specific version of the software listed?……………………………………………………………. 2 Q1 [40 points] Collect data from TMDb to build a co-actor network…………………………………………………… 3 Q2 [35 points] SQLite ……………………………………………………………………………………………………………….. 4 Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species…………………………………………….. 7 Q4 [5 points] OpenRefine …………………………………………………………………………………………………………10 Q5 [5 points] Introduction to Python Flask …………………………………………………………………………………..12 2 Version 1Important Notes A. Submit your work by the due date on the course schedule. a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking. b. Before the grace period expires, you may resubmit as many times as you need. c. TA assistance is not guaranteed during the grace period. d. Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 1 Q1 [40 points] Collect data from TMDb to build a co-actor network Leveraging the power of APIs for data acquisition, you will build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, you will create a graph showing the relationships between actors based on their highly rated movies. This will not only highlight the practical application of APIs in collecting rich datasets, but also introduce the importance of graphs in understanding and visualizing the real-world dataset. Technology • Python 3.10.x • TMDb API version 3 Allowed Libraries The Python Standard Library and Requests only. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q1.py: The completed Python file • nodes.csv: The csv file containing nodes • edges.csv: The csv file containing edges Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval. Tasks and point breakdown 1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py. a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In/ out-degrees are not defined for undirected graphs. 2. [10 pts] Implementation of the TMDbAPIUtils class according to instructions in Q1.py. Use version 3 of the TMDb API to download data about actors and their co-actors. To use the API: a. Create a TMDb account and follow the instructions on this document to obtain an API key. b. Be sure to use the key, not the token. This is the shorter of the two. c. Refer to the TMDB API Documentation as you work on this question. 3. [20 pts] Producing correct nodes.csv and edges.csv. a. If an actor’s name has comma characters (“,”), remove those characters before writing that name into the CSV files. 4 Version 1 Q2 [35 points] SQLite SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file that does not need to be parsed explicitly (unlike CSV files, which must be parsed). You can find instructions to install SQLite here. In this question, you will construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions. You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite. Technology • SQLite release 3.37.2 • Python 3.10.x Allowed Libraries Do not modify import statements. Everything you need to complete this question has been imported for you. Do not use other libraries for this question. Max runtime 10 minutes. Submissions exceeding this will receive zero credit. Deliverables • Q2.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence. IMPORTANT NOTES: • If the final output asks for a decimal column, format it to two places using printf(). Do NOT use the ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that column, be sure you sort it using the actual decimal value and not the string returned by printf. • A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False. • In this question, you must only use INNER JOIN when performing a join between two tables, except for part g. Other types of joins may result in incorrect results. Tasks and point breakdown 1. [9 points] Create tables and import data. a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types: i. movies 1. id (integer) 2. title (text) 3. score (real) ii. movie_cast 1. movie_id (integer) 2. cast_id (integer) 3. cast_name (text) 4. birthday (text) 5. popularity (real) b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table i. Write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Make sure you use paths relative to the Q2 directory. c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning. 5 Version 1 i. cast_bio 1. cast_id (integer) 2. cast_name (text) 3. birthday (text) 4. popularity (real) 2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the speed improvement may be negligible for this small database, it is significant for larger databases. a. movie_index for the id column in movies table b. cast_index for the cast_id column in movie_cast table c. cast_bio_index for the cast_id column in cast_bio table 3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage. a. Output format and example value: 7.70 4. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order. a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2 5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. a. Output format and example values (movie_title,score,cast_count): Star Wars: Holiday Special,75.01,12 Games,58.49,33 6. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically. a. Exclude movies with score < 25 before calculating average_score. b. Include only cast members who have appeared in three or more movies with score >= 25. i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00 7. [2 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND the average score of these movies is >= 40. The view should have the format: good_collaboration( cast_member_id1, cast_member_id2, movie_count, average_movie_score) For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that 6 Version 1 you may have used for testing. Optional Reading: Why create views? 8. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2. a. Order your output by collaboration_score in descending order, then by cast_name alphabetically. b. Output format and example values(cast_id,cast_name,collaboration_score): 2,Mark Hamil,99.32 1920,Winoa Ryder,88.32 9. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). a. [1 point] Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text) NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries. NOTE: If you have issues that fts is not enabled, try the following steps • Go to sqlite3 downloads page: https://www.sqlite.org/download.html • Download the dll file for your system • Navigate to your Python packages folder, e.g., C:Users… …Anaconda3pkgssqlite-3.29.0- he774522_0Librarybin • Drop the downloaded .dll file in the bin. • In your IDE, import sqlite3 again, fts should be enabled. b. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings. i. Example: Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’ Disallowed: ‘gunfight’, ‘fighting’, etc. ii. Output format and example value: 12 c. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. i. Example: Allowed: ‘In Space there was a program’, ‘In this space program’ Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’ ii. Output format and example value: 6 7 Version 1 Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species In this question, you will utilize a dataset provided by TRAFFIC, an NGO working to ensure the global trade of wildlife is both legal and sustainable. TRAFFIC provides data through their interactive Wildlife Trade Portal, some of which we have already downloaded and pre-processed for you to utilize in Q3. Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse. Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., jdoe3@gatech.edu). This reading provides an important foundation you will need for Homework 2. The question and autograder have been developed and tested for D3 version 5 (v5), while the book covers v4. What you learn from the book is transferable to v5, as v5 introduced few breaking changes. We also suggest briefly reviewing chapters 1-3 for background information on web development. TRAFFIC International (2024) Wildlife Trade Portal. Available at www.wildlifetradeportal.org. Technology • D3 Version 5 (included in the lib folder) • Chrome 97.0 (or newer): the browser for grading your code • Python HTTP server (for local testing) Allowed Libraries D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. Deliverables • Q3.html: Modified file containing all html, javascript, and any css code required to produce the bar plot. Do not include the D3 libraries or q3.csv dataset. IMPORTANT NOTES: • Setup an HTTP server to run your D3 visualizations as discussed in the D3 lecture (OMS students: watch lecture video. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder. • We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work. • All d3*.js files are provided in the lib folder and referenced using relative paths in your html file. For example, since the file “Q3/Q3.html” uses d3, its header contains:. It is incorrect to use an absolute path such as:. The 3 files that are referenced are: a. lib/d3/d3.min.js b. lib/d3-dsv/d3-dsv.min.js c. lib/d3-fetch/d3-fetch.min.js • In your html / js code, use a relative path to read the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path must be “q3.csv” and NOT an absolute path such as “C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade, and you will lose points. As file paths are case-sensitive, ensure you correctly provide the relative path. • Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). Handle any data conversions that might be needed, e.g., strings that need to be converted to integer. See https://github.com/d3/d3-fetch#dsv. • VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout. Tasks and point breakdown Q3.html: When run in a browser, should display a horizontal bar plot with the following specifications: 8 Version 1 1. [3.5 points] The bar plot must display one bar for each of the five most trafficked species by count. Each bar’s length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the ‘count’ column in our dataset. 2. [1 point] The bars must have the same fixed thickness, and there must be some space between the bars, so they do not overlap. 3. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. They must not be hard-coded. The x-axis must be a element having the id: “x_axis” and the y-axis must be a element having the id: “y_axis”. 4. [2 points] Set x-axis label to ‘Count’ and y-axis label to ‘Species’. The x-axis label must be a element having the id: “x_axis_label” and the y-axis label must be a element having the id: “y_axis_label”. 5. [2 points] Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every 500 interval. The X-axis must be displayed below the plot. 6. [2 points] Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least on ‘Count’ and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot. 7. [1 point] Set the HTML title tag and display a title for the plot. Those two titles are independent of each other and need to be set separately. Set the HTML title tag (i.e.,). Position the title “Wildlife Trafficking Incidents per Species (2015 to 2023)” above the bar plot. The title must be a element having the id: “title”. 8. [0.25 points] Add your GT username (usually includes a mix of letters and numbers) to the area beneath the bottom-right of the plot. The GT username must be a element having the id: “credit” 9. [0.25 points] Fill each bar with a unique color. We recommend using a colorblind-safe pallete. NOTE: Gradescope will render your plot using Chrome and present you with a Dropbox link to view the screenshot of your plot as the autograder sees it. This visual feedback helps you adjust and identify errors, e.g., a blank plot indicates a serious error. Your design does not need to replicate the solution plot. However, the autograder requires the following DOM structure (including using correct IDs for elements) and sizing attributes to know how your chart is built. 9 Version 1 plot | width: 900 | height: 370 | +– containing Q3.a plot elements | +– containing bars | +– x-axis | | | +– (x-axis elements) | +– x-axis label | +– y-axis | | | +– (y-axis elements) | +– y-axis label | +– GTUsername | +– chart title 10 Version 1 Q4 [5 points] OpenRefine OpenRefine is a powerful tool for working with messy data, allowing users to clean and transform data efficiently. Use OpenRefine in this question to clean data from Mercari. Construct GREL queries to filter the entries in this dataset. OpenRefine is a Java application that requires Java JRE to run. However, OpenRefine v.3.6.2 comes with a compatible Java version embedded with the installer. So, there is no need to install Java separately when working with this version. Go through the main features on OpenRefine’s homepage. Then, download and install OpenRefine 3.6.2. The link to release 3.6.2 is https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2 Technology • OpenRefine 3.6.2 Deliverables • properties_clean.csv: Export the final table as a csv file. • changes.json: Submit a list of changes made to file in json format. Go to ‘Undo/Redo’ Tab → ‘Extract’ → ‘Export’. This downloads ‘history.json’ . Rename it to ‘changes.json’. • Q4Observations.txt: A text file with answers to parts b.i, b.ii, b.iii, b.iv, b.v, b.vi. Provide each answer in a new line in the output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line. Tasks and point breakdown 1. Import Dataset a. Run OpenRefine and point your browser at https://127.0.0.1:3333. b. We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset provided as “properties.csv”. c. Choose “Create Project” → This Computer → properties.csv. Click “Next”. d. You will now see a preview of the data. Click “Create Project” at the upper right corner. 2. [5 points] Clean/Refine the Data a. [0.5 point] Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows. Output format and sample values: i.rows: 500 NOTE: OpenRefine maintains a log of all changes. You can undo changes by the “Undo/Redo” button at the upper left corner. You must follow all the steps in order and submit the final cleaned data file properties_clean.csv. The changes made by this step need to be present in the final submission. If they are not done at the beginning, the final number of rows can be incorrect and raise errors by the autograder. b. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (i.e., in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column. Output format and sample values: ii.columns: 10 11 Version 1 NOTE: While multiple methods can split data, ensure new columns aren’t empty. Validate by sorting and checking for null values after using our suggested method in step b. c. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster by using (Edit Cells → Cluster and Edit …) this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the smallest number of clusters under the “Key Collision” method. Click ‘Select All’ and ‘Merge Selected & Close’. Provide the name of the keying function and the number of clusters produced. Output format and sample values: iii.function: fingerprint, 200 NOTE: Use the default Ngram size when testing Ngram-fingerprint. d. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells → Transform). Provide the expression used. Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”) NOTE: “Unknown” is case and space sensitive (“Unknown” is different from “unknown” and “Unknown “.) e. [0.5 point] Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this. Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”) f. [1.5 points] Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase in the GREL expression before you search for the terms. Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”) 12 Version 1 Q5 [5 points] Introduction to Python Flask Flask is a lightweight web application framework written in Python that provides you with tools, libraries, and technologies to build a web application quickly and scale it up as needed. In this question, you will build a web application that displays a table of TMDb data on a single-page website using Flask. You will modify the given file: wrangling_scripts/Q5.py Technology Python 3.10.x Flask Allowed Libraries Python standard libraries Libraries already imported in Q5.py Deliverables Q5.py: Completed Python file with your changes Tasks and point breakdown 1. username() – Update the username() method inside Q5.py by including your GT username. 2. Install Flask on your machine by running $ pip install Flask a. You can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped. 3. To run the code, navigate to the Q5 folder in your terminal/command prompt and execute the following command: python run.py. After running the command, go to http://127.0.0.1:3001/ on your browser. This will open up index.html, showing a table in which the rows returned by data_wrangling() are displayed. 4. You must solve the following two sub-questions: a. [2 points] Read and store the first 100 rows in a table using the data_wrangling() method. NOTE: The skeleton code, by default, reads all the rows from movies.csv. You must add the required code to ensure that you are reading only the first 100 data rows. The skeleton code already handles reading the table header for you. b. [3 points]: Sort this table in descending order of the values, i.e., with larger values at the top and smaller values at the bottom of the table in the last (3rd) column. Note that this column needs to be returned as a string for the autograder, but sorting may require float casting.CSE 6242 / CX 4242: Data and Visual Analytics HW 2: Tableau, D3 Graphs, and Visualization“Visualization gives you answers to questions you didn’t know you have” – Ben Schneiderman Download the HW2 Skeleton before you beginData visualization is an integral part of exploratory analysis and communicating key insights. This homework focuses on exploring and creating data visualizations using two of the most popular tools in the field; Tableau and D3.js. All 5 questions use data on the same topic to highlight the uses and strengths of different types of visualizations. The data comes from BoardGameGeek and includes games’ ratings, popularity, and metadata. Below are some terms you will often see in the questions: • Rating – a value from 0 to 10 given to each game. BoardGameGeek calculates a game’s overall rating in different ways including Average and Bayes, so make sure you are using the correct rating called for in a question. A higher rating is better than a lower rating. • Rank – the overall rank of a boardgame from 1 to n, with ranks closer to 1 being better and n being the total number of games. The rank may be for all games or for a subgroup of games such as abstract games or family games. The maximum possible score for this homework is 100 points. Students have the option to complete any 90 points’ worth of work to receive 100% (equivalent to 15 course total grade points) for this assignment. They can earn more than 100% if they submit additional work. For example, a student scoring 100 points will receive 111% for the assignment (equivalent to 16.67 course total grade points, as shown on Canvas). Download the HW2 Skeleton before you begin ……………………………………………………………………………….. 1Homework Overview …………………………………………………………………………………………………………………… 1 Important Notes………………………………………………………………………………………………………………………….. 2 Submission Notes ………………………………………………………………………………………………………………………. 2 Do I need to use the specific version of the software listed?………………………………………………………………. 2 Q1 [25 points] Designing a good table. Visualizing data with Tableau. ………………………………………………… 3 Important Points about Developing with D3 in Questions 2–5 ……………………………………………………………. 7 Q2 [15 points] Force-directed graph layout……………………………………………………………………………………… 8 Q3 [15 points] Line Charts……………………………………………………………………………………………………………10 Q4 [20 points] Interactive Visualization…………………………………………………………………………………………..14 Q5 [25 points] Choropleth Map of Board Game Ratings……………………………………………………………………18 2 Version 0Important Notes A. Submit your work by the due date on the course schedule. a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking. b. Before the grace period expires, you may resubmit as many times as needed. c. TA assistance is not guaranteed during the grace period. d. Submissions during the grace period will display as “late” but will not incur a penalty. e. We will not accept any submissions executed after the grace period ends. B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion. C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of array) and review any relevant materials online. However, each student must write up and submit the student’s own answers. D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class. Submission Notes A. All questions are graded on the Gradescope platform, accessible through Canvas. a. Question 1 will be manually graded after the final HW due date and Grace Period. b. Questions 2-5 are auto graded at the time of submission. B. We will not accept submissions anywhere else outside of Gradescope. C. Submit all required files as specified in each question. Make sure they are named correctly. D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive. E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check. F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify: a. The code is free of syntax errors (by running locally) b. All methods have been implemented c. The correct file was submitted with the correct name d. No extra packages or files were imported G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time. H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”). Do I need to use the specific version of the software listed? Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder. 3 Version 0 Q1 [25 points] Designing a good table. Visualizing data with Tableau. Goal Design a table, a grouped bar chart, and a stacked bar chart with filters in Tableau. Technology Tableau Desktop Deliverables Gradescope: After selecting HW2 – Q1, click Submit Images. You will be taken to a list of questions for your assignment. Click Select Images and submit the following four PNG images under the corresponding questions: ● table.png: Image/screenshot of the table in Q1.1 ● grouped_barchart.png: Image of the chart in Q1.2 ● stacked_barchart_1.png: Image of the chart in Q1.3 after filtering data for Max.Players = 2 ● stacked_barchart_2.png: Image of the chart in Q1.3 after filtering data for Max.Players = 4 a Q1 will be manually graded after the grace period. Setting Up Tableau Install and activate Tableau Desktop by following “HW2 Instructions” on Canvas. The product activation key is for your use in this course only. Do not share the key with anyone. If you already have Tableau Desktop installed on your machine, you may use this key to reactivate it. a If you do not have access to a Mac or Windows machine, use the 14-day trial version of Tableau Online: 1. Visit https://www.tableau.com/trial/tableau-online 2. Enter your information (name, email, GT details, etc.) 3. You will then receive an email to access your Tableau Online site 4. Go to your site and create a workbook a If neither of the above methods work, use Tableau for Students. Follow the link and select “Get Tableau For Free”. You should be able to receive an activation key which offers you a one-year use of Tableau Desktop at no cost by providing a valid Georgia Tech email. Connecting to Data 1. It is optional to use Tableau for Q1.1. Otherwise, complete all parts using a single Tableau workbook. 2. Q1 will require connecting Tableau to two different data sources. You can connect to multiple data sources within one workbook by following the directions here. 3. For Q1.1 and Q1.2: a. Open Tableau and connect to a data source. Choose To a File – Text file. Select the popular_board_game.csv file from the skeleton. b. Click on the graph area at the bottom section next to “Data Source” to create worksheets. 4. For Q1.3: a. You will need a data.world account to access the data for Q1.3. Add a new data source by clicking on Data – New Data Source. b. When connecting to a data source, choose To a Server – Web Data Connector. c. Enter this URL to connect to the data.world data set on board games. You may be prompted to log in to data-world and authorize Tableau. If you haven’t used data.world before, you will be required to create an account by clicking “Join Now”. Do not edit the provided SQL query. a NOTE: If you cannot connect to data-world, you can use the provided csv files for Q1 in the skeleton. The provided csv files are identical to those hosted online and can be loaded directly into Tableau. a d. Click the graph area at the bottom section to create another worksheet, and Tableau will automatically create a data extract. 4 Version 0 Table and Chart Design 1. [5 points] Good table design. Visualize the data contained in popular_board_game.csv as a data table (known as a text table in Tableau). In this part (Q1.1), you can use any tool (e.g., Excel, HTML, Pandas, Tableau) to create the table. We are interested in grouping popular games into “support solo” (min player = 1) and “not support solo” (min player > 1). Your table should clearly communicate information about these two groups simultaneously. For each group (Solo Supported, Solo Not Supported), show: a a. Total number of games in each category (fighting, economic, …) b. In each category, the game with the highest number of ratings. If more than one game has the same (highest) number of ratings, pick the game you prefer. NOTE: Level of Detail expressions may be useful if you use Tableau. c. Average rating of games in each category (use simple average), rounded to 2 decimal places. d. Average playtime of games in each category, rounded to 2 decimal places. e. In the bottom left corner below your table, include your GT username (In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. If you use Tableau, refer to the tutorial here). f. Save the table as table.png. (If you use Tableau, go to Worksheet/Dashboard Export Image). NOTE: Do not take screenshots in Tableau since your image must have high resolution. You can take a screenshot If you use HTML, Pandas, etc. a Your learning goal here is to practice good table design, which is not strongly dependent on the tool that you use. Thus, we do not require that you use Tableau in this part. You may decide the most meaningful column names, the number of columns, and the column order. You are not limited to only the techniques described in the lecture. For OMS students, the lecture video on this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus students, review lecture slides 42 and 43. 2. [10 points] Grouped bar chart. Visualize popular_board_game.csv as a grouped bar chart in Tableau. Your chart should display game category (e.g., fighting, economic,…) along the horizontal axis and game count along the vertical axis. Show game playtime (e.g., <=30, (30, 60]) for each category. NOTE: Do not differentiate between “support solo” and “non-support solo” for this question. a. a a. Design a vertically grouped bar chart. For each category, show the game count for each playtime. b. Include clearly labeled axes, a clear chart title, and a legend. c. In the bottom left corner of your image, include your GT username.NOTE: In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. Refer to the tutorial here. d. Save the chart as grouped_barchart.png (go to Worksheet/Dashboard Export Image. a. NOTE: Do not take screenshots in Tableau since your image must have high resolution. The main goal here is for you to get familiarized with Tableau. Thus, we kept this open-ended, so you can practice making design decisions. We will accept most designs. We show one possible design in Figure 1.2, based on the tutorial from Tableau. 3. [10 points] Stacked bar chart. Visualize the data.world dataset (or games_detailed_info_filtered.csv if using the local files in the skeleton) as a stacked bar chart. Showcase the count of games in different categories and the relationship between game categories, their mechanics, and max player size. a a. Create a Worksheet with a stacked bar chart that shows game counts for each playing mechanic (sub-bars) for each game category. NOTE: This data contains duplicate rows, as each row represents a distinct game. Do not remove duplicate rows from the data. b. Display game counts along the vertical axis and category along the horizontal axis. c. Include clear axes labels, a clear chart title, and a legend. d. Create a Dashboard using the worksheet you created. 5 Version 0 e. Add a filter for the number of ‘Max.Players’ allowed in each game. Update the chart using this filter to generate the following chart images (Refer to the tutorial here on how to add a filter in a dashboard. Make sure to add ‘Max.Players’ in the filter shelf in the Worksheet first, like this): i. Select “2 Players” only in the filter. Save the resulting chart as ‘stacked_barchart_1.png’ ii. Select “4 Players” only in the filter. Save the resulting chart as ‘stacked_barchart_2.png’ iii. Both images must include your GT username in the bottom left. This can be added using a text box. Refer to the tutorial here.https://youtu.be/fRwQenvBJ6I iv. In each image, the filter must be visible. If you are using Tableau Online, you may need to add your worksheet containing the chart to a dashboard and then download an image of the dashboard that contains both the filter and the chart. Note: To save a dashboard image, go to Dashboard – Export Image. Do not submit screenshots. An example of a possible design is shown in Figure 1.3. Optional Reading: The effectiveness of stacked bar charts is often debated—sometimes, they can be confusing, difficult to understand, and may make data series comparisons challenging. Figure 1.2: Example of a grouped bar chart. Your chart may appear different and can earn full credit if it meets all the stated requirements. Your submitted image should include your GT username in the bottom left. 6 Version 0 Figure 1.3: Example of a stacked bar chart after selecting “4 Players” in Max.Players filter. Your chart may appear different and can earn full credit if it meets all the stated requirements. Your submitted image should include your GT username in the bottom left. 7 Version 0 Important Points about Developing with D3 in Questions 2–5 1. We highly recommend that you use the latest Chrome browser to complete this question. We will grade your work using Chrome v92 (or higher). 2. You will work with version 5 of D3 in this homework. You must NOT use any D3 libraries (d3*.js) other than the ones provided in the lib folder. 3. For Q3–5, your D3 visualization MUST produce a DOM structure as specified at the end of each question. Not only does the structure help guide your D3 code design, but it also enables your code to be auto-graded (the auto-grader identifies and evaluates relevant elements in the rendered HTML). We highly recommend you review the specified DOM structure before starting to code. 4. You need to setup a local HTTP server in the root (hw2-skeleton) folder to run your D3 visualizations, as discussed in the D3 lecture (OMS students: the video “Week 5 – Data Visualization for the Web (D3) – Prerequisites: JavaScript and SVG”. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. (for more details, see link). 5. All d3*.js files in the lib folder must be referenced using relative paths, e.g., “../lib/” in your html files. For example, if the file “Q2/submission.html” uses d3, its header should contain:It is incorrect to use an absolute path such as:6. For questions that require reading from a dataset, use a relative path to read in the dataset file. For example, suppose a question reads data from earthquake.csv, the path should simply be “earthquake.csv” and NOT an absolute path such as “C:/Users/polo/hw2-skeleton/Q/earthquake.csv”. 7. You can and are encouraged to decouple the style, functionality and markup in the code for each question. That is, you can use separate files for CSS, JavaScript and html. 8 Version 0 Q2 [15 points] Force-directed graph layout Goal Create a network graph shows relationships between games in D3. Use interactive features like pinning nodes to give the viewer some control over the visualization. Technology D3 Version 5 (included in the lib folder) Chrome v92.0 (or higher): the browser for grading your code Python http server (for local testing) Allowed Libraries D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. On Gradescope, these libraries are provided for you in the auto-grading environment. Deliverables [Gradescope] Q2.(html/js/css): The HTML, JavaScript, CSS to render the graph. Do not include the D3 libraries or board_games.csv dataset. You will experiment with many aspects of D3 for graph visualization. To help you get started, we have provided the Q2.html file (in the Q2 folder) and an undirected graph dataset of boardgames, board_games.csv file (in the Q2 folder). The dataset for this question was inspired by a Reddit post about visualizing boardgames as a network, where the author calculates the similarity between board games based on categories and game mechanics where the edge value between each board game (node) is the total weighted similarity index. This dataset has been modified and simplified for this question and does not fully represent actual data found from the post. The provided Q2.html file will display a graph (network) in a web browser. The goal of this question is for you to experiment with the visual styling of this graph to make a more meaningful representation of the data. Here is a helpful resource (about graph layout) for this question. Note: You can submit a single Q2.html that contains all the css and js components; or you can split Q2.html into Q2.html, Q2.css, and Q2.js. 1. [2 points] Adding node labels: Modify Q2.html to show the node label (the node name, e.g., the source) at the top right of each node in bold. If a node is dragged, its label must move with it. 2. [3 points] Styling edges: Style the edges based on the “value” field in the links array: 1. If the value of the edge is equal to 0 (similar), the edge should be gray, thick, and solid (The dashed line with zero gap is not considered as solid). 2. If the value of the edge is equal to 1 (not similar), the edge should be green, thin, and dashed. 3. [3 points] Scaling nodes: a. [1.5 points] Scale the radius of each node in the graph based on the degree of the node (you may try linear or squared scale, but you are not limited to these choices). Note: Regardless of which scale you decide to use, you should avoid extreme node sizes, which will likely lead to low-quality visualization (e.g., nodes that are mere points, barely visible, or of huge sizes with overlaps). Note: D3 v5 does not support d.weight (which was the typical approach to obtain node degree in D3 v3). You may need to calculate node degrees yourself. Example relevant approach is here. b. [1.5 points] The degree of each node should be represented by varying colors. Pick a meaningful color scheme (hint: color gradients). There should be at least 3 color gradations and it must be visually evident that the nodes with a higher degree use darker/deeper colors and the nodes with lower degrees use lighter colors. You can find example color gradients at Color Brewer. 9 Version 0 4. [6 points] Pinning nodes: a. [2 points] Modify the code so that dragging a node will fix (i.e., “pin”) the node’s position such that it will not be modified by the graph layout algorithm (Note: pinned nodes can be further dragged around by the user. Additionally, pinning a node should not affect the free movement of the other nodes). Node pinning is an effective interaction technique to help users spatially organize nodes during graph exploration. The D3 API for pinning nodes has evolved over time. We recommend reading this post when you work on this sub-question. b. [1 points] Mark pinned nodes to visually distinguish them from unpinned nodes, i.e., show pinned nodes in a different color. c. [3 points] Double clicking a pinned node should unpin (unfreeze) its position and unmark it. When a node is no longer pinned, it should move freely again. IMPORTANT: 1. To pass autograder consistently for part 1 (which tests if a dragged node becomes pinned and retains its position), you may need to increase the radius of highly weighted nodes and reduce their label sizes, so that the nodes can be more easily detected by the autograder’s webdriver mouse cursor. 2. To avoid timeout errors on Gradescope, complete the double click function in part 3 before submitting. 3. If you receive timeout messages for all parts and your code works locally on your computer, verify that you are indeed using the appropriate ids provided in the “add the nodes” section in the skeleton code. 4. D3 v5 does not support the d.fixed method (it was deprecated after D3 v3). For our purposes, it is used as a Boolean value to indicate whether a node has been pinned or not. 5. [1 points] Add GT username: Add your Georgia Tech username (usually includes a mix of letters and numbers, e.g., gburdell3) to the top right corner of the force-directed graph (see example image). The GT username must be a element having the id: “credit” Figure 2: Example of Visualization with pinned node (yellow). Your chart may appear different and can earn full credit if it meets all the stated requirements. 10 Version 0 Q3 [15 points] Line Charts Goal Explore temporal patterns in the BoardGameGeek data using line charts in D3 to compare how the number of ratings grew. Integrate additional data about board game rankings onto these line charts and explore the effect of axis scale choice. Technology D3 Version 5 (included in the lib folder) Chrome v92.0 (or higher): The browser used for grading your code Python HTTP server (for local testing) Allowed Libraries D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. On Gradescope, these libraries are provided for you in the autograder environment. Deliverables [Gradescope] Q3.(html / js / css): The HTML, JavaScript, CSS to render the line charts. Do not include the D3 libraries or boardgame_ratings.csv dataset. Use the dataset in the file boardgame_ratings.csv (in the Q3 folder) to create line charts. Refer to the tutorial for line chart here: Note: You will create four charts in this question, which should be placed one after the other on a single HTML page, like the example image below (Figure 3). Note that your design need NOT be identical to the example; however, the submission must follow the DOM structure specified at the end of this question. IMPORTANT: use the Margin Convention guide for specifying chart dimensions and layout. The autograder will assume this convention has been followed for grading purposes. The SVG viewBox attribute is not recommended to define the position and dimension of your SVG. 1. [5 points] Creating line chart. Create a line chart (Figure 3.1) that visualizes the number of board game ratings from November 2016 to August 2020 (inclusively), for the eight board games: [‘Catan’, ‘Dominion’, ‘Codenames’, ‘Terraforming Mars’, ‘Gloomhaven’, ‘Magic: The Gathering’, ‘Dixit’, ‘Monopoly’]. Use d3.schemeCategory10() to differentiate these board games. Add each board game’s name next to its corresponding line. For the x-axis, show a tick label for every three months. Use D3 axis.tickFormat() and d3.timeFormat() to format the ticks to display abbreviated months and years. For example, Jan 17, Apr 17, Jul 17. (See Figure 3.1 and its x-axis ticks). ● Chart title: Number of Ratings 2016-2020 ● Horizontal axis label: Month. Use D3.scaleTime(). ● Vertical axis label: Num of Ratings. Use a linear scale (for this part). VERY IMPORTANT — Beware of “Silent Date Conversion”: Opening the csv file in an application like Excel may silently modify date strings without warning you, e.g., converting hyphen-separated date strings (e.g., 2016-11-01) into slash-separated date strings (e.g., 11/01/16). Impacted students would see a “correct” line chart visualization on their local computers, but when they upload their code to Gradescope, test cases will fail (e.g., tick labels are not found, lines are not drawn) because the x-scale cannot be computed (as the dates are parsed as NaN). To view the content of a csv file, we recommend you only use text editors (e.g., sublime text, notepad) that do not silently modify csv files. 2. [5 points] Adding board game rankings. Create a line chart (Figure 3.2) for this part (append to the same HTML page) whose design is a variant of what you have created in part 1. Start with your chart from part 1. Modify the code to visualize how the rankings of [‘Catan’, ‘Codenames’, ‘Terraforming Mars’, ‘Gloomhaven’] change over time by adding a symbol with the ranking text on their corresponding lines. Show the symbol for every three months and exactly align with the x-axis ticks in part 1. (See Figure 3.2). Add a legend to explain what this symbol represents next to your chart (See the Figure 3.2 bottom right). 11 Version 0 ● Chart title: Number of Ratings 2016-2020 with Rankings 3. [5 points] Axis scales in D3. Create two line charts (Figure 3.3-1 and 3.3-2) for this part (append to the same HTML page) to try out two axis scales in D3. Start with your chart from part 2. Then modify the vertical axis scale for each chart: the first chart uses the square root scale for its vertical axis (only), and the second chart uses the log scale for its vertical axis (only). Keep the symbols and the symbol legend you implemented in part 2. At the bottom right of the last chart, add your GT username (e.g., gburdell3, see Figure 3.3-2 for example). Note: the horizontal axes should be kept in linear scale, and only the vertical axes are affected. Hint: You may need to carefully set the scale domain to handle the 0s in data. ● First chart (Figure 3.3-1) ○ Chart title: Number of Ratings 2016-2020 (Square root Scale) ○ This chart uses the square root scale for its vertical axis (only) ○ Other features should be the same as part 2. ● Second chart (Figure 3.3-2) ○ Chart title: Number of Ratings 2016-2020 (Log Scale) ○ This chart uses the log scale for its vertical axis (only). Set the y-scale domain minimum to 1. ○ Other features should be the same as part 2. Figure 3.1: Example line chart. Your chart may appear different and can earn full credit if it meets all stated requirements. Figure 3.2: Example of a line chart with rankings. Your chart may appear different and can earn full credit if it meets all stated requirements. 12 Version 0 Figure 3.3-1: Example of a line chart using square root scale. Your chart may appear different and can earn full credit if it meets all stated requirements. Figure 3.3-2: Example of a line chart using log scale. Your chart may appear different and can earn full credit if it meets all stated requirements. 13 Version 0 Note: Your D3 visualization MUST produce the following DOM structure. plot (Q3.1) | +– chart title | +– containing Q3.1 plot elements | +– containing plot lines, line labels | +– x-axis | | | +– (x-axis elements) | | | +– x-axis label | +– y-axis | +– (y-axis elements) | +– y-axis label plot (Q3.2) | +– chart title | +– containing Q3.2 plot elements | | | +– containing plot lines, line labels | | | +– for x-axis | | | | | +– (x-axis elements) | | | | | +– x-axis label | | | +– for y-axis | | | | | +– (y-axis elements) | | | | | +– for y-axis label | | | +– containing plotted symbols, symbol labels | +– containing legend symbol and legend text element(s) plot (Q3.3-1): same as format for Q3.2, with c-1 in ids (e.g., id=”svg-c-1″, etc.) plot (Q3.3-2): same as format for Q3.2, with c-2 in ids (e.g., id=”svg-c-2″, etc.)
Reviews
There are no reviews yet.