Project Overview
The goal of the final project is for you to showcase what you've learned in 507 regarding:
- Accessing data via web APIs, including those that require authentication
- Accessing data via scraping
- Accessing data efficiently and responsibly using caching
- Using a database to store and access relational data
- Using basic Python data structures and operations to analyze and process data in interesting ways
- Using unit tests to verify that data access, storage, and processing works as designed
- Using a presentation tool or framework to present data to a user
- Supporting basic interactivity by allowing a user to choose among different data presentation options
Here are a couple of examples that would be reasonable final projects:
- A program that lets a user choose a city and see the average ratings for different restaurant types (e.g., bar, breakfast, Indian, Mediterranean) from Google, Yelp, and OpenTable as plotly bar or scatter charts.
- A program that aggregates crime data from https://spotcrime.com/mi/ann+arbor/daily and allows a user to select one or more crime types to see a graph of crime frequency by month, either for a single year or comparing across several years. Data is displayed using HTML tables within a Flask App.
Project Components
There are several components that your project must contain. Each of these is detailed in this section.
Data Sources
You must select data sources that, combined, give you a challenge score of at least 8. Additionally, at least one of your data sources must be either a Web API that requires authorization or a website where you crawl and scrape multiple pages. Here's how the scoring works:
Data Source | Example | Challenge Score*** |
Web API you've used before | Twitter, iTunes, newsapi.org | 2 |
Web API you haven't used before that requires no authorization | Wikipedia, Google Books | 3 |
Web API you haven't used before that requires API key or HTTP Basic authorization | Yelp Fusion, Open Movie Database | 4 |
Web API you haven't used before that requires OAuth | OpenTable, Reddit, Facebook, many more | 6 |
Scraping a page/site you've worked with before** | nps.gov, si.umich.edu | 1 |
Scraping a new single page** | So many! | 4 |
Crawling [and scraping] multiple pages in a site you haven't used before | So many! | 8 |
CSV or JSON file you haven't used before with > 1000 records | Dataset from data.gov | 2 |
Multiple related CSV or JSON files with at least one file containing > 1000 records | Python Questions from Stack Overflow | 4 |
**: If you choose scraping a new single page, you can only use this option for one of your project sources (i.e., you can't scrape 2 pages you haven't scraped before and count it as 8 challenge points).
***: The challenge scores listed here are a guideline, but specific sources may be determined to be more or less challenging depending on the details of the source and how you're planning to use it.
Required options: you must use at least one Web API that requires authorization, or crawling and scraping multiple pages, as one of your data sources.
From each source, you also need to capture at least 100 records (for CSV/JSON sources you need to capture at least 1000), and each record must have at least 5 fields associated with it.
If you have a source you'd like to use that you don't think fits neatly into one of these categories, consult with your GSI.
Data Access and Storage
You will need to create a database to store your data. Your database must have at least two tables, and there must be at least one relation (primary key to foreign key) between the two tables. Your data processing code (see below) must draw data from the database (i.e., not from the API/web page/CSV or from the cache).
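As a rough illustration of the two-table requirement, here is a minimal sketch that assumes SQLite via Python's built-in sqlite3 module (the assignment does not name a database engine). The table names, column names, and sample rows are hypothetical and borrow the restaurant example from the overview.

```python
import sqlite3


def build_database(conn):
    """Create (or rebuild) two related tables. The schema is hypothetical."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        DROP TABLE IF EXISTS Restaurants;
        DROP TABLE IF EXISTS Categories;
        CREATE TABLE Categories (
            Id INTEGER PRIMARY KEY,
            Name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE Restaurants (
            Id INTEGER PRIMARY KEY,
            Name TEXT NOT NULL,
            Rating REAL,
            CategoryId INTEGER NOT NULL REFERENCES Categories(Id)
        );
    """)


def load_sample_rows(conn):
    """Insert a few made-up rows; a real project would read from its cache."""
    conn.executemany(
        "INSERT INTO Categories (Id, Name) VALUES (?, ?)",
        [(1, "bar"), (2, "breakfast")],
    )
    conn.executemany(
        "INSERT INTO Restaurants (Name, Rating, CategoryId) VALUES (?, ?, ?)",
        [("Alpha", 4.0, 1), ("Beta", 3.0, 1), ("Gamma", 5.0, 2)],
    )
    conn.commit()


def average_rating_by_category(conn):
    """Join the two tables through the foreign key and average the ratings."""
    return conn.execute("""
        SELECT c.Name, AVG(r.Rating)
        FROM Restaurants AS r
        JOIN Categories AS c ON r.CategoryId = c.Id
        GROUP BY c.Name
        ORDER BY c.Name
    """).fetchall()
```

Calling build_database on a connection, then load_sample_rows, then average_rating_by_category returns one (category, average) pair per category. Because build_database drops and recreates the tables, running it again rebuilds the database from scratch.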
If you are working with APIs or web pages you must also cache the raw results (JSON or HTML) you fetch from the source. Your code that writes data into your database must go through the cache when building the database.
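One possible shape for such a cache is sketched below. It assumes a single JSON file keyed by URL and uses the standard library's urllib.request as the downloader; the file name cache.json and all function names are hypothetical, and your own project may use a different HTTP library or cache key.

```python
import json
import os
import urllib.request

CACHE_FILE = "cache.json"  # hypothetical file name


def load_cache(path=CACHE_FILE):
    """Return the cache dictionary stored on disk, or an empty one."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_cache(cache, path=CACHE_FILE):
    """Write the whole cache dictionary back to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f)


def fetch_url(url):
    """Download the raw text at url. Only called on a cache miss."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def get_with_cache(url, cache, fetch=fetch_url, path=CACHE_FILE):
    """Return the raw result for url, fetching and saving it only if needed."""
    if url not in cache:
        cache[url] = fetch(url)
        save_cache(cache, path)
    return cache[url]
```

Code that builds the database would call get_with_cache for every page or API request, so a rebuild reuses the stored JSON or HTML instead of hitting the source again. Passing the downloader in as the fetch parameter also makes it easy to substitute a stub in unit tests.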
As part of grading, we may use your code to rebuild the database (so this should be an option supported by your code) or ask you to demonstrate this capability.
Data Processing
This is largely up to you, but you need to do whatever is necessary to support the data presentation(s) your program provides. This will probably involve things like creating dictionaries to collect sums or averages within a category (e.g., instances of crime by type, review scores by restaurant type).
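To make the dictionary idea concrete, here is a small sketch of the two kinds of summaries mentioned above. The function names and the shape of the input rows are assumptions; in your project the rows would come from database queries.

```python
from collections import Counter, defaultdict


def average_by_category(records):
    """Average the scores in (category, score) pairs, grouped by category."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum, count]
    for category, score in records:
        totals[category][0] += score
        totals[category][1] += 1
    return {
        category: total / count
        for category, (total, count) in totals.items()
    }


def count_by_type(type_names):
    """Count how many times each type name occurs in a list."""
    return dict(Counter(type_names))
```

For example, average_by_category([("bar", 4.0), ("bar", 3.0), ("Indian", 5.0)]) returns {"bar": 3.5, "Indian": 5.0}, a dictionary that can be handed directly to a charting or table-rendering function.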
Unit Testing
You must write unit tests to show that the data access, storage, and processing components of your project are working correctly. You must create at least 3 test cases and use at least 15 assertions or calls to fail(). Your tests should show that you are able to access data from all of your sources, that your database is correctly constructed and can satisfy queries that are necessary for your program, and that your data processing produces the results and data structures you need for presentation.
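The sketch below shows what one such test case might look like with Python's unittest module, assuming an SQLite database. The schema, the create_tables and count_by_type helpers, and the sample rows are all hypothetical stand-ins for your own code.

```python
import sqlite3
import unittest


def create_tables(conn):
    """Hypothetical schema: crimes linked to crime types by a foreign key."""
    conn.executescript("""
        CREATE TABLE CrimeTypes (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
        CREATE TABLE Crimes (
            Id INTEGER PRIMARY KEY,
            Month INTEGER NOT NULL,
            TypeId INTEGER NOT NULL REFERENCES CrimeTypes(Id)
        );
    """)


def count_by_type(conn):
    """Return a dictionary mapping crime type name to number of crimes."""
    rows = conn.execute("""
        SELECT t.Name, COUNT(*)
        FROM Crimes AS c
        JOIN CrimeTypes AS t ON c.TypeId = t.Id
        GROUP BY t.Name
    """)
    return dict(rows)


class TestDatabase(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database for every test keeps tests independent.
        self.conn = sqlite3.connect(":memory:")
        create_tables(self.conn)
        self.conn.executemany(
            "INSERT INTO CrimeTypes (Id, Name) VALUES (?, ?)",
            [(1, "Theft"), (2, "Vandalism")],
        )
        self.conn.executemany(
            "INSERT INTO Crimes (Month, TypeId) VALUES (?, ?)",
            [(1, 1), (1, 1), (2, 2)],
        )

    def tearDown(self):
        self.conn.close()

    def test_tables_exist(self):
        query = "SELECT name FROM sqlite_master WHERE type = 'table'"
        names = {row[0] for row in self.conn.execute(query)}
        self.assertIn("Crimes", names)
        self.assertIn("CrimeTypes", names)

    def test_count_by_type(self):
        counts = count_by_type(self.conn)
        self.assertEqual(counts["Theft"], 2)
        self.assertEqual(counts["Vandalism"], 1)
        self.assertEqual(len(counts), 2)
```

Saved in a file whose name starts with test_, a class like this can be run with python -m unittest. The two test methods here contain five assertions in total, so a full submission would need more test cases along these lines to reach the required counts.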
Data Presentation
Use a tool or framework to present data to users on demand. The data should be presented in some way other than print() statements that output to the terminal. Your program must be able to produce at least 4 different graphs/displays/presentations. These can be different groupings of data, different graph types, or can differ in other ways (if you're not sure if they're different enough, check with your GSI).
The two options we cover in class that you are most likely to want to use are:
- Provide an interactive command line prompt for the user to choose data/visualization options. Display selected graphs using plotly.
- Create a Flask App that uses HTML links/form elements to prompt the user to choose data/visualization options. Display selected data using HTML tables (or other elements, as long as the output looks good).
- If you're feeling ambitious, you can figure out how to use plotly with Flask.
If you wish to use a different data presentation approach, you should check with your GSI.
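For the command line option, one way to organize the prompt is a dictionary that maps each menu choice to a presentation function. The sketch below is only an outline: the four handlers are hypothetical placeholders that return a description string, where a real handler would query the database and draw the chart with your chosen presentation tool.

```python
def ratings_bar_chart():
    # Placeholder: a real handler would query the database and draw a chart.
    return "bar chart: average rating by restaurant type"


def ratings_scatter_chart():
    return "scatter chart: rating versus number of reviews"


def crimes_by_month_chart():
    return "line chart: crime frequency by month"


def crimes_by_type_chart():
    return "bar chart: crime frequency by type"


# Menu key -> (label shown to the user, function that builds the display).
OPTIONS = {
    "1": ("Average rating by restaurant type", ratings_bar_chart),
    "2": ("Rating versus number of reviews", ratings_scatter_chart),
    "3": ("Crime frequency by month", crimes_by_month_chart),
    "4": ("Crime frequency by type", crimes_by_type_chart),
}


def handle_choice(choice, options=OPTIONS):
    """Run the handler for a menu choice, or return None if unrecognized."""
    entry = options.get(choice.strip())
    if entry is None:
        return None
    _label, action = entry
    return action()


def run_prompt():
    """Loop until the user types 'exit', dispatching each valid choice."""
    while True:
        for key, (label, _action) in OPTIONS.items():
            print(f"{key}. {label}")
        choice = input("Choose an option (or 'exit'): ")
        if choice.strip().lower() == "exit":
            break
        if handle_choice(choice) is None:
            print("Unrecognized option, please try again.")
```

Calling run_prompt() starts the loop. Keeping the dispatch logic in handle_choice, separate from input(), means the choice handling can be covered by unit tests without simulating keyboard input.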
What to Submit
You will be submitting two key things: the project proposal, which will help us give you feedback if your scope needs adjusting, and the final code.
Proposal
Due Nov 19
Describe the idea of what you are going to do and list the data sources (tech points) you are going to use. Submit a half-page to one-page, single-spaced proposal describing your project plan. Your proposal should include:
- The description of what your program is intended to do. What is its purpose and who is it aimed at?
- The data sources you intend to use, along with your self-assessment of the challenge score represented by your data source selection.
- The presentation options you plan to support (what information are you intending to display to users).
- The presentation tool(s) you plan to use.
Your GSI will review your proposal draft and potentially provide feedback (if the proposal is fine, you may not get much feedback!).
After submitting your proposal, any significant changes (e.g., to data sources or presentation plans) will require submission of a Final Project Proposal Revision via Canvas. Follow instructions there for how to notify your GSI that you have submitted a revision. It is strongly recommended that you discuss any proposed changes with your GSI before submitting a revision.
If you change your proposal plan without submitting a revision and obtaining authorization, you risk losing lots of points on your final project grade.
Proposal Rubric (30 points)
Criterion | Poor | Points | OK | Points | Good | Points |
Data sources | Sources are not identified, or are described very poorly. | 0 | Sources are identified, but some information is missing | 4-8 | Sources are clearly identified, with URLs linking to a description of the source | 10 |
Data source challenge score | Data source challenge score is not provided | 0 | Data source challenge score is provided but has errors or does not meet criteria (total >=8) | 4-8 | Data source challenge score is provided, correct, and meets criteria | 10 |
Presentation options identified | Not identified | 0 | Presentation options are identified, but are not clear, are not sufficiently different, or do not meet criteria (options >= 4) | 4-8 | Presentation options are identified, different, and meet criteria | 10 |
Presentation tools identified | Not identified | 0 | Tools are identified, but there are some issues with clarity. | 4-8 | Tools are identified and are appropriate | 10 |
Final Project Submission and Demo
Due Dec 9
Via Canvas, you must submit a link to a GitHub repository containing your final submission. Your GitHub repo must contain a README.md file that gives an overview of your project, including:
- Data sources used, including instructions for a user to access the data sources
- Any other information needed to run the program (e.g., pointer to getting started info for plotly)
- Brief description of how your code is structured, including the names of significant data processing functions (just the 2-3 most important functions, not a complete list) and class definitions. If there are large data structures (e.g., lists, dictionaries) that you create to organize your data for presentation, briefly describe them.
- Brief user guide, including how to run the program and how to choose presentation options.
Your GitHub repo must also contain a requirements.txt file that can be used by the teaching team to set up a virtual environment in which to run your project.
Do not check in any private or secret information (e.g., API keys, passwords), but if you are using an API that requires authentication, please submit authentication information through Canvas so that we don't have to apply for an account.
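One common way to keep keys out of the repository is to read them from an environment variable at run time. The sketch below assumes that approach; the variable name YELP_API_KEY and the function name are hypothetical, and a separate untracked secrets file would work just as well.

```python
import os


def get_api_key(name="YELP_API_KEY"):
    """Read an API key from the environment instead of hard-coding it."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running the program."
        )
    return key
```

With this pattern, your README can tell the teaching team which variable to set, and the key itself never appears in a committed file.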
Demo Sessions
You will sign up to give a short (< 5-minute) demo to your GSI, following a script that we will provide as the deadline approaches. We are planning to hold demo sessions during the class and discussion sessions in the last week of class (week of Dec 9). You will be notified if we decide to change the format of the demo. Note that the project is due at 11:59pm on Monday, 12/9, so you should have your project finished by the Tuesday morning class.
If you are unable to attend a demo session during the scheduled times, please contact the teaching team as soon as possible to make alternative arrangements.