An important and critical phase of the data-science process is data collection. Several organizations, including the federal government (data.gov), make their data available to the public for various purposes. Social network applications such as Twitter and Facebook collect enormous amounts of data contributed by their numerous and prolific users. For other businesses, such as Amazon and the NYTimes, data is a significant and valuable byproduct of their main business. Nowadays everybody has data. Most of these data-generating businesses make a subset of their data available to registered users for free, some as downloadable data files (.csv, .xlsx), some as a database (.db, .db3). Sometimes the data that needs to be collected is not in a specific format but is available as web-page content. In this case, a web crawler is typically used to crawl the web pages, scrape the data from them, and extract the information needed. Data-generating organizations have realized the need to share at least a subset of their data with users interested in developing applications. Entire data sets are sold as products.
What is an API? Why is it so important? An Application Programming Interface (API) provides standard, secure, programmatic access to data. An API offers a method for one- or two-way communication among software (as well as hardware) components, as long as they carry the right credentials. The credentials for authenticating programmatic access are defined by another standard, the OAuth (Open Authentication) delegation protocol [6].
LAB 1: WHAT TO DO?
Preparation: Here are the preliminary requirements for the lab.
- Work environment: You will be working on Jupyter with the R kernel. Install Jupyter and the R kernel as instructed in the handout [4]. This will be our Learning Environment. Later on we will explore the Development Environment in RStudio;
- Create an account on Twitter as a user as well as a developer. In the developer site, the tab MyApps is of particular interest. After you create a Twitter developer account, click on MyApps to create a new app called Lab1. Fill in the required fields as per the instructions given there. Once you submit, you should be able to get the OAuth credentials [9], which have four parts: Consumer API key, Consumer API secret, Access Token key, and Access Token secret. All four are needed for working with Twitter programmatically. (Yes, you can auto-tweet, if you know what I mean. 😉)
- The R community has created a package for working with Twitter data called twitteR. Read the vignette by Jeff Gentry [12] about the package he contributed, and work through it. A minimal authentication sketch follows this list.
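Here is a minimal sketch of the authentication step using the twitteR package; the four placeholder strings stand for your own credentials from the previous step.

```r
# Minimal twitteR authentication sketch. Replace the placeholder strings
# with the four OAuth credentials from your Lab1 app (never commit them!).
library(twitteR)

consumer_key    <- "YOUR_CONSUMER_API_KEY"
consumer_secret <- "YOUR_CONSUMER_API_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN_KEY"
access_secret   <- "YOUR_ACCESS_TOKEN_SECRET"

# Registers the credentials for the whole session; subsequent calls such as
# searchTwitter() will use them automatically.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
```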
Part 1: Complete the R handout [5] discussed in lecture, working in Jupyter, and submit the generated notebook.
Part 2: Learning by Repeating Topical Data Analysis
- Navigate to the CDC flu data and analysis sites, flu.gov [11] and FluView [10]. Take a few minutes to review the contents and understand the context.
- The FluView site discusses many variables, or features, that affect the flu situation playing out in front of our eyes.
- Browse through the page, then identify and review the charts and the data (plain tables, .csv) under each chart. Here are the ones with data:
- Influenza national summary (green and yellow chart)
- Positive tested
- Influenza sub-type pie-charts
- Mortality
- Pediatric deaths
- Influenza-like illness
- Flu heat map of USA (Required)
- Now reproduce at least five of the seven flu charts discussed in the flu report for the week of Jan 27th, 2018. Beware: by the time we review it, the report may have moved on to the next week, and the data and the charts may be different. The last one, the heat map, is required for Part 3 of your lab; a plotting sketch for one of the simpler charts follows.
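As a starting point, here is a minimal sketch of reproducing one such chart in R with ggplot2, assuming you have downloaded the influenza-like-illness data from FluView as a .csv; the file and column names below are illustrative and should be adjusted to match your actual download.

```r
# Sketch: redraw the influenza-like-illness (ILI) line chart from a FluView
# CSV download. File and column names are illustrative -- check your file.
library(ggplot2)

ili <- read.csv("ILINet.csv")                 # hypothetical download name
ggplot(ili, aes(x = WEEK, y = PERCENT_ILI)) +
  geom_line(color = "red") +
  labs(x = "Week", y = "% of visits for ILI",
       title = "Influenza-like Illness, national summary")
```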
Part 3: Twitter Application Development
We will develop applications that are data clients for Twitter data. Twitter supports many APIs; we will use the Search API, which is part of the REST API.
- Learning Jupyter, R, and twitteR: All three can be covered by one activity: working through the twitteR package vignette. Type in the R instructions for each example discussed; try them with different names and Twitter hashtags. For our context, the query words need to be associated with influenza or flu. Apply your domain knowledge or intuition to pick an effective search word that yields the tweets most closely tracking the flu (see the short example after this list).
- We are NOT interested in sentiment analysis. We are interested in the sheer number of tweets on a topic associated with flu, or with a related term that you uniquely determine to be an important influencer. You have to choose a good topic. Understand that the Search API we are using can give you only a limited number of tweets per day, and only a sampling of all the tweets. You will collect at least 20,000 tweets (Hmm... how could we categorize them?). Group them by geolocation, as in the Google Maps API (one more API), and plot them on a map of the USA. Map the geolocations to states, and color the states according to the number of tweets or mentions per state.
- Input: a search word or hashtag related to flu. Data-client processing: obtain and group tweets by location into the categories mentioned in FluView. Output: plot them on the map of states, coloring each state by the strength of the signal.
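Here is a minimal sketch of the search step, assuming the authentication from the Preparation section has been completed; the query term "#flu" is just an example, and the choice of term is yours.

```r
# Sketch: fetch a batch of tweets for a candidate flu-related query term
# and inspect the raw count. Assumes setup_twitter_oauth() has been run.
library(twitteR)

tweets <- searchTwitter("#flu", n = 500, lang = "en")  # pick your own term
length(tweets)                                         # sheer number retrieved
head(twListToDF(tweets)[, c("screenName", "created", "text")])
```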
Issue 1: Of course, there is an issue with the location metadata. It is not available (N/A) if the user hides his/her location; only about 1% of tweets carry geolocations. Then how can we get a set of locations?
Here is a verified approach using functions from twitteR:
- Convert the search-result tweets into a dataframe
- Look up the users behind the screen names in this dataframe
- Convert these user objects into another dataframe
- Keep only the users (user names) with location info
- Get the geocodes of the locations from this dataframe
- Hints on twitteR functions you may need: twListToDF, lookupUsers, geocode. Look up twListToDF and lookupUsers in the twitteR manual; geocode comes from the ggmap package. The sketch below strings these steps together.
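A sketch of the five steps above, under the assumption that authentication is already set up; note that geocode() is from ggmap, not twitteR, and may require a Google Maps API key.

```r
# Sketch of the verified approach: tweets -> users -> locations -> geocodes.
library(twitteR)
library(ggmap)   # provides geocode(); may require a Google Maps API key

tweets    <- searchTwitter("flu", n = 1000)             # step 0: search
tweets_df <- twListToDF(tweets)                         # 1. tweets -> dataframe
users     <- lookupUsers(unique(tweets_df$screenName))  # 2. screen names -> users
users_df  <- twListToDF(users)                          # 3. users -> dataframe
users_df  <- subset(users_df, location != "")           # 4. keep rows with a location
geo       <- geocode(users_df$location)                 # 5. location text -> lon/lat
head(cbind(users_df["location"], geo))
```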
- Compare: Now compare your map with the map from the CDC. You can do that side by side in a Jupyter notebook by running the respective R commands; a sketch of the state-coloring step follows.
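For the state map itself, here is one possible sketch using ggplot2's state polygons (backed by the maps package); `state_counts` is a hypothetical dataframe you would build by mapping the geocodes to states and counting tweets per state.

```r
# Sketch: color US states by tweet count. `state_counts` is assumed to have
# columns `state` (lowercase state name) and `n` (tweets per state).
library(ggplot2)

us    <- map_data("state")                  # polygons; needs the maps package
choro <- merge(us, state_counts,
               by.x = "region", by.y = "state", all.x = TRUE)
choro <- choro[order(choro$order), ]        # restore polygon drawing order
ggplot(choro, aes(long, lat, group = group, fill = n)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(low = "yellow", high = "red", na.value = "grey90") +
  labs(fill = "Tweets", title = "Flu-related tweets by state")
```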
- Iterate: Try the comparison with different query words related to flu. Keep your words secret.
- Bundle all the work in a Jupyter notebook for submission, even if you work in RStudio. Make sure you document your application development using markdown cells.