
CSE 6242 – Data and Visual Analytics – HW 3


Spark, Docker, Databricks, AWS and GCP

Download the HW3 Skeleton before you begin.

Many modern-day datasets are huge and truly exemplify "big data". For example, the Facebook social graph is petabytes large (over 1M GB); every day, Twitter users generate over 12 terabytes of messages; and the NASA Terra and Aqua satellites each produce over 300 GB of MODIS satellite imagery per day. These raw data are far too large to even fit on the hard drive of an average computer, let alone to process and analyze. Luckily, a variety of modern technologies allow us to process and analyze such large datasets in a reasonable amount of time. For the bulk of this assignment, you will be working with a dataset of over 1 billion individual taxi trips from the New York City Taxi & Limousine Commission (TLC). Further details on this dataset are available here.

In Q1, you will work with a subset of the TLC dataset to get warmed up with PySpark. Apache Spark is a framework for distributed computing, and PySpark is its Python API. You will use this tool to answer questions such as "what are the top 10 most common trips in the dataset?" You will use your own machine for computation, in an environment defined by a Docker container.

In Q2, you will perform further analysis on a different subset of the TLC dataset using Spark on Databricks, a platform combining datasets, machine learning models, and cloud compute. This part of the assignment will be completed in the Scala programming language, a modern general-purpose language with robust support for functional programming. The Spark distributed computing framework is itself written in Scala.

In Q3, you will use PySpark on AWS with Elastic MapReduce (EMR), and in Q4 you will use Spark on Google Cloud Platform, to analyze even larger samples from the TLC dataset.

Finally, in Q5 you will use Microsoft Azure ML Studio to implement a regression model that predicts automobile prices, using a sample dataset already included in the Azure workspace.

A main goal of this assignment is to give students exposure to a variety of tools that will be useful in the future (e.g., future projects, research, careers). We intentionally include AWS, Azure and GCP (most courses use only one) because we want students to try and compare these platforms as they evolve rapidly. Should students later need to select a cloud platform, they will be able to make more informed decisions and get started right away.

You will find that a number of the computational tasks in this assignment are not very difficult, and that there is quite a bit of "setup" to do before getting to the actual "programming" part of the problem. This design is intentional: for many students, this assignment is the very first time they use any cloud services; they are new to the pay-per-use model, and they have never used a cluster of machines. There are over 1000 students in CSE 6242 (campus and online) and CX 4242 combined, which means students come from a great variety of backgrounds. We wish we could provide every student unlimited AWS credit so that they could try out many services and write more complex programs. Over past offerings of this course, we have been gradually increasing the "programming" part and reducing much of the "setup" (e.g., the use of Docker, Databricks and Jupyter notebooks were major improvements). We will continue to reduce the setup that students need to perform in future offerings of this course.
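Before the setup details below, here is a minimal, hypothetical PySpark sketch of the kind of warm-up query Q1 asks about ("what are the top 10 most common trips?"). The column names PULocationID and DOLocationID follow the TLC trip-record schema used later in this assignment; the session name and the assumption that the Q1 CSV is sitting in the working directory are placeholders, not part of the skeleton.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a local Spark session. In the course Docker image a
    # session may already be available in the notebook as `spark`.
    spark = SparkSession.builder.appName("hw3-warmup").getOrCreate()

    # Placeholder path: substitute the trip-record CSV you actually load.
    trips = spark.read.csv("yellow_tripdata_2019-01_short.csv",
                           header=True, inferSchema=True)

    # "Top 10 most common trips": count trips per (pickup, dropoff) location
    # pair and keep the 10 largest counts.
    (trips
     .groupBy("PULocationID", "DOLocationID")
     .count()
     .orderBy(F.desc("count"))
     .show(10))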
Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment. Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and the instructor team) ran into all sorts of issues, some of which could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all of the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. There is no need to install any additional Java or PySpark dependencies, as they are all bundled as part of the Docker container.

Imagine that your boss gives you a large dataset containing trip information from the New York City Taxi and Limousine Commission (TLC). You are asked to provide summaries for the most common trips, as well as information related to fares and traffic. This information might help in positioning taxis depending on the demand at each location. You are provided with a Jupyter notebook (q1.ipynb) which you will complete using PySpark within the provided Docker image. Be sure to save your work often! If you do not see your notebook in Jupyter, double-check that the file is present in the folder and that your Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter, you can still move forward by clicking the "upload" button in Jupyter and uploading the file. However, if you use this approach, your file will not be saved to disk when you save in Jupyter, so you will need to download your work by going to File > Download as… > Notebook (.ipynb); be sure to download as often as you would normally save.

Note: You will use the yellow_tripdata_2019-01_short.csv dataset. This dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values. Download the data here.

While calculating trip_rate, first get the average trip_distance and the average total_amount for each pair of PULocationID and DOLocationID (using group by). Then take their ratio to get the trip_rate for that pickup-drop pair (an example and sample output are provided).

Consider only trips with a passenger count (passenger_count) greater than 0. Calculate the average fare and tip (tip_amount) for all passenger group sizes, and calculate the tip percent (tip_amount * 100 / fare_amount). Sort the result in descending order of tip percent to obtain the group size that tips the most generously.

For each day of the week present in the data, compute the average speed, sorting the output alphabetically. A day with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Not all days of the week may be present in the data (do not include these missing days of the week in your output). Use date_format along with the appropriate pattern letters to format the day of the week so that it matches the example output.
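The sketch below is a minimal, hypothetical illustration of the trip_rate and tip-percent computations described above; it is not the graded solution, and q1.ipynb defines the exact function signatures and output columns you must produce. The column names (PULocationID, DOLocationID, trip_distance, total_amount, passenger_count, fare_amount, tip_amount) come from the dataset described in this question; the trips variable and the orientation of the trip_rate ratio are assumptions, so verify both against the notebook's sample output.

    from pyspark.sql import functions as F

    # Assumption: `trips` is a DataFrame already loaded from
    # yellow_tripdata_2019-01_short.csv, e.g. via spark.read.csv(..., header=True).

    # trip_rate: average trip_distance and average total_amount per
    # (PULocationID, DOLocationID) pair, then the ratio of the two averages.
    # The ratio's orientation (amount/distance vs distance/amount) is an
    # assumption here; match it to the sample output in q1.ipynb.
    trip_rate = (trips
                 .groupBy("PULocationID", "DOLocationID")
                 .agg(F.avg("trip_distance").alias("avg_distance"),
                      F.avg("total_amount").alias("avg_amount"))
                 .withColumn("trip_rate",
                             F.col("avg_amount") / F.col("avg_distance")))

    # Tip percent: keep trips with passenger_count > 0, average fare and tip
    # per group size, compute tip_amount * 100 / fare_amount on the averages,
    # and sort descending to find the most generous group size.
    tip_percent = (trips
                   .filter(F.col("passenger_count") > 0)
                   .groupBy("passenger_count")
                   .agg(F.avg("fare_amount").alias("avg_fare"),
                        F.avg("tip_amount").alias("avg_tip"))
                   .withColumn("tip_percent",
                               F.col("avg_tip") * 100 / F.col("avg_fare"))
                   .orderBy(F.desc("tip_percent")))

The day-of-week traffic question follows the same groupBy/agg pattern; applying date_format to the pickup timestamp (for example with the "EEEE" pattern) typically yields the full day name such as "Monday".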
Tutorial: First, go over this Spark on Databricks Tutorial to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv[1] using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table (taxi_zone_lookup.csv), which maps a location ID to the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.

Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.

Note: The day of week should be a string with its full name, for example "Monday" instead of the number 1 or "Mon".

List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These preformatted sections also show you the required output format from your Scala code with the necessary columns; while column names can be different, their resulting values must be correct.

VERY IMPORTANT: Use Firefox, Safari or Chrome when configuring anything related to AWS.

You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis. You will be completing a task that could be accomplished on a commodity computer (e.g., a consumer-grade laptop or desktop). However, we would like you to use this exercise as an opportunity to learn distributed computing on Amazon EC2, and to gain experience that will help you tackle more complex problems. The services you will primarily be using are Amazon S3 storage, Amazon Elastic Compute Cloud (EC2) virtual servers, and Amazon Elastic MapReduce (EMR), a managed Hadoop framework. You will create an S3 bucket, run code through EMR, and then store the output in that S3 bucket. For this question, you will use up only a very small fraction of your AWS credit. If you have any issues with the AWS credits or your AWS Educate account, please fill out this form.

Please read the AWS Setup Tutorial to set up your AWS account. Instructions are provided both as a written guide and as a video tutorial.

In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). Further details on this dataset are available here and here. From these pages [1] [2], you can explore the structure of the data; however, you will access the dataset directly through AWS via the code outlined in the homework skeleton. You will be working with two samples of this data, one small and one much larger.

EXTREMELY IMPORTANT: Both datasets are in the US East (N. Virginia) region. Using machines in other regions for computation would incur data transfer charges, so set your region to US East (N. Virginia) from the beginning (not Oregon, which is the default). This is extremely important; otherwise your code may not work, and you may be charged extra.

You work at NYC TLC, and since the company bought a few new taxis, your boss has asked you to locate potential places where taxi drivers can pick up more passengers. Of course, the more profitable the locations are, the better. Your boss also tells you not to worry about short trips for any of your analysis, so only analyze trips that are 2.0 miles or longer. First, find the 20 most popular drop-off locations in the Manhattan borough by finding which of these destinations had the greatest passenger count. Now, analyze all pickup locations, regardless of borough. Bear in mind, your boss is not as savvy with the data as you are and is not interested in location IDs. To make it easy for your boss, provide the Borough and Zone for each of the top 20 pickup locations you determined.

To help you evaluate the correctness of your output, we have provided you with the output for the small dataset. Keep in mind that the small dataset and its output can be thought of as only a single "test case" for the large dataset and cannot cover all possible scenarios. That is, running code on the small dataset and producing the expected results does NOT necessarily mean the same code will produce correct results for the large dataset. Note: Please strictly follow the formatting requirements for your output as shown in the small dataset output file. You can use https://www.diffchecker.com/ to make sure the formatting is correct.

You are provided with a Python notebook (q3_pyspark.ipynb) which you will complete and load into EMR. You are provided with the load_data() function, which loads two PySpark DataFrames. The first is trips, a DataFrame of trip data in which each record refers to one (1) trip. The second is lookup, which maps a LocationID to its information; it can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame. The following functions must be completed for full credit.

VERY IMPORTANT: Once you have implemented all these functions, run the main() function, which is already implemented, and update the line of code to include the name of your output S3 bucket and a location. This function will fail if the output directory already exists, so make sure to change it each time you run the function. Example: final.write.csv('s3://cse6242-bburdell3/output-large3'). Your output file will appear in a folder in your S3 bucket as a CSV file with a name similar to part0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to q3_output.csv for submission. Do not make any other changes to the file.

Hint: Refer to commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed and withColumn; a rough sketch of this style of pipeline is shown below.
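As an illustration of that hint (not the graded solution), here is a hypothetical PySpark sketch of the pipeline: filter out short trips, rank Manhattan drop-offs by passenger count, and report pickups with Borough and Zone instead of location IDs. The trips and lookup DataFrames and the column names follow the description above; the exact ranking metric for pickups and the required function signatures are defined in q3_pyspark.ipynb, so treat the details below as placeholders.

    from pyspark.sql import functions as F

    # Assumption: `trips` and `lookup` are the DataFrames returned by the
    # provided load_data() function.

    # Ignore short trips: keep only trips of 2.0 miles or longer.
    long_trips = trips.filter(F.col("trip_distance") >= 2.0)

    # Top 20 Manhattan drop-off locations, ranked by total passenger count.
    manhattan = lookup.filter(F.col("Borough") == "Manhattan")
    top_dropoffs = (long_trips
                    .join(manhattan,
                          long_trips.DOLocationID == manhattan.LocationID)
                    .groupBy("DOLocationID")
                    .agg(F.sum("passenger_count").alias("pcount"))
                    .sort(F.desc("pcount"))
                    .limit(20))

    # Top 20 pickup locations across all boroughs, reported with Borough and
    # Zone rather than raw location IDs. A plain trip count is used here as a
    # placeholder ranking; the notebook specifies the actual metric.
    top_pickups = (long_trips
                   .groupBy("PULocationID")
                   .count()
                   .sort(F.desc("count"))
                   .limit(20)
                   .join(lookup, F.col("PULocationID") == F.col("LocationID"))
                   .select("Borough", "Zone", "count")
                   .withColumnRenamed("count", "trip_count"))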
Note: Create and use a free workspace instance on Azure ML Studio. Please use your Georgia Tech username (e.g., jdoe3) to log in.

The primary purpose of this question is to introduce you to Microsoft Azure Machine Learning Studio and familiarize you with its basic functionality and typical machine learning workflows. Go through the Automobile price prediction tutorial and create/run ML experiments to complete the following tasks. You will not incur any cost if you save your experiments on Azure. Once you are sure about the results and have reported them, feel free to delete your experiments.

You will manually modify the given file q5_results.csv by adding to it the results from the following tasks (e.g., using a plain text editor).

To summarize, for part d, you MUST exactly follow each step below to run the experiment.

Figure: Property Tab of Partition and Sample Module

Hint: For part 4, follow each of the outlined steps carefully. This should result in 5 blocks in your final workflow (including the Automobile price data (Raw) block).

[1] Graph derived from the NYC Taxi and Limousine Commission
