BACKGROUND.
It has already been a few weeks since you started your short-term internship in the Data Analytics Department of the start-up OptimiseYourJourney, which will enter the market next year with a clear goal in mind: to leverage Big Data technologies to improve the user experience in transportation. Your contribution in Assignment 1 has proven the potential OptimiseYourJourney can obtain by applying MapReduce to analyse large-scale public transportation datasets such as the one from the New York City Bike Sharing System: https://www.citibikenyc.com/
In the department meeting that has just finished, your boss was particularly happy, again.
- The very same dataset from Assignment 1 (let's call it my_dataset_1) provides an opportunity to leverage other large-scale data analysis libraries, such as Spark SQL.
- The graph structure of the dataset allows you to explore the potential of Spark GraphFrames, a small library of Spark specialised in the parallel execution of classical graph algorithms. To do so, two small graph examples (let's call them my_dataset_2 and my_dataset_3) are provided to explore the classical algorithms of:
o Dijkstra for finding the shortest path from a source node to the remaining nodes.
o PageRank for assigning a value to each node based on its neighbourhood.
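As a point of reference for the first of these algorithms, the following is a minimal single-machine sketch of Dijkstra in plain Python. It is not the GraphFrames-based solution the assignment asks for; the function name dijkstra, the adjacency-list layout and the 4-node example graph with its weights are all assumptions made for illustration.

```python
import heapq


def dijkstra(adjacency, source):
    """Return {node: shortest distance from source}.

    adjacency maps each node to a list of (neighbour, weight) pairs.
    Nodes that cannot be reached from source are left out of the result.
    """
    distances = {source: 0}
    frontier = [(0, source)]  # min-heap of (distance so far, node)
    while frontier:
        dist, node = heapq.heappop(frontier)
        if dist > distances.get(node, float("inf")):
            continue  # stale heap entry: a shorter path was already found
        for neighbour, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(frontier, (candidate, neighbour))
    return distances


# Hypothetical undirected graph: every edge is listed once in each direction.
example_graph = {
    1: [(2, 1), (3, 4)],
    2: [(1, 1), (3, 2), (4, 6)],
    3: [(1, 4), (2, 2), (4, 3)],
    4: [(2, 6), (3, 3)],
}

shortest = dijkstra(example_graph, 1)
print(shortest)
```

Here node 4 is reached through 1, 2 and 3 (cost 1 + 2 + 3) rather than through the direct but heavier edge from 2 to 4.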
DATASET 1:
This dataset occupies ~80MB and contains 73 files. Each file contains all the trips registered in the CitiBike system on a specific day:
- csv => All trips registered on the 1st of May of 2019.
- csv => All trips registered on the 2nd of May of 2019.
- csv => All trips registered on the 12th of July of 2019.
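The file count can be checked against the date range: one file per day from the 1st of May to the 12th of July of 2019, both inclusive. A short sketch of that arithmetic, using only the dates stated above (the variable names are illustrative):

```python
from datetime import date, timedelta

first_day = date(2019, 5, 1)
last_day = date(2019, 7, 12)

# One entry per calendar day in the range, both ends included.
days = [
    first_day + timedelta(days=offset)
    for offset in range((last_day - first_day).days + 1)
]
print(len(days))
```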
Altogether, the files contain 444,110 rows. Each row contains the following fields:
start_time , stop_time , trip_duration , start_station_id , start_station_name , start_station_latitude , start_station_longitude , stop_station_id , stop_station_name , stop_station_latitude , stop_station_longitude , bike_id , user_type , birth_year , gender , trip_id
- (00) start_time
! A String representing the time the trip started at.
! Format: <%Y/%m/%d %H:%M:%S>
! Example: 2019/05/02 10:05:00
- (01) stop_time
! A String representing the time the trip finished at.
! Format: <%Y/%m/%d %H:%M:%S>
! Example: 2019/05/02 10:10:00
- (02) trip_duration
! An Integer representing the duration of the trip in seconds.
! Example: 300
- (03) start_station_id
! An Integer representing the ID of the CitiBike station the trip started from.
! Example: 150
- (04) start_station_name
! A String representing the name of the CitiBike station the trip started from.
! Example: E 2 St & Avenue C
- (05) start_station_latitude
! A Float representing the latitude of the CitiBike station the trip started from.
! Example: 40.7208736
- (06) start_station_longitude
! A Float representing the longitude of the CitiBike station the trip started from.
! Example: -73.98085795
- (07) stop_station_id
! An Integer representing the ID of the CitiBike station the trip stopped at.
! Example: 150
- (08) stop_station_name
! A String representing the name of the CitiBike station the trip stopped at.
! Example: E 2 St & Avenue C
- (09) stop_station_latitude
! A Float representing the latitude of the CitiBike station the trip stopped at.
! Example: 40.7208736
- (10) stop_station_longitude
! A Float representing the longitude of the CitiBike station the trip stopped at.
! Example: -73.98085795
- (11) bike_id
! An Integer representing the ID of the bike used in the trip.
! Example: 33882
- (12) user_type
! A String representing the type of user using the bike (either Subscriber or Customer).
! Example: Subscriber
- (13) birth_year
! An Integer representing the birth year of the user using the bike.
! Example: 1990
- (14) gender
! An Integer representing the gender of the user using the bike (0 => unknown; 1 => male; 2 => female).
! Example: 2
- (15) trip_id
! An Integer representing the ID of the trip.
! Example: 190
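The field list above can be turned into typed values with a few lines of plain Python. The sketch below assumes each row is a comma-separated line with no header and no quoting, which the description does not state; the row used is assembled from the per-field examples above and is not an actual row of the dataset. The names parse_trip and elapsed_seconds are illustrative.

```python
import csv
from datetime import datetime

FIELD_NAMES = [
    "start_time", "stop_time", "trip_duration",
    "start_station_id", "start_station_name",
    "start_station_latitude", "start_station_longitude",
    "stop_station_id", "stop_station_name",
    "stop_station_latitude", "stop_station_longitude",
    "bike_id", "user_type", "birth_year", "gender", "trip_id",
]
INT_FIELDS = {
    "trip_duration", "start_station_id", "stop_station_id",
    "bike_id", "birth_year", "gender", "trip_id",
}
FLOAT_FIELDS = {
    "start_station_latitude", "start_station_longitude",
    "stop_station_latitude", "stop_station_longitude",
}
TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_trip(line):
    """Split one row and convert its fields to the types listed above."""
    values = next(csv.reader([line]))
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    trip = dict(zip(FIELD_NAMES, values))
    for name in INT_FIELDS:
        trip[name] = int(trip[name])
    for name in FLOAT_FIELDS:
        trip[name] = float(trip[name])
    return trip  # start_time and stop_time stay as Strings, as documented


def elapsed_seconds(trip):
    """Seconds between start_time and stop_time, parsed with TIME_FORMAT."""
    start = datetime.strptime(trip["start_time"], TIME_FORMAT)
    stop = datetime.strptime(trip["stop_time"], TIME_FORMAT)
    return int((stop - start).total_seconds())


# Row assembled from the field examples (assumption: comma-delimited).
example_line = (
    "2019/05/02 10:05:00,2019/05/02 10:10:00,300,"
    "150,E 2 St & Avenue C,40.7208736,-73.98085795,"
    "150,E 2 St & Avenue C,40.7208736,-73.98085795,"
    "33882,Subscriber,1990,2,190"
)
example_trip = parse_trip(example_line)
print(example_trip["trip_duration"], elapsed_seconds(example_trip))
```

For the example values, the difference between stop_time and start_time equals trip_duration, which is consistent with the duration being expressed in seconds.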
DATASET 2:
This dataset consists of the file tiny_graph.txt, which contains 26 edges (in fact, 13 edges, each listed once in each direction) in a graph with 8 nodes.
DATASET 3:
This dataset consists of the file tiny_graph.txt, which contains 18 edges (in fact, 9 edges, each listed once in each direction) in a graph with 6 nodes.
TASKS / EXERCISES.
The tasks / exercises to be completed as part of the assignment are described in the next pages:
- The following exercises are placed in the folder my_code:
- A02_Part1/py
- A02_Part2/py
- A02_Part3/A02_Part3.py
- A02_Part4/py
- A02_Part5/py
Marks are as follows:
- A02_Part1/py => 18 marks
- A02_Part2/py => 18 marks
- A02_Part3/A02_Part3.py => 18 marks
- A02_Part4/py => 18 marks
- A02_Part5/py => 28 marks
Tasks:
- A02_Part1/py
- A02_Part2/py
- A02_Part3/py
Complete the function my_main of the Python program.
Do not modify the name of the function nor the parameters it receives.
The entire work must be done within Spark SQL:
- The function my_main must start with the creation operation read above, which loads the dataset into Spark SQL.
- The function my_main must finish with the action operation collect, gathering the result of the Spark SQL job and printing it to the screen.
- The function my_main must not contain any collect action other than the one appearing at the very end of the function.
- The resVAL iterator returned by collect must be printed straight away; you cannot edit it to alter its format for printing.
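The shape these rules impose can be sketched as follows. This is only an outline under assumptions, not the template shipped with the assignment: the parameters of my_main, the reader options (comma delimiter, no header) and the placeholder query counting trips per start station are all hypothetical. Only standard PySpark calls are used (spark.read, schema with a DDL string, groupBy, count, orderBy, collect).

```python
# Column names and Spark SQL types, following the field list of DATASET 1.
FIELDS = [
    ("start_time", "STRING"), ("stop_time", "STRING"),
    ("trip_duration", "INT"),
    ("start_station_id", "INT"), ("start_station_name", "STRING"),
    ("start_station_latitude", "FLOAT"), ("start_station_longitude", "FLOAT"),
    ("stop_station_id", "INT"), ("stop_station_name", "STRING"),
    ("stop_station_latitude", "FLOAT"), ("stop_station_longitude", "FLOAT"),
    ("bike_id", "INT"), ("user_type", "STRING"),
    ("birth_year", "INT"), ("gender", "INT"), ("trip_id", "INT"),
]
# DDL-formatted schema string, accepted by DataFrameReader.schema.
SCHEMA_DDL = ", ".join(f"{name} {kind}" for name, kind in FIELDS)


def my_main(spark, my_dataset_dir):
    # Hypothetical signature: the real template fixes its own parameters.
    # 1. Creation operation 'read': load the dataset into Spark SQL.
    inputDF = (
        spark.read.format("csv")
        .option("delimiter", ",")   # assumption: comma-separated rows
        .option("header", "false")  # assumption: files carry no header line
        .schema(SCHEMA_DDL)
        .load(my_dataset_dir)
    )

    # 2. Transformations only (placeholder query, not an actual exercise).
    solutionDF = (
        inputDF.groupBy("start_station_name")
        .count()
        .orderBy("start_station_name")
    )

    # 3. The single action: collect, then print each item unchanged.
    resVAL = solutionDF.collect()
    for item in resVAL:
        print(item)
```

The function would be called with an existing SparkSession, for instance one obtained from SparkSession.builder.getOrCreate(), and the path of the folder holding my_dataset_1.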
- A02_Part4/A02_Part4.py
Complete the function compute_page_rank of the Python program.
Do not modify the name of the function nor the parameters it receives.
The function must return a dictionary with (key, value) pairs, where:
- Each key represents a node id.
- Each value represents the PageRank value computed for this node id.
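To illustrate the expected return shape, here is a minimal single-machine sketch of PageRank by power iteration that returns such a dictionary. It is not the assignment's template nor the GraphFrames implementation: the parameters (edges, reset_probability, max_iterations), their defaults and the example graph are assumptions, and scaling the ranks so that they sum to 1 is a choice of this sketch that other implementations may not share.

```python
def compute_page_rank(edges, reset_probability=0.15, max_iterations=20):
    """Return {node_id: rank}, with ranks summing to 1.

    edges maps each node id to the list of nodes it points to
    (hypothetical input layout, chosen for this sketch).
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(max_iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(ranks[node] for node in nodes if not edges.get(node))
        base = (reset_probability + (1 - reset_probability) * dangling) / n
        new_ranks = {node: base for node in nodes}
        # Every other node shares its damped rank among its out-neighbours.
        for node, targets in edges.items():
            if targets:
                share = (1 - reset_probability) * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical star graph: node 1 is linked with nodes 2, 3 and 4,
# each undirected edge listed once in each direction.
star = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}
star_ranks = compute_page_rank(star)
print(star_ranks)
```

In the star example the central node ends up with the largest value, since every other node points to it.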
- A02_Part5/A02_Part5.py
Complete the function my_main of the Python program.
Do not modify the name of the function nor the parameters it receives.
The entire work must be done within Spark SQL:
- The function my_main must start with the creation operation read above, which loads the dataset into Spark SQL.
- The function my_main must finish with the action operation collect, gathering the result of the Spark SQL job and printing it to the screen.
- The function my_main must not contain any collect action other than the one appearing at the very end of the function.
- The resVAL iterator returned by collect must be printed straight away; you cannot edit it to alter its format for printing.