SOFT8033 ASSIGNMENT 2: SPARK SQL & SPARK GRAPHFRAMES


BACKGROUND.

It's been a few weeks already since you started your short-term internship in the Data Analytics Department of the start-up OptimiseYourJourney, which will enter the market next year with a clear goal in mind: leverage Big Data technologies to improve the user experience in transportation. Your contribution in Assignment 1 has proven the potential OptimiseYourJourney can obtain by applying MapReduce to analyse large-scale public transportation datasets such as the one from the New York City Bike Sharing System: https://www.citibikenyc.com/

In the department meeting that has just finished, your boss was particularly happy, again:

  • The very same dataset from Assignment 1 (let's call it my_dataset_1) provides an opportunity to leverage other large-scale data analysis libraries, such as Spark SQL.
  • The graph structure of the dataset allows you to explore the potential of Spark GraphFrames, a small Spark library specialised in the parallel execution of classical graph algorithms. To do so, two small graph examples (let's call them my_dataset_2 and my_dataset_3) are provided to explore the following classical algorithms (a minimal sketch of both follows this list):
    o Dijkstra, for finding the shortest path from a source node to the remaining nodes.
    o PageRank, for assigning a value to each node based on its neighbourhood.
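As a rough illustration of what GraphFrames offers, here is a minimal sketch. The toy graph, column names and parameter values are assumptions made purely for illustration; they are not part of the assignment datasets. Note also that the built-in shortestPaths call computes unweighted hop counts, so a weighted Dijkstra would have to be implemented on top of the edge weights yourself.

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()

    # A toy 3-node graph, purely for illustration (not my_dataset_2 / my_dataset_3).
    vertices = spark.createDataFrame([("0",), ("1",), ("2",)], ["id"])
    edges = spark.createDataFrame(
        [("0", "1"), ("1", "0"), ("1", "2"), ("2", "1")], ["src", "dst"])
    g = GraphFrame(vertices, edges)

    # PageRank: assigns a value to each node based on its neighbourhood.
    pr = g.pageRank(resetProbability=0.15, maxIter=10)
    pr.vertices.select("id", "pagerank").show()

    # shortestPaths: hop-count distances from every node to the chosen landmarks.
    # This is unweighted, so a weighted Dijkstra must be coded separately.
    sp = g.shortestPaths(landmarks=["0"])
    sp.select("id", "distances").show()

Running GraphFrames requires the graphframes package to be available to Spark (for example via spark-submit --packages); how it is provided depends on your environment.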

DATASET 1:

This dataset occupies ~80MB and contains 73 files. Each file contains all the trips registered in the CitiBike system for a concrete day:

  • ….csv => All trips registered on the 1st of May of 2019.
  • ….csv => All trips registered on the 2nd of May of 2019.
  • ….csv => All trips registered on the 12th of July of 2019.

Altogether, the files contain 444,110 rows. Each row contains the following fields (a schema sketch for loading them appears after the field list):

start_time, stop_time, trip_duration, start_station_id, start_station_name, start_station_latitude, start_station_longitude, stop_station_id, stop_station_name, stop_station_latitude, stop_station_longitude, bike_id, user_type, birth_year, gender, trip_id

  • (00) start_time: A String representing the time the trip started at, in the format <%Y/%m/%d %H:%M:%S>. Example: 2019/05/02 10:05:00
  • (01) stop_time: A String representing the time the trip finished at, in the format <%Y/%m/%d %H:%M:%S>. Example: 2019/05/02 10:10:00
  • (02) trip_duration: An Integer representing the duration of the trip. Example: 300
  • (03) start_station_id: An Integer representing the ID of the CitiBike station the trip started from. Example: 150
  • (04) start_station_name: A String representing the name of the CitiBike station the trip started from. Example: E 2 St & Avenue C
  • (05) start_station_latitude: A Float representing the latitude of the CitiBike station the trip started from. Example: 40.7208736
  • (06) start_station_longitude: A Float representing the longitude of the CitiBike station the trip started from. Example: -73.98085795
  • (07) stop_station_id: An Integer representing the ID of the CitiBike station the trip stopped at. Example: 150
  • (08) stop_station_name: A String representing the name of the CitiBike station the trip stopped at. Example: E 2 St & Avenue C
  • (09) stop_station_latitude: A Float representing the latitude of the CitiBike station the trip stopped at. Example: 40.7208736
  • (10) stop_station_longitude: A Float representing the longitude of the CitiBike station the trip stopped at. Example: -73.98085795
  • (11) bike_id: An Integer representing the ID of the bike used in the trip. Example: 33882
  • (12) user_type: A String representing the type of user using the bike (it can be either Subscriber or Customer). Example: Subscriber
  • (13) birth_year: An Integer representing the birth year of the user using the bike. Example: 1990
  • (14) gender: An Integer representing the gender of the user using the bike (0 => unknown; 1 => male; 2 => female). Example: 2
  • (15) trip_id: An Integer representing the ID of the trip. Example: 190
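As referenced above, here is a minimal sketch of how these sixteen fields could be declared as an explicit schema and loaded with the Spark SQL read operation. The dataset path and the reader options (comma-separated files with no header) are assumptions made for illustration; adjust them to your own environment.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType, FloatType)

    # Hypothetical location of my_dataset_1; adjust to your own setup.
    my_dataset_dir = "FileStore/tables/my_dataset_1/"

    spark = SparkSession.builder.appName("A02").getOrCreate()

    my_schema = StructType([
        StructField("start_time", StringType(), True),
        StructField("stop_time", StringType(), True),
        StructField("trip_duration", IntegerType(), True),
        StructField("start_station_id", IntegerType(), True),
        StructField("start_station_name", StringType(), True),
        StructField("start_station_latitude", FloatType(), True),
        StructField("start_station_longitude", FloatType(), True),
        StructField("stop_station_id", IntegerType(), True),
        StructField("stop_station_name", StringType(), True),
        StructField("stop_station_latitude", FloatType(), True),
        StructField("stop_station_longitude", FloatType(), True),
        StructField("bike_id", IntegerType(), True),
        StructField("user_type", StringType(), True),
        StructField("birth_year", IntegerType(), True),
        StructField("gender", IntegerType(), True),
        StructField("trip_id", IntegerType(), True),
    ])

    # Creation operation: read the 73 CSV files into a single DataFrame.
    inputDF = spark.read.format("csv") \
                        .option("header", "false") \
                        .schema(my_schema) \
                        .load(my_dataset_dir)

    inputDF.printSchema()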

DATASET 2:

This dataset consists of the file tiny_graph.txt, which contains 26 directed edges (indeed, 13 edges, each stored once in each direction) in a graph with 8 nodes.

DATASET 3:

This dataset consists of the file tiny_graph.txt, which contains 18 directed edges (indeed, 9 edges, each stored once in each direction) in a graph with 6 nodes.
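The exact line format of the graph files is not reproduced in this brief, so the sketch below assumes a plain "source target weight" triple per line; the file name, delimiter and weight column are all assumptions. It only illustrates one way of turning such an edge list into a GraphFrame.

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()

    # Assumed format: "source target weight", one directed edge per line.
    raw = spark.read.text("tiny_graph.txt")
    edges = raw.selectExpr(
        "split(value, ' ')[0] AS src",
        "split(value, ' ')[1] AS dst",
        "CAST(split(value, ' ')[2] AS DOUBLE) AS weight",
    )

    # Derive the vertex set from the endpoints of the edges.
    vertices = edges.selectExpr("src AS id") \
                    .union(edges.selectExpr("dst AS id")) \
                    .distinct()

    g = GraphFrame(vertices, edges)
    print(g.edges.count(), g.vertices.count())  # e.g. 26 and 8 for my_dataset_2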

TASKS / EXERCISES.

The tasks / exercises to be completed as part of the assignment are described in the next pages:

  • The following exercises are placed in the folder my_code:
    1. A02_Part1/A02_Part1.py
    2. A02_Part2/A02_Part2.py
    3. A02_Part3/A02_Part3.py
    4. A02_Part4/A02_Part4.py
    5. A02_Part5/A02_Part5.py

Marks are as follows:

  1. A02_Part1/A02_Part1.py => 18 marks
  2. A02_Part2/A02_Part2.py => 18 marks
  3. A02_Part3/A02_Part3.py => 18 marks
  4. A02_Part4/A02_Part4.py => 18 marks
  5. A02_Part5/A02_Part5.py => 28 marks

Tasks:

  1. A02_Part1/A02_Part1.py
  2. A02_Part2/A02_Part2.py
  3. A02_Part3/A02_Part3.py

Complete the function my_main of the Python program.

Do not modify the name of the function nor the parameters it receives.

The entire work must be done within Spark SQL (a minimal skeleton consistent with these requirements is sketched after the list):

  • The function my_main must start with the creation operation read presented above, loading the dataset into Spark SQL.
  • The function my_main must finish with the action operation collect, gathering and printing on the screen the result of the Spark SQL job.
  • The function my_main must not contain any action operation collect other than the one appearing at the very end of the function.
  • The resVAL iterator returned by collect must be printed straight away; you cannot edit it to alter its format for printing.
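As mentioned above, a minimal skeleton consistent with these requirements might look as follows. The parameters of my_main, the schema variable and the example query (counting trips per start station) are placeholders for illustration only, since the concrete exercise statements are not reproduced in this brief.

    def my_main(spark, my_dataset_dir, my_schema):
        # (1) Creation: the read operation loading the dataset into Spark SQL.
        inputDF = spark.read.format("csv") \
                            .option("header", "false") \
                            .schema(my_schema) \
                            .load(my_dataset_dir)

        # (2) Transformations only: no intermediate collect is allowed here.
        resultDF = inputDF.groupBy("start_station_name") \
                          .count() \
                          .orderBy("count", ascending=False)

        # (3) The single action operation collect, at the very end.
        resVAL = resultDF.collect()

        # (4) Print the collected rows straight away, without reformatting them.
        for item in resVAL:
            print(item)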
  4. A02_Part4/A02_Part4.py

Complete the function compute_page_rank of the Python program.

Do not modify the name of the function nor the parameters it receives.

The function must return a dictionary with (key, value) pairs, where (a plain-Python sketch follows the list):

  • Each key represents a node id.
  • Each value represents the pagerank value computed for this node id.
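Since the body of A02_Part4.py is not reproduced here, the sketch below is only a plain-Python illustration of a PageRank iteration that returns such a dictionary. The parameter names, the adjacency-dictionary input and the damping/teleport convention are assumptions; the real function may receive the graph and its parameters in a different form.

    def compute_page_rank(edges, reset_prob=0.15, max_iterations=10):
        # edges: assumed adjacency dict, node id -> list of node ids it links to.
        # Returns a dict {node_id: pagerank_value}.
        nodes = set(edges.keys())
        for targets in edges.values():
            nodes.update(targets)

        n = len(nodes)
        rank = {node: 1.0 / n for node in nodes}

        for _ in range(max_iterations):
            # Every node starts each round with the teleport contribution.
            new_rank = {node: reset_prob / n for node in nodes}
            for node, targets in edges.items():
                if not targets:
                    # Convention assumed here: dangling nodes simply leak their rank.
                    continue
                share = (1.0 - reset_prob) * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            rank = new_rank

        return rank

    # Toy usage (a directed triangle), purely for illustration:
    print(compute_page_rank({"a": ["b"], "b": ["c"], "c": ["a"]}))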
  5. A02_Part5/A02_Part5.py

Complete the function my_main of the Python program.

Do not modify the name of the function nor the parameters it receives.

The entire work must be done within Spark SQL:

  • The function my_main must start with the creation operation read presented above, loading the dataset into Spark SQL.
  • The function my_main must finish with the action operation collect, gathering and printing on the screen the result of the Spark SQL job.
  • The function my_main must not contain any action operation collect other than the one appearing at the very end of the function.
  • The resVAL iterator returned by collect must be printed straight away; you cannot edit it to alter its format for printing.
