[Solved] BigData Assignment1-SPARK CORE & SPARK SQL

$25

File Name: BigData_Assignment1-SPARK_CORE_&_SPARK_SQL.zip
File Size: 395.64 KB

SKU: [Solved] BigData Assignment1-SPARK CORE & SPARK SQL Category: Tag:
5/5 - (1 vote)

From the window you can see the sun shinning in a lovely autumn morning. Its Monday, 10am, and you are in the open plan office of a new start-up, OptimiseYourJourney, which will enter the market next year with a clear goal in mind: leverage Big Data technologies for improving the user experience in transportation.

The start-up is at a very early stage, and has no clear product in mind yet. However, they have offered a short-term internship in their Big Data Engineering Department to help them exploring the datasets, technologies and techniques that can be applied in their future products. They do not pay very well (0 per hour), but you see this as a good opportunity to complement your knowledge in the module Big Data Processing you are studying at the moment, so you have decided to give it a go.

In the department meeting that has just finished your boss was particularly excited. During the weekend he was exploring the public datasets provided by the Irish government at https://data.gov.ie/ and he found a transportation dataset named Dublin Bus GPS sample data from Dublin City Council (Insight Project): https://data.gov.ie/dataset/dublin-busgps-sample-data-from-dublin-city-council-insight-project The original dataset contains 40+ millions of GPS data measurements collected from Dublin buses operating in January of 2013. Once extracted, it occupies ~5GB and is available at: http://opendata.dublincity.ie/TrafficOpenData/sir010113-310113.zip

Your boss thinks this dataset provides a great opportunity to explore the potential of Spark Core and Spark SQL in analysing large datasets. He has already cleaned the dataset for you to perform some data analysis on it.

DATASET.

The new dataset (obtained after cleaning the original one) is provided to you in the folder Canvas => 5_Assignments => A01_dataset.zip.

It occupies ~3GB and contains 744 files, one per hour interval and day of the month:

  • 2013010100.csv => Provides the data measurements of 01st of January 2013 in the hour interval [12am, 1am)
  • 2013010101.csv => Provides the data measurements of 01st of January 2013 in the hour interval [1am, 2am)
  • 2013010108.csv => Provides the data measurements of 01st of January 2013 in the hour interval [8am, 9am)
  • 2013010109.csv => Provides the data measurements of 01st of January 2013 in the hour interval [9am, 10am)
  • 2013010123.csv => Provides the data measurements of 01st of January 2013 in the hour interval [11pm, 12am)
  • 2013010200.csv => Provides the data measurements of 02nd of January 2013 in the hour interval [12am, 1am)
  • 2013013123.csv => Provides the data measurements of 31st of January 2013 in the hour interval [11pm, 12am)

Each row of a file contains the following fields:

Date , Bus_Line , Bus_Line_Pattern , Congestion , Longitude , Latitude , Delay , Vehicle , Closer_Stop , At_Stop

  • (00) Date

A String represents the date of the measurement with format

<%Y-%m-%d %H:%M:%S>

Example: 2013-01-01 13:00:02 (01) Bus_Line

An Integer representing the bus line.

Example: 120

  • (02) Bus_Line_Pattern

A String identifier for the sequence of stops scheduled in the bus line (different buses of the same bus line can follow different sequence of stops in different iterations).

Example: 027B1001 (it can also be empty ). (03) Congestion

An Integer representing whether the bus is at a traffic jam (No => 0 / Yes => 1) .

Example: 0

  • (04) Longitude

A Float representing the longitude position of the bus.

Example: -6.269634 (05) Latitude

A Float representing the latitude position of the bus. Example: 53.360504

  • (06) Delay

An integer representing the delay of the bus with respect to its schedule (measured in seconds). It is a negative value if the bus is ahead of schedule.

Example: 90. (07) Vehicle

An integer identifier for the bus vehicle. Example: 33304. (08) Closer_Stop

An integer identifier for the closest bus stop. Example: 7486.

  • (09) At_Stop_Stop

An integer representing whether the bus vehicle is at the bus stop right now (i.e., stopping at it for passengers to hop on / hop off). (No -> 0 and Yes -> 1)

Example: 0.

TASKS / EXERCISES.

The tasks / exercises to be completed as part of the assignment are described in the next pages of the assignments.

  • The following exercises are placed in the folder my_code:
    1. py
    2. py
    3. py
    4. py
    5. py
    6. py
    7. py
    8. py

Each exercise is worth 10 marks.

On each exercise, your task is to complete the function my_main of the Python program. This function is in charge of specifying the Spark Job performing the desired data analysis.

When programming my_main, you can create and call as many auxiliary functions as you need.

Do not alter the Python main entry point (Python Execution Section) of the file.

Do not alter the parameters passed to the function my_main.

The function my_main must start with a creation operation textFile or read loading the dataset to Spark Core and Spark SQL, respectively.

The function my_main must finish with an action operation collect gathering and printing by the screen the result of Spark Core / Spark SQL job.

  • The following exercise is placed in the folder my_report:
  1. A01_report.odt

The report is worth 20 marks.

The report must contain a maximum of 1,000 words.

SUBMISSION DETAILS / SUBMISSION CODE OF CONDUCT.

Submit to Canvas by the 22nd of November, 11:59pm.

  • Submissions up to 1 week late will have 10 marks deducted.
  • Submissions up to 2 weeks late will have 20 marks deducted.

On submitting the assignment you adhere to the following declaration of authorship. If you have any doubt regarding the plagiarism policy discussed at the beginning of the semester do not hesitate in contacting me.

Declaration of Authorship

I, ___YOUR NAME___, declare that the work presented in this assignment titled Assignment 1: Spark Core and Spark SQL is my own. I confirm that:

  • This work was done wholly by me as part of my Msc. In Artificial Intelligence at Cork Institute of Technology.
  • Where I have consulted the published work and source code of others, this is always clearly attributed.
  • Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this assignment source code and report is entirely my own work.

EXERCISE 1.

Lets assume you live in Dublin, and you can see from your window the bus stop where the bus bringing you to work stops at every morning. The bus passes quite regularly, and your company is flexible with your starting time; thus, you can take the bus anytime from 7am to 10am.

Traffic jams in big cities are terrible, specially by the time you go to work, on weekdays at rush hour. This affects the bus schedule, making it difficult to predict its arrival time at the bus stop. Will the bus be on schedule, or delayed/ahead of time?

Moreover, is there a concrete hour in which the bus is, on average, more reliable? For example, of the possible hours where you can take the bus < [7am 8am), [8am 9am), [9am 10am) >, what is the hour with smaller average deviation w.r.t. its scheduled time?

The dataset you got in your hands certainly allows you to infer this information; so lets analyse it to find out.

EXERCISE: FORMAL DEFINITION

Given a program passing by parameters:

  • The bus stop bus_stop (e.g., 279)
  • The bus line bus_line (e.g., 40)
  • A list of hours hours_list (e.g., [07, 08, 09] for the intervals [7am 8am), [8am 9am) and[9am 10), resp.) Your task is to:
  • Compute the average delay of bus_line vehicles when stopping at bus_stop for each hour ofhours_list during weekdays (you must discard any measurement taking place on a Saturday or Sunday).

The format of the solution computed must be:

  • < (H1, AD1), (H2, AD2), , (Hn, ADn) > where:
  • Hi is one of the items of hours_list. E.g., [8am 9am).
  • ADi is the average delay for all bus_line vehicles stopping at bus_stop within that houron a weekday.
  • < (H1, AD1), (H2, AD2), , (Hn, ADn) > are also sorted by increasing order of ADi.

EXAMPLE SMALL DATASET

Lets assume we are interested in bus_line = 40, bus_stop = 279 and hours_list = [08, 09].

Lets consider the following measurements:

Date,Bus_Line,Bus_Line_Pattern,Congestion,Longitude,Latitude,Delay,Vehicle,Closer_Stop,At_Stop

  • 2013-01-19 08:00:36,40,015B1002,0,-6.258078,53.339279,544,33488,279,1 This measurement is not of interest to us: Weekend day.
  • 2013-01-09 08:00:36,41,015B1002,0,-6.258078,53.339279,544,33488,279,1 This measurement is not of interest to us: Wrong bus line.
  • 2013-01-09 08:00:36,40,015B1002,0,-6.258078,53.339279,544,33488,280,1 This measurement is not of interest to us: Wrong bus stop.
  • 2013-01-09 08:00:36,40,015B1002,0,-6.258078,53.339279,544,33488,279,0 This measurement is not of interest to us: The bus is not at stop.
  • 2013-01-09 18:00:36,40,015B1002,0,-6.258078,53.339279,300,33488,279,1 This measurement is not of interest to us: The hour is not in the list.
  • 2013-01-09 08:00:36,40,015B1002,0,-6.258078,53.339279,300,33488,279,1 This measurement is of interest to us. The bus reported 300 seconds delay.
  • 2013-01-09 08:25:36,40,015B1002,0,-6.258078,53.339279,-200,33488,279,1 This measurement is of interest to us. The bus reported -200 seconds ahead.
  • 2013-01-09 08:50:36,40,015B1002,0,-6.258078,53.339279,100,33488,279,1

This measurement is of interest to us. The bus reported 100 seconds delay. Now we have 3 measurements for buses passing in the interval [8am 9am): +300 seconds, -200 seconds, +100 seconds, resp. The average delay time per bus is: + 66.67.

Lets also assume our dataset is so small that it only contains the following measurements:

Date,Bus_Line,Bus_Line_Pattern,Congestion,Longitude,Latitude,Delay,Vehicle,Closer_Stop,At_Stop

  • 2013-01-09 08:00:36,40,015B1002,0,-6.258078,53.339279,300,33488,279,1
  • 2013-01-09 08:25:36,40,015B1002,0,-6.258078,53.339279,-200,33488,279,1
  • 2013-01-09 08:50:36,40,015B1002,0,-6.258078,53.339279,100,33488,279,1
  • 2013-01-19 08:00:36,40,015B1002,0,-6.258078,53.339279,300,33488,279,1
  • 2013-01-19 08:25:36,40,015B1002,0,-6.258078,53.339279,-200,33488,279,1
  • 2013-01-19 08:50:36,40,015B1002,0,-6.258078,53.339279,100,33488,279,1
  • 2013-01-29 08:00:36,40,015B1002,0,-6.258078,53.339279,300,33488,279,1
  • 2013-01-29 08:25:36,40,015B1002,0,-6.258078,53.339279,-200,33488,279,1
  • 2013-01-29 08:50:36,40,015B1002,0,-6.258078,53.339279,100,33488,279,1
  • 2013-01-09 09:00:36,40,015B1002,0,-6.258078,53.339279,100,33488,279,1
  • 2013-01-09 09:25:36,40,015B1002,0,-6.258078,53.339279,-100,33488,279,1
  • 2013-01-09 09:50:36,40,015B1002,0,-6.258078,53.339279,150,33488,279,1

SOLUTION EXAMPLE SMALL DATASET

SPARK CORE

  • solutionRDD:

< (09, 50.0), (08, 66.67) >

  • solutionRDD printed by the screen:

(09, 50.0)

(08, 66.67)

SPARK SQL

  • solutionDF:

< Row(hour=09, averageDelay=50.0), Row(hour=08, averageDelay=66.67) >

  • solutionDF printed by the screen: Row(hour=09, averageDelay=50.0)

Row(hour=08, averageDelay=66.67)

SOLUTION EXAMPLE ENTIRE DATASET

Given a program passing by parameters:

  • The bus stop bus_stop = 279
  • The bus line bus_line = 40
  • A list of hours hours_list = [07, 08, 09]

SPARK CORE

  • solutionRDD:

< (08, 12.09), (07, 134.4), (09, 331.82) >

  • solutionRDD printed by the screen:

(08, 12.09)

  • (07, 134.4)
  • (09, 331.82)

SPARK SQL

  • solutionDF:

< Row(hour=08, averageDelay=12.09), Row(hour=07, averageDelay=134.4), Row(hour=09, averageDelay=331.82) >

  • solutionDF printed by the screen: Row(hour=08, averageDelay=12.09)

Row(hour=07, averageDelay=134.4)

Row(hour=09, averageDelay=331.82)

EXERCISE 2.

We use to think that each physical bus vehicle serves a single bus line. However, this does not need to be the case. It is perfectly possible (indeed, it happens quite often) that a bus a concrete physical bus vehicle (e.g., 33145) serves one bus line (e.g., 25) from 8am to 2pm, and another bus line (e.g., 66) from 3pm to 9pm. A particular use-case of this could be increasing the amount of bus vehicles serving a concrete line at rush hours.

But, is this something that happens to all bus vehicles? Or, indeed something that happens every day?

Moreover, how many lines does a particular bus vehicle serve every day? Is there a particular busy day in which the bus vehicle serve a highest amount of different bus lines?

The dataset you got in your hands certainly allows you to infer this information; so lets analyse it to find out.

EXERCISE: FORMAL DEFINITION

Given a program passing by parameters:

  • The bus vehicle vehicle_id (e.g., 33145) Your task is to:
  • Compute the day(s) of the month in which this vehicle_id is serving the highest amount ofdifferent bus lines, and the IDs of such bus lines.

The format of the solution computed must be:

  • < (D1, LB1), (D2, LB2), , (Dn, LBn) > where:
  • Di is the day of the month serving maximum amount of bus lines.
  • LBi is the list of bus lines served in that concrete day Di.
  • LBi is sorted by increasing order in the bus line Ids.
  • Note that < (D1, LB1), (D2, LB2), , (Dn, LBn) > will as many pairs (Di, LBi) as days serve the very same amount of max bus lines.
  • If < (D1, LB1), (D2, LB2), , (Dn, LBn) > contains more than one pair (Di, LBi), then the pairs are also sorted by increasing order of Di.

EXAMPLE SMALL DATASET

Lets assume we are interested in vehicle_id = 33145.

Lets consider the following measurements:

Date,Bus_Line,Bus_Line_Pattern,Congestion,Longitude,Latitude,Delay,Vehicle,Closer_Stop,At_Stop

  • 2013-01-19 09:00:36,40,015B1002,0,-6.258078,53.339279,544,33245,279,1

This measurement is not of interest to us: Wrong vehicle ID.

  • 2013-01-09 09:00:36,40,015B1002,0,-6.258078,53.339279,544,33145,279,1

This measurement is of interest to us. The vehicle 33145 serves the line 40 on Wednesday 09th of January.

  • 2013-01-09 10:00:36,40,015B1002,0,-6.258078,53.339279,544,33145,279,1

This measurement is of interest to us, but it adds nothing new, as we already knew vehicle 33145 served the line 40 on Wednesday 09th of January.

Lets also assume our dataset is so small that it only contains the following measurements:

Date,Bus_Line,Bus_Line_Pattern,Congestion,Longitude,Latitude,Delay,Vehicle,Closer_Stop,At_Stop

  • 2013-01-04 09:00:36,25,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-08 09:00:36,122,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-08 19:00:36,120,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-09 09:00:36,66,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-09 15:00:36,25,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-09 22:00:36,67,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-17 09:00:36,39,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-17 15:00:36,67,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-17 22:00:36,66,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-28 09:00:36,67,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-28 19:00:36,25,015B1002,0,-6.258078,53.339279,544,33145,279,1
  • 2013-01-30 09:00:36,122,015B1002,0,-6.258078,53.339279,544,33145,279,1

Thus, it is clear that the days our vehicle serves most lines are Wednesday 9th of January (where it serves 3 different lines, 25, 66 and 67), Thursday 17th of January (where it serves 3 different lines, 39, 66 and 67). The rest of days the vehicle serves less than 3 different bus lines.

SOLUTION EXAMPLE SMALL DATASET

SPARK CORE

  • solutionRDD:

< (09, [25, 66, 67]), (17, [39, 66, 67]) >

  • solutionRDD printed by the screen:

(09, [25, 66, 67])

(17, [39, 66, 67])

SPARK SQL

  • solutionDF:

< Row(day=09, sortedBusLineIDs=[25, 66, 67]), Row(day=17, sortedBusLineIDs=[39, 66, 67]) >

  • solutionDF printed by the screen:

Row(day=09, sortedBusLineIDs=[25, 66, 67])

Row(day=17, sortedBusLineIDs=[39, 66, 67])

SOLUTION EXAMPLE ENTIRE DATASET

Given a program passing by parameters:

The bus vehicle vehicle_id = 33145

SPARK CORE

  • solutionRDD:

< (02, [38, 41, 171]), (28, [13, 140, 171]), (29, [40, 171, 238]) >

  • solutionRDD printed by the screen:

(02, [38, 41, 171])

(28, [13, 140, 171])

(29, [40, 171, 238])

SPARK SQL

  • solutionDF:

< Row(day=02, sortedBusLineIDs=[38, 41, 171]), Row(day=28, sortedBusLineIDs=[13,

140, 171]), Row(day=29, sortedBusLineIDs=[40, 171, 238]) >

  • solutionDF printed by the screen:

Row(day=02, sortedBusLineIDs=[38, 41, 171])

Row(day=28, sortedBusLineIDs=[13, 140, 171])

Row(day=29, sortedBusLineIDs=[40, 171, 238])

EXERCISE 3.

We said in exercise 1 that traffic jams are terrible and, by the time of writing exercise 3, we still have not changed our opinion. When it comes to transportation in big cities we all know that traffic congestion can be just past the corner.

But, is this something that happens every day and every hour? Most likely the answer is yes, every day, and every hour, there will be some buses reporting traffic congestion in some of their measurements.

Moreover, if we had a bar, a threshold, to consider what percentage of traffic congestion measurements is too much, what days and hours will go above that threshold?

The dataset you got in your hands certainly allows you to infer this information; so lets analyse it to find out.

EXERCISE: FORMAL DEFINITION

Given a program passing by parameters:

  • The aforementioned bar threshold_percentage (e.g., 10.0) Your task is to:
  • Compute the concrete days and hours having a percentage of measurements reportingcongestions above threshold_percentage.

The format of the solution computed must be:

  • < (D1, H1, P1), (D2, H2, P2), , (Dn, Hn, Pn) > where:
  • Di is the day of the month (e.g., 11 for Friday 11th of January).
  • Hi is the hour interval of the day (e.g, 09 for the interval [9am 10am).
  • Pi is the percentage of congestion measurements (e.g., 3.54).
  • < (D1, H1, P1), (D2, H2, P2), , (Dn, Hn, Pn) > are sorted by decreasing order of Pi.

EXAMPLE SMALL DATASET

Lets assume we are interested in threshold_percentage = 35.0.

Lets also assume our dataset is so small that it only contains the following measurements:

Date,Bus_Line,Bus_Line_Pattern,Congestion,Longitude,Latitude,Delay,Vehicle,Closer_Stop ,At_Stop

  • 2013-01-01 09:15:36,41,015B1002,0,-6.258078,53.339279,200,33145,122,1
  • 2013-01-01 09:30:36,42,015B1002,0,-6.258078,53.339279,100,33245,120,1 2013-01-01 09:45:36,43,015B1002,1,-6.258078,53.339279,200,33345,66,1
  • 2013-01-02 09:15:36,41,015B1002,0,-6.258078,53.339279,200,33145,122,1
  • 2013-01-02 09:30:36,42,015B1002,1,-6.258078,53.339279,100,33245,120,1 2013-01-02 09:45:36,43,015B1002,1,-6.258078,53.339279,200,33345,66,1
  • 2013-01-03 09:15:36,41,015B1002,1,-6.258078,53.339279,200,33145,122,1
  • 2013-01-03 09:30:36,42,015B1002,1,-6.258078,53.339279,100,33245,120,1
  • 2013-01-03 09:45:36,43,015B1002,1,-6.258078,53.339279,200,33345,66,1

As we can see:

  • On Tuesday 1st of January, [9am 10am) there are 3 measurements and 1 report congestion => 33.33%.
  • On Wednesday 2nd of January, [9am 10am) there are 3 measurements and 2 report congestion => 66.67%.
  • On Thursday 3rd of January, [9am 10am) there are 3 measurements and 3 report congestion => 100.00%.

Thus, both Wednesday 2nd [9am 10am) and Thursday 3rd [9am 10am) are above threshold_percentage.

SOLUTION EXAMPLE SMALL DATASET

SPARK CORE

  • solutionRDD:

< (03, 09, 100.0), (02, 09, 66.67) >

  • solutionRDD printed by the screen:

(03, 09, 100.0)

(02, 09, 66.67)

SPARK SQL

  • solutionDF:

< Row(day=03, hour=09, percentage=100.0), Row(day=02, hour=09, percentage=66.67) >

  • solutionDF printed by the screen:

Row(day=03, hour=09, percentage=100.0)

Row(day=02, hour=09, percentage=66.67)

SOLUTION EXAMPLE ENTIRE DATASET

Given a program passing by parameters:

The bar threshold_percentage = 10.0.

SPARK CORE

  • solutionRDD:

< (28, 01, 45.0), (20, 02, 23.06), (20, 01, 21.34) >

  • solutionRDD printed by the screen:

(28, 01, 45.0)

(20, 02, 23.06)

(20, 01, 21.34)

SPARK SQL

  • solutionDF:

< Row(day=28, hour=01, percentage=45.0), Row(day=20, hour=02, percentage=23.06), Row(day=20, hour=01, percentage=21.34) >

  • solutionDF printed by the screen: Row(day=28, hour=01, percentage=45.0)

Row(day=20, hour=02, percentage=23.06)

Row(day=20, hour=01, percentage=21.34)

EXERCISE 4.

When exploring a city as a tourist, there is nothing like a nice walk to observe things at a calm pace. However, in order to get a quick taste of a city, hop on and off to a bus can be the best option, especially when the distance among points of interest is very big.

Lets suppose we are in the street and we divise a bus stop. Tired as we are after a few days of tourism walking long hours, we claim to ourselves: Ok, decision made: for the first 30 minutes today we are not going to walk anymore. What time is it now? 9am. Perfect, so until 9.30am, no walking at all! Instead, look at this bus stop accross the street, lets jump in to the first bus stopping at it, and see where does this bus bring us before 9.30am.

The dataset you got in your hands certainly allows you to infer this information; so lets analyse it to find out.

EXERCISE: FORMAL DEFINITION

Given a program passing by parameters:

  • The current time current_time (e.g., 2013-01-10 08:59:59)
  • A bus stop current_stop (e.g., 1935)
  • The time you allocate to yourself before you start walking again seconds_horizon (e.g.,1800 seconds, which is half an hour).

Your task is to:

  • Compute the first bus stopping at current_stop and the list of bus stops it bring us withinthat time seconds_horizon.

The format of the solution computed must be:

  • < (V, [(T1, S1), (T2, S2), , (Tn, Sn)]) > where:
  • V is the first bus vehicle stopping at current_stop after current_time.
  • [(T1, S1), (T2, S2), , (Tn, Sn)] is the list of other bus stops this bus vehicle stops at before the end of current_time + seconds_horizon, with Si being a bus stop and Ti the time stopping at it.
  • The list includes current_stop as T1, so as to know the time we hopped on the bus.

EXAMPLE SMALL DATASET

Lets assume we are interested in current_time = 2013-01-10 08:59:59, current_stop = 1935 and seconds_horizon = 1800.

Lets consider the following measurements:

Date,Bus_Line,Bus_Line_Pattern,Congestion,Longitude,Latitude,Delay,Vehicle,Closer_Stop ,At_Stop

  • 2013-01-10 08:00:59,40,015B1002,0,-6.258078,53.339279,544,33488,1935,1 This measurement is not of interest to us: Its before our current time.
  • 2013-01-10 10:00:59,40,015B1002,0,-6.258078,53.339279,544,33488,1935,1

This measurement is not of interest to us: Its after our current time + seconds_horizon.

  • 2013-01-10 09:05:59,40,015B1002,0,-6.258078,53.339279,544,33488,279,1 This measurement is not of interest to us: Wrong bus stop.
  • 2013-01-10 09:05:59,50,015B1002,0,-6.258078,53.339279,544,33488,1935,0 This measurement is not of interest to us: Bus does not stop at bus stop.
  • 2013-01-20 09:05:59,40,015B1002,0,-6.258078,53.339279,544,33500,1935,1

This measurement is not of interest to us: Wrong day, so its after our current time + seconds_horizon.

  • 2013-01-10 09:05:59,40,015B1002,0,-6.258078,53.339279,544,33500,1935,1

If this is the earliest bus arriving then it is interest to us: we hop on that bus.

  • 2013-01-10 09:10:59,60,015B1002,0,-6.258078,53.339279,544,33600,1935,1

If this is not the earliest bus arriving then it is not of interest to us: we could have hopped on that bus, but another one arrived earlier, and thus we already hopped on it.

  • 2013-01-10 09:15:59,40,015B1002,0,-6.458078,53.439279,300,33500,244,1

If we hopped in this vehicle, this is of interest to us: our bus is stopping at a new stop.

  • 2013-01-10 09:25:59,40,015B1002,0,-6.658078,53.639279,200,33500,264,1

If we hopped in this vehicle, this is of interest to us: our bus is stopping at a new stop.

  • 2013-01-10 09:35:59,40,015B1002,0,-6.858078,53.839279,100,33500,284,1

Even if we hopped in this vehicle, this is no longer of interest to us: the time is after our current time + seconds_horizon.

Lets also assume our dataset is so small that it only contains the following measurements:

Date,Bus_Line,Bus_Line_Pattern,Congestion,Longitude,Latitude,Delay,Vehicle,Closer_Stop ,At_Stop

  • 2013-01-10 08:00:59,40,015B1002,0,-6.258078,53.339279,544,33488,1935,1
  • 2013-01-10 10:00:59,40,015B1002,0,-6.258078,53.339279,544,33488,1935,1
  • 2013-01-10 09:05:59,40,015B1002,0,-6.258078,53.339279,544,33488,279,1
  • 2013-01-10 09:05:59,50,015B1002,0,-6.258078,53.339279,544,33488,1935,0
  • 2013-01-10 09:05:59,40,015B1002,0,-6.258078,53.339279,544,33500,1935,1
  • 2013-01-10 09:10:59,60,015B1002,0,-6.258078,53.339279,544,33600,1935,1
  • 2013-01-10 09:15:59,40,015B1002,0,-6.458078,53.439279,300,33500,244,1
  • 2013-01-10 09:25:59,40,015B1002,0,-6.658078,53.639279,200,33500,264,1
  • 2013-01-10 09:35:59,40,015B1002,0,-6.858078,53.839279,100,33500,284,1

As we can see:

  • We hopped on bus vehicle 33500, arriving at stop 1935 at 2013-01-10 09:05:59.
  • Bus vehicle 33500 arrived at stop 244 at 2013-01-10 09:15:59.
  • Bus vehicle 33500 arrived at stop 264 at 2013-01-10 09:25:59.
  • Bus vehicle 33500 arrived at stop 284 at 2013-01-10 09:35:59, but that was already outside of our seconds_horizon, so we hopped off the bus at previous stop 264.

Thus, our bus vehicle was 33500 and the stations we are interested at were:

  • 2013-01-10 09:05:59, stop 1935.
  • 2013-01-10 09:15:59, stop 244.
  • 2013-01-10 09:25:59, stop 264.

SOLUTION EXAMPLE SMALL DATASET

SPARK CORE

  • solutionRDD:

< (33500, [(2013-01-10 09:05:59, 1935), (2013-01-10 09:15:59, 244), (2013-01-10 09:25:59, 264)]) >

  • solutionRDD printed by the screen:

(33500, [(2013-01-10 09:05:59, 1935),

(2013-01-10 09:15:59, 244),

(2013-01-10 09:25:59, 264)

]

)

Please note the RDD contains just one item, thus it is printed in just one line, with all the items of the list in a consecutive manner, not one item of the list per line.

SPARK SQL

  • solutionDF:

< Row(vehicleID=33500, stations=[Row(time=2013-01-10 09:05:59, stop=1935), Row(time=2013-01-10 09:15:59, stop=244), Row(time=2013-01-10 09:25:59, stop=264)]) >

  • solutionDF printed by the screen:

Row(vehicleID=33500, stations=[Row(time=2013-01-10 09:05:59, stop=1935),

Row(time=2013-01-10 09:15:59, stop=244),

Row(time=2013-01-10 09:25:59, stop=264)

]

)

Please note the DF contains just one item, thus it is printed in just one line, with all the items of the list in a consecutive manner, not one item of the list per line.

SOLUTION EXAMPLE ENTIRE DATASET

Given a program passing by parameters:

  • The current time current_time = 2013-01-10 08:59:59
  • A bus stop current_stop = 1935
  • The time you allocate to yourself before you start walking again seconds_horizon = 1800

SPARK CORE

  • solutionRDD:

< (38022, [(2013-01-10 09:00:49, 1935), (2013-01-10 09:04:07, 1938), (2013-01-

10 09:16:28, 1457), (2013-01-10 09:20:47, 6094)]) >

  • solutionRDD printed by the screen:

(38022, [(2013-01-10 09:00:49, 1935),

(2013-01-10 09:04:07, 1938),

(2013-01-10 09:16:28, 1457),

(2013-01-10 09:20:47, 6094)

]

)

Please note the RDD contains just one item, thus it is printed in just one line, with all the items of the list in a consecutive manner, not one item of the list per line.

SPARK SQL

  • solutionDF:

< Row(vehicleID=38022, stations=[Row(time=2013-01-10 09:00:49, stop=1935), Row(time=2013-01-10 09:04:07, stop=1938), Row(time=2013-01-10 09:16:28, stop=1457), Row(time=2013-01-10 09:20:47, stop=6094)]) >

  • solutionDF printed by the screen (please note just one line is printed):
  • Row(vehicleID=38022, stations=[Row(time=2013-01-10 09:00:49, stop=1935),

Row(time=2013-01-10 09:04:07, stop=1938),

Row(time=2013-01-10 09:16:28, stop=1457),

Row(time=2013-01-10 09:20:47, stop=6094)

]

)

Please note the DF contains just one item, thus it is printed in just one line, with all the items of the list in a consecutive manner, not one item of the list per line.

EXERCISE 5.

Write a report of up to 1,000 words where you present and discuss:

  • A novel exercise to be included in the data analysis of the Dublin Bus dataset.

There is no need to implement the new exercise, you just need to discuss it in terms of:

  • Its originality It has to be different from the 4 exercises proposed.
  • Its relevance Include a potential use-case derived from the exercise you are proposing.
  • Its viability

Do not implement the exercise, but briefly discuss in natural language the main steps that would be needed so as to implement it.

Include in the discussion whether you will choose to implement it in Spark Core or Spark SQL, justying your selection.

Position the new exercise in terms of difficulty with respect to the four exercises proposed.

Shopping Cart
[Solved] BigData Assignment1-SPARK CORE & SPARK SQL
$25