The input consists of two CSV files: parking_violations.csv and open-violations.csv.
To read a CSV file in Spark:
from csv import reader
import sys

lines = sc.textFile(sys.argv[1], 1)
# Parse each partition's lines with Python's csv reader.
lines = lines.mapPartitions(lambda x: reader(x))
Task 1: Find all parking violations that have been paid, i.e., those that do not appear in open-violations.csv.
Output: A key-value* pair per line, where:
key = summons_number
values = plate_id, violation_precinct, violation_code, issue_date
(*Note: separate the key and value with a tab character (\t), and separate elements within the key/value with a comma followed by a space. This applies to all tasks below.)
Your output format should conform to the format of following examples:
1307964308 GBH2444, 74, 46, 2016-03-07
4617863450 HAM2650, 0, 36, 2016-03-24
To complete this task,
1) Write a MapReduce job and run it on Hadoop with 2 reducers.
2) Write a Spark program.
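At its core, the MapReduce job for Task 1 is a reduce-side join: the mapper tags each record with its source file, records are grouped by summons number, and the reducer keeps only records with no open-violation tag. A minimal plain-Python sketch of that logic follows; the column positions and the function names are assumptions, not the real schema, and a real Hadoop Streaming job would read lines from stdin rather than take them as arguments.

```python
from csv import reader

# Assumed column layout: summons_number first, then plate_id,
# violation_precinct, violation_code, issue_date. Adjust to the real schema.

def mapper(line, source):
    """Tag each record with its source so the reducer can tell them apart."""
    row = next(reader([line]))
    if source == 'violations':
        # Keep only the fields the output needs, comma-space separated.
        return (row[0], ('V', ', '.join(row[1:5])))
    # open-violations record: the summons number alone marks it as unpaid.
    return (row[0], ('O', None))

def reducer(summons, tagged_values):
    """Emit the record only if no open-violation tag is present (i.e., paid)."""
    tags = dict(tagged_values)
    if 'V' in tags and 'O' not in tags:
        return summons + '\t' + tags['V']
    return None
```

In Spark, the same join can be expressed more directly with subtractByKey: map parking violations to (summons_number, value-string), map open violations to (summons_number, 1), subtract the second RDD from the first, and format each surviving pair as key\tvalue.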
Task 2: Find the frequencies of the violation types in parking_violations.csv, i.e., for each violation code, count how many violations have that code.
Output: A key-value pair per line, where:
key = violation_code
value = number of violations
Here are sample output lines, each containing one key-value pair:
1 159
46 100
To complete this task,
1) Write a MapReduce job and run it on Hadoop with 2 reducers.
2) Write a Spark program.