[Solved] INF553 Assignment 1-Analysis using Spark


In this assignment, students will complete three tasks. The goal of these tasks is to familiarize students with Spark and with performing data analysis using Spark. The first part of this description covers how to configure the environment and the data sets, the second part describes the three tasks in detail, and the third part covers the files students should submit and the grading criteria.

Spark Installation

Spark can be downloaded from the official website (refer to: link).

Please use Spark 2.3.1 with Hadoop 2.7 for this assignment. The interface of the official Spark website is shown in the following figure.

Scala Installation

You can use IntelliJ if you prefer an IDE for creating and debugging projects; install the Scala/SBT plugins for IntelliJ. You can refer to the tutorial Setting up Spark 2.0 environment on IntelliJ Community Edition.

Python Configuration

You need to add the paths of your Spark (path/to/your/Spark) and Python (path/to/your/Spark/python) folders to the interpreter's environment variables named SPARK_HOME and PYTHONPATH, respectively.
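If you configure the interpreter programmatically rather than through an IDE dialog, the same two variables can be set from Python before pyspark is imported. A minimal sketch; the install path below is only a placeholder, substitute your own:

```python
import os

# Placeholder install location -- replace with the folder where you
# unpacked spark-2.3.1-bin-hadoop2.7.
SPARK_HOME = "/opt/spark-2.3.1-bin-hadoop2.7"

# These must be set before `import pyspark` is attempted.
os.environ["SPARK_HOME"] = SPARK_HOME
os.environ["PYTHONPATH"] = os.path.join(SPARK_HOME, "python")
```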

Environment Requirements

Python: 2.7, Scala: 2.11, Spark: 2.3.1

IMPORTANT: We will use these versions to compile and test your code. If you use other versions, there will be a 20% penalty since we will not be able to grade it automatically.

Data

Please download the Stack Overflow 2018 Developer Survey data from this link. A detailed introduction to the data can also be found through the link.

You are required to download the dataset that contains two files: survey_results_public.csv and survey_results_schema.csv. The first file contains the survey responses and is the one required for this homework. The second file describes the 129 columns of the dataset. In this assignment, only 3 columns of the dataset will be used: Country, Salary, and SalaryType.

Task 1:

Students are required to compute the total number of survey responses per country that have provided a salary value; i.e., response entries containing an NA or 0 salary value are considered non-useful responses and should be discarded.

Result format:

  1. Save the result as one csv file;
  2. The first line in the file should contain the keyword Total and the total number of survey responses containing a salary;
  3. The result should be ordered by country in ascending order.
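One way to express the "useful response" rule is a small predicate that can be passed to RDD.filter. This is only a sketch: the dictionary-style column access in the comment assumes the CSV rows have already been parsed into dicts, so adapt it to however you parse the file.

```python
def has_salary(value):
    """Return True only for responses with a usable salary value.

    NA, empty, and 0 salaries are treated as non-useful responses,
    per the Task 1 requirement.
    """
    if value is None:
        return False
    text = str(value).strip().replace(",", "")  # salaries may look like "51,000"
    if text in ("", "NA"):
        return False
    try:
        return float(text) != 0
    except ValueError:
        return False

# Spark usage sketch (assumes rows parsed into dicts keyed by column name):
#   counts = (rows.filter(lambda r: has_salary(r["Salary"]))
#                 .map(lambda r: (r["Country"], 1))
#                 .reduceByKey(lambda a, b: a + b)
#                 .sortByKey())
```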

The following snapshot is an example of the result for Task 1:

Task 2:

Since processing large volumes of data requires performance decisions, properly partitioning the data for processing is imperative. In this task, students are required to show the number of partitions for the RDD built in Task 1 and the number of items per partition. Then, students have to use a partition function (using the country value as the partition key) to improve the performance of the map and reduce tasks. A time-span comparison between the standard RDD (used in Task 1) and the partitioned RDD (built using the partition function) should also be shown.
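A minimal sketch of a deterministic partition function keyed on the country value; in pyspark, partitionBy applies the function to each key and takes the result modulo the number of partitions. Using crc32 instead of the built-in hash() is a choice made here for stability across processes, not a requirement of the assignment.

```python
import zlib

def country_partitioner(country):
    """Map a country name to a stable non-negative integer.

    zlib.crc32 is deterministic across runs and worker processes,
    unlike Python's built-in hash() with string hash randomization.
    """
    return zlib.crc32(country.encode("utf-8"))

# Spark usage sketch (n = number of partitions chosen by the student):
#   partitioned = pair_rdd.partitionBy(n, country_partitioner)
#   sizes = partitioned.glom().map(len).collect()   # items per partition
```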

The following snapshot is an example of the result for Task 2:

Task 3:

Students are required to compute annual salary averages per country and show min and max salaries.

Hints for Task 3:

  1. Some salary values represent weekly or monthly payments. Remember to perform the appropriate transformations to compute the annual salary. The value in the column SalaryType indicates whether the salary amount is annual, weekly, or monthly.
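The hint above can be sketched as a small conversion helper. The labels "Yearly", "Monthly", and "Weekly" are assumptions about the SalaryType values; confirm them against survey_results_schema.csv and the actual data.

```python
def annual_salary(amount, salary_type):
    """Convert a salary amount to its annual equivalent.

    Assumes SalaryType takes the values 'Yearly', 'Monthly', or
    'Weekly' -- check the schema file for the exact labels.
    Unknown labels are left unscaled.
    """
    multipliers = {"Yearly": 1, "Monthly": 12, "Weekly": 52}
    return amount * multipliers.get(salary_type, 1)
```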

Result format:

  1. Save the result as one csv file.
  2. The result should be ordered by country in ascending order.
  3. Columns should contain: country, number of salaries, min salary, max salary, and average salary.
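Count, min, max, and sum can be gathered per country in a single pass, for example with aggregateByKey; the average then follows from sum / count. A sketch of the two combining functions such an approach would use:

```python
# Accumulator layout: (count, min, max, sum)
ZERO = (0, float("inf"), float("-inf"), 0.0)

def seq_op(acc, salary):
    """Fold one salary value into a per-country accumulator."""
    n, lo, hi, total = acc
    return (n + 1, min(lo, salary), max(hi, salary), total + salary)

def comb_op(a, b):
    """Merge two partial accumulators (e.g. from different partitions)."""
    return (a[0] + b[0], min(a[1], b[1]), max(a[2], b[2]), a[3] + b[3])

# Spark usage sketch:
#   stats = pairs.aggregateByKey(ZERO, seq_op, comb_op)
#   rows = stats.mapValues(lambda s: (s[0], s[1], s[2], s[3] / s[0]))
```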

The following snapshot is an example of the result for Task 3:
