In this project, you will write Scala/SparkSQL code for executing jobs over Spark infrastructure. You can either work in Cluster Mode (where files are read/written over HDFS) or in Local Mode (where files are read/written over the local file system). This should not affect your code design, i.e., in either case your code should be designed with the highest degree of parallelization possible.

Spark Virtual Machine. You can either build your own VM or work with the one provided to you (see below).

Problem 1 SparkSQL (Transaction Data Processing)
Use the Transaction dataset T that you created in Project 1 and create a Spark workflow to do the following. [Use SparkSQL to write this workflow.] Start with T as a dataframe (a sketch of one possible workflow follows the list).
1) T1: Filter out (drop) the transactions from T whose total amount is less than $200.
2) T2: Over T1, group the transactions by the Number of Items, and for each group calculate the sum, the average, the min, and the max of the total amounts.
3) Report back T2 to the client side.
4) T3: Over T1, group the transactions by customer ID, and for each group report the customer ID and the transactions count.
5) T4: Filter out (drop) the transactions from T whose total amount is less than $600.
6) T5: Over T4, group the transactions by customer ID, and for each group report the customer ID and the transactions count.
7) T6: Select the customer IDs whose T5.count * 3 < T3.count.
8) Report back T6 to the client side.
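The following is a minimal sketch of that workflow. The file name Transactions.csv and the column names (TransID, CustID, TransTotal, TransNumItems) are assumptions standing in for whatever schema your Project 1 dataset actually has; adjust them to match your own file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Problem1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Problem1").getOrCreate()

    // Assumed schema from Project 1: (TransID, CustID, TransTotal, TransNumItems)
    val t = spark.read.option("inferSchema", "true")
      .csv("Transactions.csv")
      .toDF("TransID", "CustID", "TransTotal", "TransNumItems")

    // T1: drop transactions whose total amount is less than $200
    val t1 = t.filter(col("TransTotal") >= 200)

    // T2: per number-of-items group: sum, avg, min, max of the totals
    val t2 = t1.groupBy("TransNumItems")
      .agg(sum("TransTotal"), avg("TransTotal"), min("TransTotal"), max("TransTotal"))
    t2.show() // report T2 back to the client side

    // T3: transactions count per customer over T1
    val t3 = t1.groupBy("CustID").count().withColumnRenamed("count", "countT3")

    // T4: drop transactions whose total amount is less than $600
    val t4 = t.filter(col("TransTotal") >= 600)

    // T5: transactions count per customer over T4
    val t5 = t4.groupBy("CustID").count().withColumnRenamed("count", "countT5")

    // T6: customers whose T5 count, tripled, is still below their T3 count
    // (an inner join keeps only customers that appear in both T3 and T5;
    // use a left join on T3 if customers absent from T5 should also qualify)
    val t6 = t3.join(t5, "CustID")
      .filter(col("countT5") * 3 < col("countT3"))
      .select("CustID")
    t6.show() // report T6 back to the client side

    spark.stop()
  }
}
```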
Problem 2 Spark-RDDs (Grid Cells of High Relative-Density Index)

Overview: Assume a two-dimensional space that extends from 1 to 10,000 in each dimension, as shown in Figure 1. There are points scattered all around the space. The space is divided into pre-defined grid cells, each of size 20x20. That is, there are 500x500 = 250,000 grid cells in the space. Each cell has a unique ID as indicated in the figure. Given the ID of a grid cell, you can calculate the row and the column it belongs to using a simple mathematical equation.

Neighbor Definition: For a given grid cell X, N(X) is the set of all neighbor cells of X, which are the cells with which X has a common edge or corner. The figure illustrates different examples of neighbors. Each non-boundary grid cell has 8 neighbors; boundary cells have fewer neighbors (see the figure). Since the grid-cell size is fixed, the IDs of the neighbor cells of a given cell can be computed using a formula (mathematical equations) in a short procedure (see the sketch after Problem 3).

Example:
N(Cell 1) = {Cell 2, Cell 501, Cell 502}
N(Cell 1002) = {Cell 501, Cell 502, Cell 503, Cell 1001, Cell 1003, Cell 1501, Cell 1502, Cell 1503}

Relative-Density Index: For a given grid cell X, I(X) is a decimal number that indicates the relative density of cell X compared to its neighbors. It is calculated as follows:

I(X) = X.count / Average(Y1.count, Y2.count, ..., Yn.count)

where X.count is the count of points inside grid cell X, and {Y1, Y2, ..., Yn} are the neighbors of X, that is, N(X) = {Y1, Y2, ..., Yn}.

Step 1 (Create the Datasets) [10 Points] // You can re-use your code from Project 2
Your task in this step is to create one dataset P (a set of 2D points). Assume the space extends from 1 to 10,000 in both the X and Y axes. Each line in the file should contain one point in the format (a,b), where a is the value on the X axis and b is the value on the Y axis. Scale the dataset to be at least 100MB. Choose an appropriate random function (of your choice) to create the points. Upload and store the file in HDFS (see the generator sketch after Problem 3).

Step 2 (Report the TOP 50 grid cells w.r.t Relative-Density Index) [40 Points]
In this step, you will write Scala or Java code (your choice) to manipulate the file and report the top 50 grid cells (the grid-cell IDs, not the points inside) that have the highest I index. Write the workflow that reports the cell IDs along with their relative-density index (see the RDD sketch after Problem 3). Your code should be fully parallelizable (distributed) and scalable.

Step 3 (Report the Neighbors of the TOP 50 grid cells) [20 Points]
Continue over the results from Step 2, and for each of the reported top 50 grid cells, report the IDs and the relative-density indexes of its neighbor cells.

Problem 3 SparkSQL (Matrix-Matrix Multiplication)
Refer to the lecture slides under Week 6, Hadoop Analytics 2, Slides 27-31. These slides describe a distributed matrix-matrix operation. Your task is to create two files M1 & M2, where M1 represents Matrix 1 and M2 represents Matrix 2. The structure of each file is illustrated in Slide 28. Assume the dimensions of M1 and M2 are 1000x1000, i.e., each matrix has one million entries. Populate the files with random integer values for each entry. Then use Spark SQL (data frames) to multiply M1 and M2. Follow the ideas of the two map-reduce jobs presented in Slides 29-31.
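For Problem 2's grid arithmetic, the row/column and neighbor-ID equations can be packaged in a small helper. This is a minimal sketch assuming row-major cell IDs starting at 1, with 500 cells per row (an ordering consistent with the N(Cell 1) and N(Cell 1002) examples above); the Grid name and the point-to-cell mapping are illustrative assumptions, not a required design.

```scala
// A hypothetical helper, assuming row-major IDs starting at Cell 1
// and coordinates in (0, 10000] on each axis.
object Grid {
  val Cols = 500 // 10,000 / 20 cells per row
  val Rows = 500

  // Map a point (x, y) to its grid-cell ID.
  def cellId(x: Double, y: Double): Int = {
    val col = math.min((x / 20).toInt, Cols - 1) // 0..499
    val row = math.min((y / 20).toInt, Rows - 1) // 0..499
    row * Cols + col + 1
  }

  // All neighbor IDs of a cell: 8 for interior cells, fewer on the boundary.
  def neighbors(id: Int): Seq[Int] = {
    val row = (id - 1) / Cols
    val col = (id - 1) % Cols
    for {
      dr <- -1 to 1
      dc <- -1 to 1
      if !(dr == 0 && dc == 0)
      r = row + dr if r >= 0 && r < Rows
      c = col + dc if c >= 0 && c < Cols
    } yield r * Cols + c + 1
  }
}
```

Under these assumptions, Grid.neighbors(1) returns Seq(2, 501, 502) and Grid.neighbors(1002) returns the eight IDs listed in the example, which is a quick way to sanity-check whatever numbering your figure actually uses.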
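For Step 1, a simple local generator is enough; the point count and the points.txt path below are placeholder assumptions sized to push the file past 100MB.

```scala
import java.io.PrintWriter
import scala.util.Random

// A minimal sketch: ~10M points at ~12 bytes per line is roughly 120MB.
object GeneratePoints {
  def main(args: Array[String]): Unit = {
    val out = new PrintWriter("points.txt")
    val rnd = new Random()
    for (_ <- 1 to 10000000) {
      val a = rnd.nextInt(10000) + 1 // X in 1..10,000
      val b = rnd.nextInt(10000) + 1 // Y in 1..10,000
      out.println(s"($a,$b)")
    }
    out.close()
    // then upload, e.g.: hdfs dfs -put points.txt /project3/points.txt
  }
}
```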
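For Steps 2 and 3, one fully distributed RDD workflow is sketched below, reusing the hypothetical Grid helper above; the HDFS path is a placeholder. The idea: count points per cell, push each cell's count to all of its neighbors to get per-cell neighbor sums, then divide by the geometric neighbor count so that empty neighbors still count as zero.

```scala
import org.apache.spark.sql.SparkSession

object TopDensityCells {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TopDensityCells").getOrCreate()
    val sc = spark.sparkContext

    // (cellId, pointCount) for every non-empty cell
    val counts = sc.textFile("hdfs:///project3/points.txt")
      .map { line =>
        val Array(a, b) = line.stripPrefix("(").stripSuffix(")").split(",")
        (Grid.cellId(a.toDouble, b.toDouble), 1L)
      }
      .reduceByKey(_ + _)

    // neighborSums(X) = total points in N(X); empty cells contribute 0
    val neighborSums = counts
      .flatMap { case (id, c) => Grid.neighbors(id).map(n => (n, c)) }
      .reduceByKey(_ + _)

    // I(X) = X.count / average neighbor count; cells whose neighbors are
    // all empty get an infinite index here (a policy choice to document)
    val index = counts.leftOuterJoin(neighborSums).map { case (id, (c, sumOpt)) =>
      val avg = sumOpt.getOrElse(0L).toDouble / Grid.neighbors(id).size
      (id, if (avg == 0) Double.PositiveInfinity else c / avg)
    }

    // Step 2: top 50 cells by relative-density index
    val top50 = index.top(50)(Ordering.by(_._2))
    top50.foreach(println)

    // Step 3: indexes of each top cell's neighbors, via a distributed join
    // (neighbors with no points have no index entry and drop out here)
    val wanted = sc.parallelize(top50.flatMap { case (id, _) =>
      Grid.neighbors(id).map(n => (n, id))
    })
    wanted.join(index)
      .map { case (nId, (topId, i)) => (topId, nId, i) }
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```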
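For Problem 3, one possible DataFrame sketch is below. The one-entry-per-line (i, j, value) layout and the file names M1.csv / M2.csv are assumptions standing in for the Slide 28 format; the join-then-aggregate structure mirrors the two map-reduce jobs on Slides 29-31.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MatrixMultiply {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MatrixMultiply").getOrCreate()

    // Assumed layout: one "row, col, value" entry per line in each file
    val m1 = spark.read.csv("M1.csv").toDF("i", "j", "v")
      .select(col("i").cast("int"), col("j").cast("int"), col("v").cast("double"))
    val m2 = spark.read.csv("M2.csv").toDF("j", "k", "w")
      .select(col("j").cast("int"), col("k").cast("int"), col("w").cast("double"))

    // Job 1 idea: join on the shared dimension j and multiply the pairs.
    // Job 2 idea: group by the output coordinates (i, k) and sum the products.
    val product = m1.join(m2, "j")
      .select(col("i"), col("k"), (col("v") * col("w")).as("prod"))
      .groupBy("i", "k")
      .agg(sum("prod").as("value"))

    product.write.csv("M1xM2")
    spark.stop()
  }
}
```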