[Solved] CS585 Project 1-Map-reduce jobs

$25

File Name: CS585_Project_1-Map-reduce_jobs.zip
File Size: 292.02 KB


In this project, you will write map-reduce jobs in Java and Apache Pig scripts, and run them on a Hadoop system.

Detailed Description

You are asked to perform four activities in this project: (1) create datasets, (2) upload the datasets into Hadoop HDFS, (3) query the data by writing map-reduce Java code, and (4) query the data using Pig scripts.

1-Creating Datasets

Write a Java program that creates two datasets (two files), Customers and Transactions. Each line in the Customers file represents one customer, and each line in the Transactions file represents one transaction. The attributes within each line are comma separated.

The Customers dataset should have the following attributes for each customer:
ID: unique sequential number (integer) from 1 to 50,000 (that is, the file will have 50,000 lines)
Name: random sequence of characters of length between 10 and 20 (do not include commas)
Age: random number (integer) between 10 and 70
Gender: string that is either male or female
CountryCode: random number (integer) between 1 and 10
Salary: random number (float) between 100 and 10000

The Transactions dataset should have the following attributes for each transaction:
TransID: unique sequential number (integer) from 1 to 5,000,000 (the file has 5M transactions)
CustID: references one of the customer IDs, i.e., from 1 to 50,000 (on average, a customer has 100 transactions)
TransTotal: random number (float) between 10 and 1000
TransNumItems: random number (integer) between 1 and 10
TransDesc: random text of characters of length between 20 and 50 (do not include commas)

Note: The column names will NOT be stored in the file; only the comma-separated values are stored.
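As an illustration of the record layout described above, the following is a minimal sketch of how one line of each file might be produced. The class and method names (DatasetSketch, customerLine, transactionLine, and the helpers) are hypothetical and not part of the assignment, and restricting random text to lowercase letters is an assumption made only to guarantee that no commas appear.

```java
import java.util.Locale;
import java.util.Random;

// Hypothetical helper class; names are illustrative, not prescribed by the assignment.
class DatasetSketch {
    private static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    // Random lowercase string with length in [minLen, maxLen]; letters only, so no commas.
    static String randomText(Random rnd, int minLen, int maxLen) {
        int len = minLen + rnd.nextInt(maxLen - minLen + 1);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(LETTERS.charAt(rnd.nextInt(LETTERS.length())));
        }
        return sb.toString();
    }

    // Integer drawn uniformly from lo to hi, both inclusive.
    static int randomInt(Random rnd, int lo, int hi) {
        return lo + rnd.nextInt(hi - lo + 1);
    }

    // Float drawn from the range between lo and hi.
    static float randomFloat(Random rnd, float lo, float hi) {
        return lo + rnd.nextFloat() * (hi - lo);
    }

    // Column order: ID, Name, Age, Gender, CountryCode, Salary
    static String customerLine(int id, Random rnd) {
        String gender = rnd.nextBoolean() ? "male" : "female";
        // Locale.ROOT keeps '.' as the decimal separator, so floats never contain a comma.
        return String.format(Locale.ROOT, "%d,%s,%d,%s,%d,%.2f",
                id, randomText(rnd, 10, 20), randomInt(rnd, 10, 70),
                gender, randomInt(rnd, 1, 10), randomFloat(rnd, 100f, 10000f));
    }

    // Column order: TransID, CustID, TransTotal, TransNumItems, TransDesc
    static String transactionLine(int transId, Random rnd) {
        return String.format(Locale.ROOT, "%d,%d,%.2f,%d,%s",
                transId, randomInt(rnd, 1, 50000), randomFloat(rnd, 10f, 1000f),
                randomInt(rnd, 1, 10), randomText(rnd, 20, 50));
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // Print one sample line of each kind; a full generator would loop over
        // IDs 1..50,000 and 1..5,000,000 and write each line to its file.
        System.out.println(customerLine(1, rnd));
        System.out.println(transactionLine(1, rnd));
    }
}
```

Writing the complete files would then amount to calling these methods in a loop over the sequential IDs and sending each line to a buffered file writer.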
From the order of the columns, you will know what each column represents.

2-Uploading Data into Hadoop

Use Hadoop file system commands (e.g., put) to upload the files you created to the Hadoop cluster. To learn about the file system commands, check this link:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

Note: It is good to check your files and see how they are divided into blocks and how each block is replicated. You can do that through the Hadoop web interface (check the Readme file in your virtual machine to learn how to do that).

3-Writing MapReduce Jobs

You will write Java programs to query the data in Hadoop. Before writing your code, you should fully understand the WordCount example (it is like the Hello World example in Java). You can find its code online, and it is also included in your virtual machine (check the Readme file).

Notes:
You should decide whether each query is a map-only job or a map-reduce job, and write your code based on that.
A given query may require more than a single map-reduce job.
You can always check the query output file from the HDFS website and see its content.
You can test your code on a small file first to make sure it is working correctly before running it on the large datasets.

Hint: It is important to know how Hadoop reads and writes integers, floats, and text fields. Check the IntWritable, FloatWritable, and Text classes to know which one to use and when.

3.1) Query 1

Write a job(s) that reports the customers whose Age is between 20 and 50 (inclusive).

3.2) Query 2

Write a job(s) that reports, for every customer, the number of transactions that customer did and the total sum of these transactions.
The output file should have one line for each customer containing:
CustomerID, CustomerName, NumTransactions, TotalSum

You are required to use a Combiner in this query.

3.3) Query 3

Write a job(s) that joins the Customers and Transactions datasets (based on the customer ID) and reports for each customer the following info:
CustomerID, Name, Salary, NumOfTransactions, TotalSum, MinItems

Where NumOfTransactions is the total number of transactions done by the customer, TotalSum is the sum of field TransTotal for that customer, and MinItems is the minimum number of items in transactions done by the customer.

3.4) Query 4

Write a job(s) that reports, for every country code, the number of customers having this code as well as the min and max of the TransTotal fields for the transactions done by those customers. The output file should have one line for each country code containing:
CountryCode, NumberOfCustomers, MinTransTotal, MaxTransTotal

Hint: To get the full mark for Query 4, you need to do it in a single map-reduce job. If you do it using two map-reduce jobs, you will lose 8 points.

3.5) Query 5

Assume we want to design an analytics task on the data as follows:
1) The Age attribute is divided into six groups, which are [10, 20), [20, 30), [30, 40), [40, 50), [50, 60), and [60, 70]. The bracket [ means the lower bound of a range is included, whereas ) means the upper bound of a range is excluded.
2) Within each of the above age ranges, further division is performed based on the Gender, i.e., each of the 6 age groups is further divided into two groups.
3) For each group, we need to report the following info:
Age Range, Gender, MinTransTotal, MaxTransTotal, AvgTransTotal

4-Writing Apache-Pig Jobs

4.1) Query 1

Write an Apache Pig query that reports the customer names that have the least number of transactions. Your output should be the customer names and the number of transactions.

4.2) Query 2

Write an Apache Pig query that joins Customers and Transactions using a Broadcast (replicated) join.
The query reports for each customer the following info:
CustomerID, Name, Salary, NumOfTransactions, TotalSum, MinItems

Where NumOfTransactions is the total number of transactions done by the customer, TotalSum is the sum of field TransTotal for that customer, and MinItems is the minimum number of items in transactions done by the customer.

4.3) Query 3

Write an Apache Pig query that reports the Country Codes having a number of customers greater than 5,000 or less than 2,000.

4.4) Query 4

Write an Apache Pig query that implements Query 3.5 above.
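
The age-range and gender grouping of Query 3.5, which the last Pig query reuses, can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class AgeGenderStats, its methods, and the key format "range,gender" are hypothetical names chosen here, and the Hadoop-specific parts (emitting the key as Text, carrying the partial aggregate in a Writable) are left out. The sketch keeps a sum and a count rather than an average, since partial sums and counts can be merged safely while partial averages cannot.

```java
// Hypothetical helper; names and key format are illustrative assumptions.
class AgeGenderStats {

    // Label of the age range containing the given age.
    // The last range, [60, 70], is closed on both ends as in the task description.
    static String ageRange(int age) {
        if (age < 10 || age > 70) {
            throw new IllegalArgumentException("age out of range: " + age);
        }
        if (age >= 60) {
            return "[60, 70]";
        }
        int lo = (age / 10) * 10;
        return "[" + lo + ", " + (lo + 10) + ")";
    }

    // Grouping key combining age range and gender, e.g. "[20, 30),female".
    static String groupKey(int age, String gender) {
        return ageRange(age) + "," + gender;
    }

    // Partial aggregate over TransTotal values for one group.
    static final class Stats {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        double sum = 0.0;
        long count = 0;

        // Fold a single TransTotal value into this aggregate.
        void add(float transTotal) {
            min = Math.min(min, transTotal);
            max = Math.max(max, transTotal);
            sum += transTotal;
            count++;
        }

        // Fold another partial aggregate into this one.
        void merge(Stats other) {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            sum += other.sum;
            count += other.count;
        }

        // Average computed only at the end, from the merged sum and count.
        double avg() {
            return sum / count;
        }
    }

    public static void main(String[] args) {
        Stats s = new Stats();
        s.add(10f);
        s.add(30f);
        System.out.println(groupKey(25, "female") + " -> avg " + s.avg());
    }
}
```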
