[Solved] INF553 Homework 1-Yelp dataset




4.1 Task1: Data Exploration

In this task, you will explore the dataset review.json, which contains review information, and write a program that automatically answers the following questions:

  A. The total number of reviews
  B. The number of reviews in 2018
  C. The number of distinct users who wrote reviews
  D. The top 10 users who wrote the largest numbers of reviews and the number of reviews they wrote
  E. The number of distinct businesses that have been reviewed
  F. The top 10 businesses that had the largest numbers of reviews and the number of reviews they had
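The counting questions above can be sketched in plain Python. This is only an illustration, not the graded solution: it assumes review.json holds one JSON object per line with `user_id`, `business_id`, and `date` fields (the exact schema is not given here), and it uses a small in-memory sample instead of the real file. The assignment itself works with Spark RDDs; the function name `explore` and the result keys are hypothetical.

```python
import json
from collections import Counter

# Tiny stand-in for review.json. The field names and the
# one-object-per-line layout are assumptions for illustration.
SAMPLE_LINES = [
    '{"user_id": "u1", "business_id": "b1", "date": "2018-03-01 10:00:00"}',
    '{"user_id": "u2", "business_id": "b1", "date": "2017-07-15 09:30:00"}',
    '{"user_id": "u1", "business_id": "b2", "date": "2018-11-20 18:45:00"}',
    '{"user_id": "u3", "business_id": "b1", "date": "2016-01-05 12:00:00"}',
]

def explore(lines):
    """Return the simple counts plus per-user and per-business counters."""
    total = 0
    in_2018 = 0
    per_user = Counter()
    per_business = Counter()
    for line in lines:
        review = json.loads(line)
        total += 1
        # Assumes the date string starts with a four-digit year.
        if review["date"].startswith("2018"):
            in_2018 += 1
        per_user[review["user_id"]] += 1
        per_business[review["business_id"]] += 1
    return {
        "total": total,
        "in_2018": in_2018,
        "distinct_users": len(per_user),
        "distinct_businesses": len(per_business),
        "per_user": per_user,
        "per_business": per_business,
    }

result = explore(SAMPLE_LINES)
```

The two counters are the starting point for the top-10 questions, which only need the counts ranked afterwards.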

Input format: (we will use the following command to execute your code)

Param: input_file_name: the name of the input file (review), including file path

Param: output_file_name: the name of the output JSON file, including file path

Output format:

IMPORTANT: Please strictly follow the output format since your code will be graded automatically.

  1. The output for Questions A/B/C/E will be a number. The output for Questions D/F will be a list sorted by the number of reviews in descending order. If two user_ids/business_ids have the same number of reviews, sort them in alphabetical order.
  2. You need to write the results to a file in JSON format. You must use exactly the same tags (see the red boxes in Figure 2) for answering each question.
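The ordering rule above (count descending, ties broken alphabetically) can be expressed as a single sort key. The sketch below is a hedged illustration: the helper `top_n` is hypothetical, and the JSON tags `n_review` and `top_users` are placeholders, since the required tags appear only in Figure 2, which is not reproduced here.

```python
import json

def top_n(counts, n=10):
    """Rank (id, count) pairs by count descending, then id ascending."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Lists rather than tuples, so the JSON output is an array of arrays.
    return [[key, count] for key, count in ranked[:n]]

# Made-up counts; "bob" and "carol" tie on 2 reviews.
user_counts = {"carol": 2, "alice": 5, "bob": 2, "dave": 1}
top_users = top_n(user_counts, n=3)

# Placeholder tags: replace them with the exact tags the assignment requires.
answer_json = json.dumps({"n_review": 10, "top_users": top_users})
```

Negating the count inside the key keeps the alphabetical tie-break ascending while the counts sort descending, which a plain `reverse=True` would not do.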

4.2 Task2: Partition

Processing large volumes of data efficiently depends heavily on how the data is partitioned, so choosing a suitable partitioning scheme is essential.

In this task, you will show the number of partitions for the RDD used for Task 1 Question F and the number of items per partition. Then, you need to use a customized partition function to improve the performance of map and reduce tasks. Your results should also compare the time taken to execute Task 1 Question F with the default partitioning and with the customized partitioning (the RDD built using the partition function).
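To make the idea of a customized partition function concrete, here is a plain-Python simulation. It is a sketch under stated assumptions, not Spark code: the function names are hypothetical, the records are made up, and hashing the business_id with `zlib.crc32` is just one possible choice of partition function. In PySpark the corresponding steps would typically go through `RDD.partitionBy` with a partition function and `RDD.glom` to inspect partition contents.

```python
import time
import zlib
from collections import Counter

def partition_by_business(business_id, n_partition):
    # Hypothetical partition function: a stable hash of the key, so every
    # record for the same business lands in the same partition.
    return zlib.crc32(business_id.encode("utf-8")) % n_partition

def split_into_partitions(pairs, n_partition, partition_func):
    """Distribute (key, value) pairs into n_partition lists."""
    partitions = [[] for _ in range(n_partition)]
    for key, value in pairs:
        partitions[partition_func(key, n_partition)].append((key, value))
    return partitions

# 20 made-up (business_id, 1) records spread over 5 businesses.
pairs = [("b%d" % (i % 5), 1) for i in range(20)]
n_partition = 4

start = time.perf_counter()
partitions = split_into_partitions(pairs, n_partition, partition_by_business)
n_items = [len(p) for p in partitions]
# Each partition can count its own businesses independently.
partial_counts = [Counter(key for key, _ in p) for p in partitions]
exe_time = time.perf_counter() - start

# Merging is trivial because no business appears in two partitions.
merged = Counter()
for partial in partial_counts:
    merged.update(partial)
```

Because records are grouped by key, the per-business counts never have to be combined across partitions, which is the property a reduce step benefits from. The items-per-partition list may be uneven; that trade-off is part of what the task asks you to describe.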

Input format: (we will use the following command to execute your code)

Param: input_file_name: the name of the input file (review), including file path

Param: output_file_name: the name of the output JSON file, including file path

Param: n_partition: the number of partitions (say, 8)

Output format:

  1. The outputs for the number of partitions and the execution time will each be a number. The output for the number of items per partition will be a list of numbers. You will also need to describe and explain these outputs in 1 or 2 sentences.
  2. You need to write the results in a JSON file. You must use exactly the same tags (see the red boxes in Figure 3) for the task.

4.3 Task3: Exploration on Multiple Datasets

In Task3, you will explore two datasets together, one containing review information (review.json) and one containing business information (business.json), and write a program to answer the following questions:

  A. What are the average stars for each city? (DO NOT use the stars information in the business file) (1 point)
  B. Use two methods to print the top 10 cities with the highest stars. Compare the time taken by the two methods and explain the result in 1 or 2 sentences. (1 point)

Method1: Collect all the data, and then print the first 10 cities

Method2: Take the first 10 cities, and then print all
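The difference between the two methods can be illustrated with a plain-Python analogy. This is a sketch, not Spark code: a generator stands in for the sorted distributed data, Method1 pulls every record before slicing, and Method2 pulls only the first 10. The city names, star values, and function names are made up for illustration.

```python
import itertools
import time

def sorted_city_stream(city_stars):
    """Yield (city, stars) pairs ordered by stars descending, then city."""
    for pair in sorted(city_stars, key=lambda cs: (-cs[1], cs[0])):
        yield pair

def method1(city_stars):
    # Collect all the data, then keep the first 10.
    everything = list(sorted_city_stream(city_stars))
    return everything[:10]

def method2(city_stars):
    # Take only the first 10 records from the stream.
    return list(itertools.islice(sorted_city_stream(city_stars), 10))

def timed(func, data):
    """Run func(data) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(data)
    return result, time.perf_counter() - start

# Made-up data: 5000 cities with stars cycling through 1.0 to 5.0.
city_stars = [("city%04d" % i, float(i % 5) + 1.0) for i in range(5000)]

top10_m1, seconds_m1 = timed(method1, city_stars)
top10_m2, seconds_m2 = timed(method2, city_stars)
```

Both methods return the same 10 cities; what differs is how many records are materialized on the way. How large the measured gap is depends on the data size and the environment, so the sketch records both timings without asserting which one is smaller.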

Input format: (we will use the following command to execute your code)

Param: input_file_name1: the name of the input file (review), including file path

Param: input_file_name2: the name of the input file (business), including file path

Param: output_file_name1: the name of the output file/folder for Question A, including file path

Param: output_file_name2: the name of the output JSON file for Question B, including file path

Output format:

  1. You need to write the results for Question A as a file/folder named firstname_lastname_task3_output (all lowercase). The header (first line) of the file is city,stars. The output should be sorted by the average stars in descending order. If two cities have the same stars, sort them in alphabetical order. (see Figure 4 left)
  2. You also need to write the answer for Question B in a JSON file. You must use exactly the same tags (see the red boxes in Figure 4 right) for the task.

Figure 4: Question A output file structure (left) and JSON output structure (right) for task3
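The per-city average and the city,stars output described above can be sketched in plain Python. This is an illustration under assumptions, not the graded solution: it assumes each review record carries `business_id` and `stars`, each business record carries `business_id` and `city`, and it uses small in-memory samples. The function names are hypothetical. The stars come only from the review records, as the task requires.

```python
import csv
import io
from collections import defaultdict

# Made-up sample records; field names are assumptions about the two files.
reviews = [
    {"business_id": "b1", "stars": 5.0},
    {"business_id": "b1", "stars": 3.0},
    {"business_id": "b2", "stars": 4.0},
    {"business_id": "b3", "stars": 5.0},
    {"business_id": "b3", "stars": 4.0},
    {"business_id": "b4", "stars": 2.0},  # b4 is absent from businesses
]
businesses = [
    {"business_id": "b1", "city": "Las Vegas"},
    {"business_id": "b2", "city": "Phoenix"},
    {"business_id": "b3", "city": "Akron"},
]

def average_stars_by_city(reviews, businesses):
    """Join reviews to cities on business_id and average the review stars."""
    city_of = {b["business_id"]: b["city"] for b in businesses}
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum of stars, count]
    for review in reviews:
        city = city_of.get(review["business_id"])
        if city is None:
            continue  # no matching business, so no city to credit
        totals[city][0] += review["stars"]
        totals[city][1] += 1
    averages = [(city, s / n) for city, (s, n) in totals.items()]
    # Stars descending, then city name ascending for ties.
    return sorted(averages, key=lambda cs: (-cs[1], cs[0]))

def to_csv_text(rows):
    """Render rows under the city,stars header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["city", "stars"])
    writer.writerows(rows)
    return buffer.getvalue()

rows = average_stars_by_city(reviews, businesses)
csv_text = to_csv_text(rows)
```

Skipping reviews whose business is missing mirrors an inner join; whether the real datasets contain such reviews is not stated here, so treat that branch as a defensive assumption.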
