School of Computer Science Dr. Ying Zhou
COMP5349: Cloud Computing Sem. 1/2019
Assignment 2: Online Review Analysis
Group Work: 20% 02.05.2019
1 Introduction
Assignment 2 is a group project. In this project you will work on the online review data set released by Amazon and go through various analytic phases, including getting summary statistics, removing unwanted data and performing similarity analysis. The phases and their sequence are typical of many real-world data analytic workloads. The objective is to test your ability to apply a big data framework in a realistic setting.
2 Data Set
The complete data set contains product reviews and metadata from Amazon between 1995 and 2015. The data is available as multiple compressed TSV files in the amazon-reviews-pds S3 bucket in the AWS US East Region. Files are separated by product category. The format and index of the files can be found in Amazon Review Data format and File Index. Most categories' data are saved in a single file. For instance, the file amazon_reviews_us_Watches_v1_00.tsv.gz contains all reviews of watches on the Amazon US site. Some categories, such as books, contain a large number of reviews, which are stored in several files.
Many features of reviews are included in the data set. These include customer ID, review headline, review body, rating, helpfulness votes and so on. Some metadata about the product is also included.
The project asks you to analyze data from a single category. In subsequent sections, data set refers to data from a single category, not the complete data set. You can choose ONE category from the following:
Music: 1 file; zipped size is 1.4G
PC: 1 file; zipped size is 1.4G
Video DVD: 1 file; zipped size is 1.4G
Wireless: 1 file; zipped size is 1.6G
Digital Ebook Purchase: 2 files; zipped sizes are 2.6G and 1.2G
Books: 3 files; zipped sizes are 2.6G, 2.5G and 1.2G
The file name has the format amazon_reviews_us_<category>_v1_00.tsv.gz for a single-file category. A multiple-file category has 00, 01 and/or 02 as the last two digits of its file names.
The following features will be involved in the analysis: customer_id, product_id, star_rating, review_id and review_body.
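As a rough illustration of how the five features might be pulled out of a TSV file, here is a minimal plain-Python sketch. The column names are those of the data set's header row; the sample rows are hypothetical and reduced to a few columns, and disabling quote handling is an assumption made because review text can contain stray double quotes. It is meant for inspecting a small sample locally, not as a replacement for loading the data with Spark.

```python
import csv
import io

# Columns used in the analysis; names follow the data set's header row.
WANTED = ["customer_id", "product_id", "star_rating", "review_id", "review_body"]

def read_reviews(tsv_text):
    """Yield one dict per review, keeping only the WANTED columns."""
    # Assumption: quotes inside review text are literal, so quoting is disabled.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {name: row[name] for name in WANTED}

# Hypothetical two-row sample with only some of the real columns.
sample = (
    "customer_id\treview_id\tproduct_id\tstar_rating\treview_headline\treview_body\n"
    "c1\tr1\tp1\t5\tGreat\tLoved it. Works well.\n"
    "c2\tr2\tp1\t1\tBad\tBroke in a \"week\".\n"
)
rows = list(read_reviews(sample))
```

All values are returned as strings, so star_rating needs converting to an integer before any numeric comparison.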
3 Stage One: Overall statistics
In this stage you are asked to produce overall summary statistics of the data set, in particular statistics related to the distribution of reviews by user and by product. Users and products are uniquely identified by customer_id and product_id respectively.
You are asked to find out the following numbers:
the total number of reviews
the number of unique users
the number of unique products
For user-review distribution, you are asked to find out:
the largest number of reviews published by a single user
the top 10 users ranked by the number of reviews they publish
the median number of reviews published by a user
For product-review distribution, you are asked to find out:
the largest number of reviews written for a single product
the top 10 products ranked by the number of reviews they have
the median number of reviews a product has
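The statistics listed above can be cross-checked on a small sample with a plain-Python reference such as the sketch below. It is not a Spark implementation (which the assignment requires); the function name review_distribution and the sample data are hypothetical, and taking the median as the mean of the two middle values for an even number of keys is an assumption the specification does not settle.

```python
from collections import Counter
from statistics import median

def review_distribution(keys, top_n=10):
    """Summarise how many reviews each key (user or product) has."""
    counts = Counter(keys)
    return {
        "unique": len(counts),                # number of distinct users/products
        "largest": max(counts.values()),      # most reviews for a single key
        "top": counts.most_common(top_n),     # (key, count) pairs, largest first
        "median": median(counts.values()),    # median reviews per key
    }

# Hypothetical (customer_id, product_id) pairs, one per review.
reviews = [("u1", "p1"), ("u1", "p2"), ("u1", "p3"), ("u2", "p1"), ("u3", "p1")]

total_reviews = len(reviews)
by_user = review_distribution(u for u, _ in reviews)
by_product = review_distribution(p for _, p in reviews)
```

On the full data set the same quantities would come from a grouped count per customer_id and per product_id, with the median taken over the resulting counts.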
4 Stage Two: Filtering Unwanted Data
In this stage, you are asked to filter reviews based on length, reviewer and product features. In particular, the following reviews should be removed:
reviews with fewer than two sentences in the review body
reviews published by users with fewer than the median number of reviews published
reviews from products with fewer than the median number of reviews received
Note that sentence segmentation can be done in various ways. You may use simple punctuation characters such as the period or question mark to segment the review body, or use tools from NLP packages. We do not expect every group to generate the same output.
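A punctuation-based splitter of the kind mentioned above could look like the following sketch. Treating the exclamation mark as a sentence terminator and requiring whitespace after the punctuation are assumptions; abbreviations such as "e.g." will be split incorrectly, which is one reason different groups may get different counts.

```python
import re

# Split at whitespace that directly follows '.', '?' or '!'.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")

def split_sentences(review_body):
    """Return the non-empty sentences of a review body."""
    parts = _SENTENCE_END.split(review_body.strip())
    return [p for p in parts if p]

sentences = split_sentences("Is it good? Yes. Buy it.")
```

A review body without any terminating punctuation counts as a single sentence, and an empty body yields an empty list.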
After filtering out the above, find out:
top 10 users ranked by the median number of sentences in the reviews they have published;
top 10 products ranked by the median number of sentences in the reviews they have received.
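The filtering and ranking steps of this stage can be sketched in plain Python as below, as a small-sample reference rather than a Spark implementation. Several points are assumptions the specification leaves open: both medians are computed on the unfiltered data, "fewer than the median" is read strictly (counts equal to the median are kept), and ties in the ranking are broken arbitrarily. The function names and sample reviews are hypothetical.

```python
import re
from collections import Counter, defaultdict
from statistics import median

def count_sentences(text):
    """Count sentences using a simple punctuation-based split."""
    return len([s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s])

def filter_reviews(reviews):
    """reviews: list of (customer_id, product_id, review_body) tuples."""
    user_counts = Counter(u for u, _, _ in reviews)
    product_counts = Counter(p for _, p, _ in reviews)
    user_median = median(user_counts.values())        # assumption: pre-filter median
    product_median = median(product_counts.values())  # assumption: pre-filter median
    return [
        (u, p, body) for u, p, body in reviews
        if count_sentences(body) >= 2
        and user_counts[u] >= user_median
        and product_counts[p] >= product_median
    ]

def top_by_median_sentences(reviews, key_index, top_n=10):
    """Rank users (key_index=0) or products (key_index=1) by median sentence count."""
    lengths = defaultdict(list)
    for review in reviews:
        lengths[review[key_index]].append(count_sentences(review[2]))
    ranked = sorted(((median(v), k) for k, v in lengths.items()), reverse=True)
    return [(k, m) for m, k in ranked[:top_n]]

# Hypothetical sample data.
reviews = [
    ("u1", "p1", "Good. Very good. Buy it."),
    ("u1", "p2", "Fine. Ok."),
    ("u2", "p1", "Bad."),
    ("u2", "p2", "Meh. Not great. Slow. Loud."),
    ("u3", "p3", "Nice one. Thanks."),
]
kept = filter_reviews(reviews)
top_users = top_by_median_sentences(kept, key_index=0)
```

In this sample the one-sentence review and the review by the single-review user u3 are dropped, leaving three reviews.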
5 Stage Three: Similarity analysis with Sentence Embedding
In this stage, you are asked to perform similarity analysis on the review sentences. The analysis involves segmenting each review body into multiple sentences and encoding each sentence as a vector, so that the distance between any pair of sentences can be computed.
5.1 Positive vs. Negative Reviews
For a given product, consider all reviews with a star rating of 4 and above as positive reviews, and all reviews with a star rating of 2 and below as negative reviews. You are asked to pick a product from the top 10 products you found in Stage One. The positive class is constructed by
extracting all reviews with a rating of 4 and above
for each review, extracting the review body and segmenting it into multiple sentences.
The negative class is constructed in a similar manner, except that we extract all reviews with a rating of 2 and below.
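The class construction described above can be sketched as follows. The helper names are hypothetical, the sentence splitter is the simple punctuation-based one (an assumption), and the input is assumed to hold the reviews of one chosen product only. Each sentence is kept together with its review ID so that it can be traced back later.

```python
import re

def split_sentences(text):
    """Simple punctuation-based sentence split."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def build_classes(reviews):
    """reviews: iterable of (review_id, star_rating, review_body) for ONE product.

    Returns two lists of (review_id, sentence): positive and negative.
    """
    positive, negative = [], []
    for review_id, star_rating, body in reviews:
        rating = int(star_rating)
        if rating >= 4:
            target = positive
        elif rating <= 2:
            target = negative
        else:
            continue  # 3-star reviews belong to neither class
        target.extend((review_id, s) for s in split_sentences(body))
    return positive, negative

# Hypothetical sample reviews.
positive, negative = build_classes([
    ("r1", "5", "Great watch. Keeps time."),
    ("r2", "3", "It is ok."),
    ("r3", "1", "Broke fast."),
])
```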
Each sentence in the two classes is then embedded with Google's pre-trained Universal Sentence Encoder. The result is a 512-dimensional vector.
5.2 Intra-Class Similarity
We want to find out whether sentences in the same class are closely related to each other. Closeness is measured by the average distance between points in the class. In our case, a point refers to a sentence encoding and pair-wise distance is measured by cosine distance. Cosine distance is computed as 1 - CosineSimilarity. It has a value between 0 and 2.
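The distance definition above can be written out as a short sketch. Averaging over every unordered pair of distinct points is an assumption about what "average distance between points" means; the two-dimensional vectors stand in for the 512-dimensional encodings, and zero vectors are not handled.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def average_pairwise_distance(vectors):
    """Mean cosine distance over all unordered pairs of distinct vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical 2-d stand-ins for sentence encodings.
avg = average_pairwise_distance([[1, 0], [0, 1], [-1, 0]])
```

The number of pairs grows quadratically with the number of sentences, which is worth keeping in mind when choosing the product and the cluster capacity.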
5.3 Class Center Sentences
Find the class center and its 10 closest neighbours for the positive and negative classes respectively. We define the class center as the point that has the smallest average distance to the other points in the class. Again, a point here refers to a sentence encoding and pair-wise distance is measured by cosine distance.
The result should show the text of the center sentence, the review ID it belongs to, and the text of its 10 closest neighbouring sentences with their respective review IDs.
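The class center definition can be illustrated with the brute-force sketch below. It computes all pair-wise distances in memory, so it only suits small samples; the function name class_center, the tuple layout and the two-dimensional sample vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def class_center(points, k=10):
    """points: list of (review_id, sentence, vector).

    Returns ((review_id, sentence) of the center,
             [(review_id, sentence), ...] of its k closest neighbours).
    """
    n = len(points)
    # Full distance matrix; the diagonal is fixed at 0 and excluded from averages.
    dist = [[0.0 if i == j else cosine_distance(points[i][2], points[j][2])
             for j in range(n)] for i in range(n)]
    avg = [sum(row) / (n - 1) for row in dist]  # average distance to the OTHER points
    center = min(range(n), key=lambda i: avg[i])
    neighbours = sorted((j for j in range(n) if j != center),
                        key=lambda j: dist[center][j])[:k]
    return points[center][:2], [points[j][:2] for j in neighbours]

# Hypothetical sample with 2-d stand-ins for sentence encodings.
points = [
    ("r1", "a", [1.0, 0.0]),
    ("r2", "b", [0.9, 0.1]),
    ("r3", "c", [0.0, 1.0]),
    ("r4", "d", [0.8, 0.6]),
]
center, neighbours = class_center(points, k=2)
```

Here sentence "d" is the center because its average distance to the other three points is the smallest, and "b" is its nearest neighbour.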
6 Stage Four: Similarity analysis with Spark supported Feature Extractors
In this stage, you are asked to perform the same similarity analysis as in Stage Three, with a different embedding mechanism. You are asked to use one of the Spark-supported feature extractors, such as TF-IDF or Word2Vec, to convert review sentences into vectors before carrying out the similarity analysis.
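To make concrete what a TF-IDF vector contains, here is a plain-Python illustration. It is not one of Spark's extractors and does not replace them for this stage: it uses raw term counts and the smoothed inverse document frequency log((m + 1) / (df + 1)) given in the Spark MLlib documentation, but builds an explicit sorted vocabulary instead of hashing terms to indices. Lower-casing and whitespace tokenisation are assumptions, and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return (vocabulary, one TF-IDF vector per sentence)."""
    docs = [Counter(s.lower().split()) for s in sentences]   # term counts per sentence
    m = len(docs)
    df = Counter(term for doc in docs for term in doc)       # sentences containing each term
    idf = {t: math.log((m + 1) / (d + 1)) for t, d in df.items()}
    vocab = sorted(idf)
    return vocab, [[doc[t] * idf[t] for t in vocab] for doc in docs]

# Hypothetical sample sentences.
vocab, vectors = tfidf_vectors(["good watch", "good strap", "bad strap strap"])
```

A term that occurs in every sentence gets weight 0 under this formula, so a sentence made up only of such terms maps to the zero vector, for which cosine distance is undefined; such cases need handling before the similarity analysis.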
7 Group Collaboration
This is a group assignment. Each group can have up to FOUR members. You are asked to use git to collaborate among members; you can use the university's hosting service, GitHub or BitBucket. Once you have created the repository on a hosting platform, you should give your tutor access to it.
Group members are expected to make a fair contribution to the project. We do NOT recommend dividing the work by stage, as the amount of work required in each stage varies a lot. If members of your group do not contribute sufficiently, you should alert your tutor as soon as possible. The tutor has the discretion to scale the group's mark for each member based on their contribution to the project.
8 Coding Requirements
All stages must be implemented with Spark. You can use any combination of Spark APIs. Your Spark tasks can load external packages or libraries to perform simple tasks such as sentence segmentation or distance calculation.
We use the term stage to define the various workloads. This does not necessarily mean that you need to implement each stage in a separate script. It is possible to implement two stages in one script, or to implement one stage in two or more scripts.
Your code must run on an EMR cluster. You can decide on the capacity of the cluster based on your data set and implementation. You can store intermediate results back on S3, provided they are generated by a Spark application.
9 Deliverable
There are two deliverables: source code and a project report (around 10 pages). Both are due on Wednesday 22nd of May at 23:59 (Week 12). There will be a demo in Week 12 during tutorial time. ALL members of a group must attend the demo and explain their individual contributions. Group members who do not attend the demo will not receive any mark, unless they have been granted special permission by the course coordinator not to attend.
[TBD: submission details]
10 Report Structure
TBD
11 Reference
Amazon Customer Reviews Dataset