In this assignment, the datasets you are going to use are from: https://drive.google.com/open?id=11anZyiXBfhyICo2mkn6PFnegXgiDiUpF
We generated the following two datasets from the original Yelp review dataset with some filters such as the condition: state == CA. We randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and 20% of the data as the testing dataset.
- csv: the training data, which only include the columns: user_id, business_id, and stars.
- csv: the validation data, which are in the same format as training data.
- We do not share the testing dataset.
1. Tasks
1.1 Task1: Jaccard based LSH
In this task, you will implement the Locality Sensitive Hashing algorithm with Jaccard similarity using yelp_train.csv. You can refer to sections 3.3 3.5 of the Mining of Massive Datasets book.
In this task, we focus on the 0 or 1 ratings rather than the actual ratings/stars from the users. Specifically, if a user has rated a business, the users contribution in the characteristic matrix is 1. If the user hasnt rated the business, the contribution is 0. You need to identify similar businesses whose similarity >= 0.5.
You can define any collection of hash functions that you think would result in a consistent permutation of the row entries of the characteristic matrix. Some potential hash functions are:
f(x)= (ax + b) % m or f(x) = ((ax + b) % p) % m
where p is any prime number and m is the number of bins. You can use any combination for the parameters (a, b, p, and m) in your implementation.
After you have defined all the hashing functions, you will build the signature matrix. Then you will divide the matrix into b bands with r rows each, where b x r = n (n is the number of hash functions). You should carefully select a good combination of b and r in your implementation. Remember that two items are a candidate pair if their signatures are identical in at least one band.
Your final results will be the candidate pairs whose original Jaccard similarity is >= 0.5. You need to write the final results into a CSV file according to the output format below.
Example of Jaccard Similarity:
user1 user2 user3 user4
business1 | 0 | 1 | 1 | 1 | |
business2 | 0 | 1 | 0 | 0 | |
Ja | ccard Similarity (business1, business2) = #intersection / #union = 1/3 |
Input format: (we will use the following command to execute your code)
Param: input_file_name: the name of the input file (e.g., yelp_train.csv), including the file path. Param: output_file_name: the name of the output CSV file, including the file path.
Output format:
IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We will not regrade on formatting issues.
- The output file is a CSV file, containing all business pairs you have found. The header is business_id_1, business_id_2, similarity. Each pair itself must be in the alphabetical order. The entire file also needs to be in the alphabetical order. There is no requirement for the number of decimals for the similarity value. Please refer to the format in Figure 2.
Figure 2: a CSV output example for task1
Reviews
There are no reviews yet.