Microsoft Word Assignment 2.docx
INF 553 Spring 2017
Assignment 2: Frequent Itemsets
Deadline: 02/22 2017 11:59 PM PST
Assignment Overview
This assignment contains one main algorithm. You will implement the SON algorithm using the Apache
Spark Framework. You can use any Spark version that is greater or equal to 1.6 but you need to
specify the version in your description file. You will use three different datasets, ranging from very
small to very large. This will help you to test, develop and optimize your algorithm given the number of
records at hand. More details on the structure of the datasets and instructions on how to use the input files
will be explained in details in the following sections. The goal of this assignment is to make you
understand how you can apply the frequent itemset algorithms you have learned in class on a large
number of data and more importantly how you can make your implementation more performant and
efficient in a distributed environment.
Write your own code!
For this assignment to be an effective learning experience, you must write your own code!
Do not share code with other students in the class!!
Heres why:
The most obvious reason is that it will be a huge temptation to cheat: if you include code written
by anyone else in your solution to the assignment, you will be cheating. As mentioned in the
syllabus, this is a very serious offense, and may lead to you failing the class.
However, even if you do not directly include any code you look at in your solution, it surely will
influence your coding. Put another way, it will short-circuit the process of you figuring out how
to solve the problem, and will thus decrease how much you learn.
So, just dont look on the web for any code relevant to these problems. Dont do it.
Submission Details
For this assignment you will need to turn in a Python, Java, or Scala program depending on your language
of preference. We will test your code using the same datasets but with different support thresholds values.
This assignment will surely need some time to be implemented so please plan accordingly and start early!
Your submission must be a .zip file with name:
your submission should be identical as shown below. The Firstname_Lastname_Description.pdf file
contains helpful instructions on how to run your code along with other necessary information as described
in the following sections. The OutputFiles directory contains the deliverable output files for each problem
and the Solution directory contains your source code.
SON Algorithm
In this assignment we implement the SON Algorithm to solve every problem (Problems 1 to 3) on top of
Apache Spark Framework. We will rely on the fact that SON can process chunks of data in order to
identify the frequent itemsets. You will need to find all the possible combinations of the frequent
itemsets for any given input file that follows the format of the MovieLens Ratings Datasets. In order to
accomplish this task, you need to read Chapter 6 from the Mining of Massive Datasets book and
concentrate on section 6.4 Limited-Pass Algorithms. Inside the Firstname_Lastname_Description.pdf
file we need you to describe the approach you used for your program. Specifically, in order to process
each chunk which algorithm did you use, A-Priori, MultiHash, PCY, etc
At the end of the assignment, Appendix A provides some more guidelines that will help you with the
implementation and Appendix B specifies how to organize your Description pdf file.
For assignment 1 you used the Spark framework and most probably at this point you have a better
understanding of the MapReduce operations. You can write your program in Python, Java or Scala. For
this assignment you will need to find the collection of frequent itemsets of rated movies using the
MovieLens dataset with which you are already familiar from homework 1. You will need to compute the
frequent itemsets using SON algorithm, initially for a synthetic testing dataset (Problem 1) which
resembles the MovieLens ratings.csv file, then for the ml-latest-small/ratings.csv dataset (Problem 2)
and finally for the ml-20m/ratings.csv (Problem 3).
The MovieLens datasets can be found in the following link:
You will download two datasets. The first one is the ml-20m.zip and the second is the ml-latest-small.zip.
Once you extract the zip archives you will find multiple data files. From those files for this assignment we
only need the ratings.csv file from each zip archive.
We would like to compute two cases of possible frequent itemsets using the testing and ratings.csv files.
Case 1
We would like to calculate the combinations of frequent movies (as singletons, pairs, triples, etc) that
were rated and are qualified as frequent given a support threshold value.
In order to apply this computation, we will need to create a basket for each user containing the ids of the
movies that were rated by this user. If a movie was rated more than one time from a user, we consider that
this movie was rated only once. More specifically, the movie ids are unique within each basket are
unique. The generated baskets are similar to:
user1 = (movie11, movie12, movie13, )
user2 = (movie21, movie22, movie23, )
user3 = (movie31, movie32, movie33, )
Case 2
In the second case we want to calculate the combinations of frequent users (as singletons, pairs, triples,
etc) who can be qualified as frequent given a support threshold value.
In order to apply this computation, we will need to create a basket for each movie containing the ids of
the users who rated this movie. If a movie was rated more than one time from a user, we consider it was a
rated only once by this user. More specifically, the user ids are unique within each basket. The generated
baskets are similar to:
movie1 = (user11, user12, user13, )
movie2 = (user21, user22, user23, )
movie3 = (user31, user32, user33, )
Finally, in the section Problem 1, we will describe explicitly how you should run your program, and
what should be the format of your expected output.Everything that is described in section
Problem 1 must be applied to the subsequent sections as well (i.e., Problem 2 and Problem 3)
Problem 1 (20 Points)
Implementing SON using Spark with the Testing Dataset
Under the /Data folder of the assignment you will find two small sample datasets. The
Data/movies.small1.csv dataset can be used to verify the correctness of your implementation. We will
also require you to submit, for each of the two above cases, one output for evaluation for the
Data/movies.small2.csv dataset, as described in the following Deliverables section.
Execution Requirements
Input Arguments:
1. Case Number: An integer value specifying which case from the ones we just described we want
to compute the frequent itemsets. The input is an integer value. 1 for case 1 and 2 for case 2.
1. Input.csv: This is the path to the input ratings file containing all the transactions. Each line
corresponds to a transaction. Each transaction has items that are comma separated. For Problem 1
you can use the Data/movies.small1.csv file to test the correctness of your algorithm.
2. Support: Integer that defines the minimum count to qualify as a frequent itemset.
Output:
A file in the format shown in the snapshot of the Execution Example section below. In particular, for
each line you should output the frequent itemsets you found for the current combination followed by
an empty line after each combination. The printed itemsets must be sorted in ascending order. A
higher level description of this format is:
(frequent_singleton1), (frequent_singleton2), , (frequent_singletonK)
(frequent_pair1), (frequent_pair2), , (frequent_pairM)
(frequent_triple1), (frequent_triple2), , (frequent_tripleN)
Execution Example
The first argument passed to our program in the below execution is the case number we want to run the
algorithm against. The second input is the path to the ratings input file and the third is the support
threshold value. Following we present examples of how you can run your program with spark submit both
when your application is a Java/Scala program or a Python script.
A. Example of running a Java/Scala application with spark-submit:
Notice that the argument class of the spark-submit specifies the main class of your application
and it is followed by the jar file of the application.
For Case 1
For Case 2
B. Example of running a Python application with spark-submit:
Case 1
Case 2
The solution of the above execution for case 1 is similar to the following snapshot. Since both the
movie ids and the user ids are integers the format of the output will be the same in both cases:
Solution of A.Case1 and B.Case1 Snaphosts, with input case number 1 , input file movies.small1.csv and support
threshold equal to 4:
Deliverables for Problem 1
1. Script or Jar File and Source Code
Please name your Python script as:
Or if you submit a jar file as:
The python script or the .jar file of your implementation should be inside the
Solution directory of your submission. You must also include a directory, any
directory name is fine, with your source code inside Solutions.
2. Output Files
We need two output files for Problem 1.
For case 1, run your program against Data/movies.small2.csv dataset with support 3.
For case 2, run your program against Data/movies.small2.csv dataset with support 4.
The format of the output should be exactly the same as the above snapshot for both cases.
The names of the output files should be as:
The above output files should be placed inside the OutputFiles directory of your
submission.
3. Description
Inside the Firstname_LastName_Description pdf document please write the command
line that you used with spark-submit in order to run your code. Specify also the Spark
version that you use to write your code. If it is a jar file, please specify the name of the
main class of your app as shown in the above snapshots. We will use this in order to
rerun your code against different support values if needed.
Problem 2 (60 Points)
Implementing SON using Spark with the MovieLens Small Dataset
The requirements for Problem 2 are similar to Problem 1. However, here we would like to check for the
performance of our implementation for a larger dataset. We would like to find the frequent itemsets
among a larger number of records. For this purpose, a good indicator of how well our algorithm works is
the total execution time. In this execution time we take into account also the time of reading the files
from the disk.Following, we provide a table of execution time for two threshold values for each case
described in the first section. You can use this array as an evaluation metric of your implementation.
CASE 1 CASE 2
Support Threshold Execution Time Support Threshold Execution Time
120 <100secs 180 <555secs 150 <70secs 200 <450secsDeliverables for Problem 2 1. Output FilesWe need four output files for Problem 2.For case 1, run your program against the ml-latest-small/ratings.csv dataset with support 120 and 150.For case 2, run your program against the ml-latest-small/ratings.csv dataset with support 180 and 200.The format of the output should be exactly the same as the one for Problem 1.The output files should be named as:
The above output files should be placed inside the OutputFiles directory of your
submission.
2. Description
Inside the Firstname_LastName_Description.pdf document of your submission please
include a table that is exactly the same with the one provided on the top of this section.
You must use the same support threshold values as the table above and include the
execution times of your implementation for each case. We will rerun your code so make
sure the times you record on the table are the ones corresponding to your implementation.
Grade breakdown
a. Four correct output files (5pts each)
b. Your execution time needs to be smaller than the ones in the table (5pts each)
c. Your execution time needs to be less than 2/3 of the ones in the table (5pts each)
Problem 3 (20 Points)
Implementing SON using Spark with the MovieLens Big Dataset
For the last part of this assignment we will run our application using the big MovieLens dataset, located
inside the ml-20m/ratings.csv of the downloaded zip file. Since the purpose of this assignment is to test
the efficiency and see how you can optimize your implementation we also provide some execution times
similarly as we did in Problem 2. In this execution time we take into account also the time of reading the
files from the disk.
CASE 1 CASE 2
Support Threshold Execution Time Support Threshold Execution Time
29000 <700secs 2500 <650secs 30000 <600secs 3000 <580secsDeliverables for Problem 3 1. Output FilesWe need four output files for Problem 3.For case 1, run your program against the ml-20m/ratings.csv dataset with support29000 and 30000.For case 2, run your program against the ml-20m/ratings.csv dataset with support2500 and 3000. The format of the output should be exactly the same as the one for Problem 1.The output files should be named as:
The above output files should be placed inside the OutputFiles directory of your
submission.
2. Description
Inside the Firstname_LastName_Description.pdf document of your submission please
include a table that is exactly the same with the one provided on the top of this section.
You must use the same support threshold values as the table above and include the
execution times of your implementation for each case. Dont make up the numbers
because we will rerun your program.
Grade breakdown
a. Four correct output files (2.5 pts each)
b. Your execution time needs to be smaller than the ones in the table (2.5 pts each)
Bonus (5pts): Describe why did we need to use such a large support threshold and where do you
think there could be a bottleneck that could result in a slow execution for your implementation, if
any.
General Instructions:
1. Make sure your code compiles before submitting
2. Make sure to follow the output format and the naming format.
Grading Criteria:
1.If your programs cannot be run with the commands you provide, your submission will be graded
based on the result files you submit and 20% penalty for it.
2. If the files generated by your programs are not sorted based on the specifications, there will be
20% penalty.
3.If your program generates more than one file, there will be 20% penalty.
4. If you dont provide the source code and just the .jar file in case of a Java/Scala application
there will be 60% penalty.
5. If your submission does not state inside the Description pdf file how to run your code, which
Spark version you used and which approach you followed to implement your algorithm
there will be a penalty of 30%.
6.There will be 20% penalty for late submission.
APPENDIX A
You need to take into account the Monotonicity of the Itemsets
You need to leverage Spark capabilities of processing partitions/chunks of data and analyze the
data within each partition.
You need to reduce the support threshold according to the size of your partitions.
You should emit appropriate (key, value) pairs in order to speed up the computation time.
Try to avoid data shuffling during your execution.
Pay great attention on the thresholds number for each case. The lower the threshold the more the
computation. Do not try arbitrary threshold values. Try testing values within the given ranges.
APPENDIX B
Please include the following information inside your description document.
Succinctly describe your approach to implement the algorithm.
Command line command to execute your program
Problem 2 execution table
Problem 3 execution table
Reviews
There are no reviews yet.