In this homework, you will be working with the Yelp dataset. It contains 19 attributes: 15 discrete and 4 continuous ({latitude, longitude, reviewCount, checkins}). We already did some exploring with this dataset in homework 1. Your task for this homework is to implement the k-means algorithm and apply it to the continuous attributes in the data.
Your submission will be run on a hidden dataset that is different from the given dataset. The hidden dataset will have the same column names for the continuous attributes as the given dataset.
The features you will use are the 4 continuous attributes in yelp.csv.
1.2 Skeleton Code
You are provided a skeleton structure with this homework:
cs373-hw5/
    handout.pdf
    data/
        given/
            yelp.csv
    src/
        __init__.py
        kmeans.py
    report/
        report.pdf
Do not modify the given folder structure. You should not need to, but you may, add extra code files in the cs373-hw5/src/ folder. You should place your report (report.pdf) in the cs373-hw5/report/ folder. You must not modify __init__.py. All your coding work for this homework must be done inside the cs373-hw5/src/ folder. You are not allowed to use any external libraries for classifier implementations, such as scikit-learn. Make sure that you understand the given code fully before you start coding.
Your code will be tested using python2.7 from inside the cs373-hw5/src/ folder. If you wish to use python3.6, you need to place an empty file named python3 in your src/ folder. Make sure that you test your code on data.cs.purdue.edu before submitting.
1.3 Expected Output
Your python script should take three arguments as input:
- trainingDataFileName: corresponds to a subset of the data that should be used as the training set for your algorithm.
- K: the value of k to use when clustering.
- clustering option: takes one of the following five values: 1 (use the four original attributes for clustering, corresponding to Q3.1), 2 (apply a log transform to reviewCount and checkins, corresponding to Q3.2), 3 (use the standardized four attributes for clustering, corresponding to Q3.3), 4 (use the four original attributes and Manhattan distance for clustering, corresponding to Q3.4), and 5 (use a 3% random sample of the data for clustering, corresponding to Q3.5).
Your code should read in the training set from the csv file, cluster it using the specified value of K, and output the within-cluster sum of squared error and the cluster centroids. For the centroid of each cluster, report the values of the four attributes in the following order: {latitude, longitude, reviewCount, checkins}.
The expected output is given below. Note that this will run your algorithm with the yelp.csv file, a K value of 4, and clustering option 1. Please make sure you follow this output format, else you WILL lose points. Note that this is only a sample output and the numbers are not representative of the actual results.
$ python kmeans.py yelp.csv 4 1
WC-SSE=15.2179
Centroid1=[49.00895,8.39655,12,3]
CentroidK=[33.33548605,-111.7714182,9,97]
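
For reference, a minimal sketch of how the three arguments could be read and the output printed in this exact format is given below. It assumes the four column names listed above; the load_features helper and the placeholder values for wc_sse and centroids are illustrative only and stand in for your own implementation.

import csv
import sys

def load_features(path):
    # Read the four continuous attributes in the order required for output:
    # latitude, longitude, reviewCount, checkins.
    with open(path) as f:
        return [[float(r['latitude']), float(r['longitude']),
                 float(r['reviewCount']), float(r['checkins'])]
                for r in csv.DictReader(f)]

def main():
    training_file = sys.argv[1]   # e.g. yelp.csv
    k = int(sys.argv[2])          # number of clusters
    option = int(sys.argv[3])     # clustering option, 1-5

    data = load_features(training_file)
    # ... run your k-means on data according to option here ...
    wc_sse = 0.0                                           # placeholder result
    centroids = [[0.0, 0.0, 0.0, 0.0] for _ in range(k)]   # placeholder result

    print("WC-SSE=%s" % wc_sse)
    for i, c in enumerate(centroids, start=1):
        print("Centroid%d=[%s]" % (i, ",".join(str(v) for v in c)))

if __name__ == '__main__':
    main()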
2 K-means
2.1 Theory
- (10 points) What are the benefits of the k-means clustering algorithm? What are the issues? In which situations should we use k-means? Your answer should compare it to other algorithms learned in class (both unsupervised and supervised), in fewer than 4 sentences.
2.2 Implementation
You need to implement the k-means clustering algorithm for this part. This part could be completed by editing only kmeans.py. You need to follow the description of the models discussed in the lecture slides (link) with the following specifications.
Features: Consider the 4 continuous attributes in yelp.csv for X.
Distance: Use Euclidean distance unless otherwise specified.
Score function: Use the within-cluster sum of squared error, where r_k is the centroid of cluster C_k and d is the distance function:

wc(C) = \sum_{k=1}^{K} \sum_{x^{(i)} \in C_k} d(x^{(i)}, r_k)^2    (1)
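
As an illustration, Equation (1) translates directly into code along these lines (Euclidean distance, plain Python; the function names are illustrative):

import math

def euclidean(x, r):
    # d(x, r): Euclidean distance between a point and a centroid
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, r)))

def wc_sse(data, assignments, centroids, dist=euclidean):
    # Equation (1): for every point x^(i), add the squared distance to the
    # centroid r_k of the cluster C_k it is assigned to.
    total = 0.0
    for x, k in zip(data, assignments):
        total += dist(x, centroids[k]) ** 2
    return total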
Make sure to also implement the following clustering options, described in Section 1.3 (a preprocessing sketch is given after this list).
- The four original attributes for clustering (corresponding to 3.1).
- A log transform to reviewCount and checkins (corresponding to 3.2).
- Standardize the 4 attributes for clustering (corresponding to 3.3).
- Four original attributes and Manhattan distance for clustering (corresponding to 3.4).
- A random sample of the data for clustering (corresponding to 3.5).
Report the results obtained on the given train set in your report.
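
One possible way to organize the preprocessing for these options is sketched below; the option numbering follows Section 1.3, but the helper names, the structure, and the guard against zero counts are only suggestions.

import math
import random

def preprocess(data, option, sample_fraction=0.03):
    # data: list of [latitude, longitude, reviewCount, checkins] rows.
    if option == 2:
        # Log transform of reviewCount and checkins (guard against zeros
        # if your data contains them).
        return [row[:2] + [math.log(row[2]), math.log(row[3])] for row in data]
    if option == 3:
        # Standardize every attribute: subtract the mean, divide by the
        # standard deviation.
        cols = list(zip(*data))
        means = [sum(c) / float(len(c)) for c in cols]
        stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
                for c, m in zip(cols, means)]
        return [[(v - m) / s for v, m, s in zip(row, means, stds)]
                for row in data]
    if option == 5:
        # Random sample of the data; use the fraction the handout specifies.
        return random.sample(data, max(1, int(sample_fraction * len(data))))
    # Options 1 and 4 keep the original attributes; option 4 only switches
    # the distance function to Manhattan elsewhere in the code.
    return data

def manhattan(x, r):
    return sum(abs(xi - ri) for xi, ri in zip(x, r))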
3 Analysis
You only need to include your plots and discussions in your report. Make sure that the code you submit doesn't include any changes you don't want included.
- (10 points) Cluster the Yelp data using k-means.
- Use a random set of examples as the initial centroids.
- Use values of K = [3,6,9,12,24].
- Plot the within-cluster sum of squares (wc) as a function of K.
- Choose an appropriate K from the plot and argue why you chose this particular K.
- For the chosen value of K, plot the clusters with their centroids in two ways: first using latitude and longitude, and second using reviewCount and checkins (a plotting sketch is given below). Discuss whether any patterns are visible.
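
For the report plots, a sketch along the following lines could be used, assuming matplotlib is available for the analysis (it is not needed inside kmeans.py itself); data, assignments, and centroids come from your own run, and the function name is illustrative.

import matplotlib.pyplot as plt

def plot_clusters(data, assignments, centroids, xi, yi, xlabel, ylabel):
    # Scatter each cluster's points in the two chosen attributes (column
    # indices xi and yi) and overlay the centroids.
    for k in range(len(centroids)):
        xs = [row[xi] for row, a in zip(data, assignments) if a == k]
        ys = [row[yi] for row, a in zip(data, assignments) if a == k]
        plt.scatter(xs, ys, s=5, label='cluster %d' % (k + 1))
    plt.scatter([c[xi] for c in centroids], [c[yi] for c in centroids],
                marker='x', c='black', s=80, label='centroids')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()
    plt.show()

# e.g. plot_clusters(data, assignments, centroids, 0, 1, 'latitude', 'longitude')
#      plot_clusters(data, assignments, centroids, 2, 3, 'reviewCount', 'checkins')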
- (10 points) Do a log transform of reviewCount and checkins. Describe how you expect the transformation to change the clustering results. Then repeat the analysis from (1). Discuss any differences in the results.
- (10 points) Transform the four original attributes so that each attribute has mean = 0 and stdev = 1. You can do this with the numpy functions numpy.mean() and numpy.std() (i.e., subtract the mean, divide by the stdev); a short numpy sketch follows this item. Describe how you expect the transformation to change the clustering results. Then repeat the analysis from (1). Discuss any differences in the results.
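
A minimal sketch of this standardization with numpy (the function and variable names are illustrative):

import numpy as np

def standardize(data):
    # data: 2-D array-like with one row per example and the four
    # continuous attributes as columns.
    arr = np.asarray(data, dtype=float)
    return (arr - arr.mean(axis=0)) / arr.std(axis=0)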
- (10 points) Use Manhattan distance instead of Euclidean distance in the algorithm. Describe how you expect the change to affect the clustering results. Then repeat the analysis from (1). Discuss any differences in the results.
- (10 points) Take a 6% random sample of the data. Describe how you expect the downsampling to change the clustering results. Then run the analysis from (1) five times and report the average performance (a sketch of this procedure follows this item). Specifically, you should use a single random 6% sample of the data, then run 5 trials in which you start k-means from different random choices of the initial centroids. Report the average wc when you plot wc vs. K. For your chosen K, determine which trial had performance closest to the reported average and plot the centroids from that trial. Discuss any differences in the results and comment on the variability you observe.
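
One way to structure the five trials is sketched below; run_kmeans stands for your own routine, which should pick new random initial centroids on every call and return the final WC-SSE together with the centroids.

def average_wc_over_trials(sample, k, run_kmeans, trials=5):
    # Run k-means `trials` times on the same sample and average the WC-SSE;
    # also return the trial whose score is closest to that average, so its
    # centroids can be plotted.
    results = [run_kmeans(sample, k) for _ in range(trials)]
    scores = [wc for wc, _ in results]
    avg = sum(scores) / float(len(scores))
    closest = min(results, key=lambda r: abs(r[0] - avg))
    return avg, closest

# usage sketch: draw one random sample of the data (e.g. with random.sample)
# and reuse it for every K in [3, 6, 9, 12, 24], plotting the average wc vs. K.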
- (10 points) Improve the score function. To evaluate a clustering, it is not sufficient to measure only the within-cluster sum of squares (wc) used above; it is also desirable for each cluster to be separated from the others as much as possible. To improve the resulting clustering, define your own score function that takes into account not only the compactness of the clusters but also their separation. Write a formal mathematical expression of your score function and explain why you think it is better than the within-cluster sum of squares. Also, using the best configuration from Questions 1-5, plot the results of your score function for K = [3, 6, 9, 12, 24], and compare the results to the appropriate algorithm from Questions 1-5.
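
Purely as an illustration of the kind of expression intended (not necessarily the score you should choose), one could divide compactness by a between-centroid separation term:

score(C) = wc(C) / bc(C),   where   bc(C) = \sum_{k=1}^{K} \sum_{l=k+1}^{K} d(r_k, r_l)^2,

so that lower values correspond to clusters that are both tight and well separated.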
4 Time Limit
Your code must terminate within 8 minutes for each clustering model with 4 or fewer clusters. If it doesn't terminate within 8 minutes, you will be graded on the output your code generates at the end of the 8-minute time limit. If your code doesn't converge by then, it would be a good idea to print the results you have at that point or use a different convergence criterion.
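
A simple way to respect this budget is to check elapsed wall-clock time inside the main loop, as in the sketch below; the step callable stands for one assignment-plus-update pass of your own implementation, and the margin and iteration cap are arbitrary choices.

import time

TIME_LIMIT_SECONDS = 8 * 60   # per-model budget from this section
MARGIN_SECONDS = 10           # leave time to print the current results

def run_with_time_limit(step, max_iters=300):
    # Call step() until it reports convergence, the iteration cap is hit,
    # or the time budget is nearly exhausted.
    start = time.time()
    for _ in range(max_iters):
        converged = step()
        if converged or time.time() - start > TIME_LIMIT_SECONDS - MARGIN_SECONDS:
            break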