DSME6650: Data Mining for Managers Assignment 1
Due Date: 23:59 Jan.19, 2020
Submission Instruction:
For this assignment, please submit electronic version only. For the programming questions, you need to submit BOTH your Python codes and your results. Submit codes as zipped tar file and the output of your program in plain-text file. For other questions, answer them in a word document. You should place the relevant files in their separate directory (preferable A and B for each part).
After that, please compress all your files as one zip named using your student id, e.g., 11xxxxxxxx.zip, and submit to Blackboard.
Part A. Clustering
The following problem is based on lecture 5.
1. Practice the K-means algorithm using Manhattan distance. Suppose there are 15 data points:
A1(1.0, 1.0), A2(1.0, 2.0), A3(2.0, 0.0), A4(2.0, 3.0), A5(5.0, 10.0),
A6(5.0, 12.0), A7(6.0, 2.0), A8(6.0, 4.0), A9(6.0, 10.0), A10(7.0, 2.0),
A11(7.0, 5.0), A12(9.0, 9.0), A13(10.0, 7.0), A14(10.0, 9.0), A15(11.0, 8.0).
Assume we cluster these 15 data points into 4 clusters. The initial seeds are A1, A5, A7, A12.
(1) Pleasedrawalldatapointsandtheinitialseedsinafigurewithgridlineintherangeofx: [0, 12] and y: [-2, 14].
(2) After running the K-means algorithm for 1 epoch only, answer the following questions one by one:
a. The new clusters (i.e., the examples belonging to each cluster)
b. The centers of the new clusters
c. How many more iterations are needed to converge? Please write down the final centers and plot the results.
(3) If we use A7, A8, A10, A11 as initial seeds, how many more iterations are needed to converge? Please write down the final centers and plot the results.
(4) If we apply K-mean++ to the data and choose A1 as the first seed, which point will have the highest probability to be chosen as the second seed?
For these problems, write down your answers for all questions in the word file. For question (2), (3) and (4), you need to write a program to simulate the process of K-means and put your codes in directory A which will be zipped into the final submitted file. Please note that you should write a README.txt file (in the directory A), which indicates how to run the program to get the results you provided (hyper-parameters of the model should be given if needed). Without an appropriate README.txt will be deducted certain scores.
Part B. Graph Representation
The following problem is based on lecture 6.
Figure 1: Figure for Graph Analysis.
Given a small graph shown in Fig. 1, please answer the following questions:
1. Compute the adjacency matrix, the degree matrix, the unnormalized Laplacian matrix, and the normalized Laplacian matrix.
2. You are required to implement a Spectral Clustering method to cluster the clusters of the given graph (k=2) on both unnormalized and normalized Laplacian matrix.
For question 1, write down your answers in a word file. For question 2, you need to write a program to simulate the process of spectral clustering and put your codes in directory B which will be zipped into the final submitted file. Please note that you should write a README.txt file (in the directory B), which indicates how to run the program to get the results you provided (hyper-parameters of the model should be given if needed). Without an appropriate README.txt will be deducted certain scores.
Reviews
There are no reviews yet.