1 Clustering
The code this week provides the three clustering algorithms we discussed: k-means, agglomerative clustering, and
EM for Gaussian mixture models; we will explore the first two here. (These functions are also provided in many
3rd party toolboxes; you are free to use those if you prefer.) In this problem, youll do some basic exploration of the clustering techniques.
- Load the usual Iris data restricted to the first two features, and ignore the class / target variable. Plot the data and see for yourself how clustered you think it looks. Include the plot, and mention how many clusters you think exist (no wrong answer here).
ml.plotClassify2D(None,X,z) |
- Run k-means on the data, for k = 2, k = 5, and k = 20. Try a few (at least 5 each) different initializations and check to see whether they find the same solution; if not, pick the one with the best score. For the chosen assignment for each k, include a plot with the data, colored by assignment, and the cluster centers. You can plot the points colored by assignments using, where z are the resulting cluster assignments of the data. You will have to additionally plot the centers yourself. (15 points)
ml.cluster.agglomerative | from | cluster.py |
- Run agglomerative clustering on the data, using single linkage and then again using complete linkage, each with 2, 5, and then 20 clusters (using). Again, plot with color the final assignment of the clusters. (This algorithm has no initialization issues; so you do not have to try multiple initializations.) (20 points)
- Describe similarities and differences in the results from the agglomerative clustering and k-means. (5 points)
2 EigenFaces
In class we discussed that PCA has been applied to faces, and showed some of the results. Here, youll explore this representation yourself. First, load the data and display a few faces to make sure you understand the data format:
X = np.genfromtxt(data/faces.txt, delimiter=None) # load face dataset plt.figure()# pick a data point i for display img = np.reshape(X[i,:],(24,24)) # convert vectorized data to 2424 image patches plt.imshow( img.T , cmap=gray) # display image patch; you may have to squint |
1
2
3
4
5
- Subtract the mean of the face images (X0 = X ) to make your data zero-mean. (The mean should be of the same dimension as a face, 576 pixels.) Plot the mean face. (5 points)
scipy.linalg.svd |
- Useto take the SVD of the data, so that
X0 = U diag(S) Vh
W = U.dot( np.diag(S) ) |
Since the number of data is larger than the number of dimensions, there are at most 576 non-zero singular values; you can use full_matrices=False to avoid using a lot of memory. As in the slides, we suggest computingso that X0 W Vh. Print the shapes of W and Vh. (10 points)
np.mean( (X0 X0)**2 ) |
- For K = 1 . . . 10, compute the approximation to X0 given by the first K eigendirections, e.g., X0 = W[:, : K] Vh[: K, :], and use them to compute the mean squared error in the SVDs approximation,
. Plot these MSE values as a function of K. (10 points)
2*np.median(np.abs(W[:,j])) |
- Display the first three principal directions of the data, by computing + V[j,:] and V[j,:], where is a scale factor (we suggest, for example,, to get a sense of the scale found in the data). These should be vectors of length 242 = 576, so you can reshape them and view them as face images just like the original data. They should be similar to the images in lecture. (10 points)
- Choose any two faces and reconstruct them using the first K principal directions, for K = 5, 10, 50, 100. (5 points)
- Methods like PCA are often called latent space methods, as the coefficients can be interpreted as a new geometric space in which the data are being described. To visualize this, choose a few faces (say 25), and display them as images with the coordinates given by their coefficients on the first two principal components:
idx = # pick some data (randomly or otherwise); an array of integer indicesimport mltools.transforms coord,params = ml.transforms.rescale( W[:,0:2] ) # normalize scale of W locations plt.figure(); plt.hold(True); # you may need this for pyplot for i in idx:# compute where to place image (scaled W values) & size loc = (coord[i,0],coord[i,0]+0.5, coord[i,1],coord[i,1]+0.5)img = np.reshape( X[i,:], (24,24) ) # reshape to square plt.imshow( img.T , cmap=gray, extent=loc ) # draw each imageplt.axis( (-2,2,-2,2) ) # set axis to a reasonable scale |
1
2
3
4
5
6
7
8
9
10
11
This can often help you get a feel for what the latent representation is capturing. (10 points)
Statement of Collaboration
It is mandatory to include a Statement of Collaboration in each submission, with respect to the guidelines below. Include the names of everyone involved in the discussions (especially in-person ones), and what was discussed.
All students are required to follow the academic honesty guidelines posted on the course website. For programming assignments, in particular, we encourage the students to organize (perhaps using Piazza) to discuss the task descriptions, requirements, bugs in my code, and the relevant technical content before they start working on it. However, you should not discuss the specific solutions, and, as a guiding principle, you are not allowed to take anything written or drawn away from these discussions (i.e. no photographs of the blackboard, written notes, referring to Piazza, etc.). Especially after you have started working on the assignment, try to restrict the discussion to Piazza as much as possible, so that there is no doubt as to the extent of your collaboration.
Reviews
There are no reviews yet.