The objective of this lab is for you to explore the behavior of several representative clustering algorithms in Matlab and apply them to some datasets. The TA will first demonstrate the results of the algorithms on a toy dataset and the MNIST dataset. Then, you will replicate those results and further explore the datasets with the algorithms. We provide you with the following:
- lab04.m: this script sets up the problem (toy dataset or MNIST) and plots various figures. The actual algorithms are implemented in the functions below.
- kmeans.m, GMEM.m, mean_shift.m and cc_eball.m: implement clustering by k-means, EM with Gaussian mixtures, mean-shift and connected-components, respectively.
- For several of the algorithms we use the Gaussian mixture tools.
I Datasets
Construct your own toy datasets in 2D, such as Gaussian clusters with more or less overlap, or clusters with curved
shapes as in the 2moons dataset. You will also use the MNIST dataset of handwritten digits.
Since clustering algorithms are unsupervised, they do not use the class labels $y_n \in \{0,\dots,9\}$, only the instances $\mathbf{x}_n \in \mathbb{R}^D$ (where $D = 784$). You can use the labels to see whether they agree with the clusters found by an algorithm.
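To get started, here is a minimal Matlab sketch for generating a toy dataset of two Gaussian clusters and a rough two-moons-style dataset. The cluster means, covariances and noise level are illustrative choices, not values prescribed by the lab.

```matlab
% Toy datasets in 2D (illustrative parameters, not prescribed by the lab).
rng(0);                                  % reproducible
N = 200;                                 % points per cluster

% --- Gaussian clusters with some overlap ---
mu1 = [0 0]; mu2 = [3 1];                % cluster means (arbitrary)
S1 = [1 0.3; 0.3 0.5]; S2 = 0.4*eye(2);  % cluster covariances (arbitrary)
Xg = [mvnrnd(mu1,S1,N); mvnrnd(mu2,S2,N)];

% --- Two-moons-style dataset ---
t = pi*rand(N,1);
upper = [cos(t) sin(t)];                 % upper half-circle
lower = [1-cos(t) 0.5-sin(t)];           % lower half-circle, shifted
Xm = [upper; lower] + 0.05*randn(2*N,2); % add a little noise

figure; scatter(Xg(:,1),Xg(:,2),10,'filled'); axis equal; title('Gaussian clusters');
figure; scatter(Xm(:,1),Xm(:,2),10,'filled'); axis equal; title('2moons-style');
```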
II Using clustering algorithms
Algorithms and plots We consider the following algorithms: k-means, EM for Gaussian mixtures with full covariances, mean-shift and connected-components. Figure 1 shows pseudocode for the algorithms. For algorithms that would take an infinite number of iterations to converge, we stop them after maxit iterations (e.g. 100) or once the error function changes by less than a small tolerance tol (e.g. $10^{-3}$); a minimal code sketch of this stopping rule appears after the list below. With toy datasets in 2D, we plot the following figures:
- For every algorithm: the dataset in 2D, with points colored according to the cluster they belong to.
- For k-means: the value of the error function after each iteration; it should decrease monotonically and stop in a finite number of iterations.
- For EM with Gaussian mixtures: the value of the log-likelihood function after each iteration; it should increase monotonically. To get a hard clustering, we assign point $\mathbf{x}_n$ to cluster $k$ if $p(k|\mathbf{x}_n) > p(j|\mathbf{x}_n)\ \forall j \neq k$. To get a soft clustering (which is more informative), we plot $p(k|\mathbf{x})$ itself for each cluster as a function of $\mathbf{x} \in \mathbb{R}^2$ (as a color plot, or as a contour plot for each cluster).
- For mean-shift: the contours of the kernel density estimate and its modes; and, for any given point $\mathbf{x}_n$, the value of the density $p(\mathbf{x})$ after each mean-shift iteration (initialized from $\mathbf{x}_n$), which should increase monotonically.
- For connected-components: the dataset in 2D, with points connected by edges in the $\epsilon$-ball graph.
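As referenced above, here is a minimal sketch of a k-means loop with the maxit/tol stopping rule and the error-function plot. It is illustrative only and does not follow the lab's kmeans function; the variable names and the random initialization scheme are assumptions.

```matlab
% Minimal k-means loop with the maxit/tol stopping rule (illustrative sketch).
% Assumes X (N x D data), K, maxit (e.g. 100) and tol (e.g. 1e-3) are set.
N = size(X, 1);
C = X(randperm(N, K), :);              % centroids initialized at K random points
E = zeros(maxit, 1);                   % error function value per iteration
for it = 1:maxit
  D2 = pdist2(X, C).^2;                % squared distances to every centroid
  [mind2, labels] = min(D2, [], 2);    % assignment step: closest centroid
  E(it) = sum(mind2);                  % k-means error function
  for k = 1:K                          % update step: mean of assigned points
    if any(labels == k), C(k,:) = mean(X(labels == k, :), 1); end
  end
  if it > 1 && abs(E(it-1) - E(it)) < tol, E = E(1:it); break, end
end
plot(E, 'o-'); xlabel('iteration'); ylabel('k-means error');  % should decrease
```

The error curve should decrease monotonically and flatten out once the assignments stop changing.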
With the MNIST dataset, try EM with Gaussian mixtures (and also k-means) with different values of K and plot the following (a sketch using Matlab's built-in GMM tools appears after this list):
- The mean $\boldsymbol{\mu}_k$ of each cluster $k = 1,\dots,K$, as a grayscale image, with its mixing proportion $\pi_k = p(k)$.
- The posterior probabilities $p(k|\mathbf{x}_n)$ for $k = 1,\dots,K$ for a given digit image $\mathbf{x}_n$, plotted as a bar chart.
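A hedged sketch of these two plots using Matlab's built-in fitgmdist and posterior (Statistics and Machine Learning Toolbox) rather than the lab's GMEM.m. The variable X (an N x 784 MNIST matrix) and the choice K = 10 are assumptions, and diagonal covariances are used here only to keep the sketch fast; the lab itself uses full covariances.

```matlab
% Illustrative sketch with Matlab's GMM tools (not the lab's GMEM.m).
% Assumes X is an N x 784 matrix of MNIST images with pixel values in [0,1].
K = 10;                                          % an assumed choice of K
gm = fitgmdist(X, K, 'CovarianceType','diagonal', ...
               'RegularizationValue',1e-10, 'Options',statset('MaxIter',100));

% Cluster means as grayscale images, with mixing proportions in the titles.
figure;
for k = 1:K
  subplot(2, ceil(K/2), k);
  imagesc(reshape(gm.mu(k,:), 28, 28)'); colormap gray; axis off;
  title(sprintf('\\pi_%d = %.2f', k, gm.ComponentProportion(k)));
end

% Posterior probabilities p(k|x_n) for one digit image, as a bar chart.
n = 1;                                           % index of the chosen image
figure; bar(posterior(gm, X(n,:))); xlabel('cluster k'); ylabel('p(k|x_n)');
```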
Exploration Explore each algorithm in different settings. First, using the same dataset:
- Try different values of the user parameter (number of clusters K for k-means and Gaussian mixtures with EM, bandwidth $\sigma > 0$ for mean-shift, scale $\epsilon > 0$ for connected-components).
- For algorithms that depend on the initialization (k-means and EM), try different random initializations.
Then, explore the algorithms and plots with different datasets, numbers of clusters, clusters with more or less overlap, clusters with different shapes, etc. See the end of the file lab04.m for suggestions of things to explore. A small sketch for looping over several K values and random restarts follows below.
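A minimal sketch for trying several values of K and several random initializations, assuming a 2D data matrix X and Matlab's built-in kmeans (the lab's own functions may have different names and signatures).

```matlab
% Try several K values and several random restarts (illustrative sketch).
Ks = 2:6;                                  % candidate numbers of clusters
figure;
for i = 1:numel(Ks)
  % 'Replicates' reruns k-means from different random initializations
  % and keeps the solution with the lowest error.
  [labels, C, sumd] = kmeans(X, Ks(i), 'Replicates', 5, 'MaxIter', 100);
  subplot(1, numel(Ks), i);
  gscatter(X(:,1), X(:,2), labels); hold on;
  plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
  title(sprintf('K = %d, error = %.2f', Ks(i), sum(sumd)));
end
```

Comparing the panels shows how the clustering and its error change with K, and rerunning the script shows the effect of different random initializations.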
Notes
- Some of these algorithms may be slow (in particular, EM and mean-shift). Use small datasets to get results fast.
- For EM with Gaussian mixtures, we add a small number to the diagonal of each covariance matrix $\boldsymbol{\Sigma}_k$ (e.g. $10^{-10}\,\mathrm{tr}(\boldsymbol{\Sigma}_k)/D$) to make $\boldsymbol{\Sigma}_k$ full rank. We do this right after updating $\boldsymbol{\Sigma}_k$ in the M step.
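In code, this regularization is a one-liner placed right after the covariance update in the M step (sketch; Sigma_k and D are assumed variable names):

```matlab
% Right after computing Sigma_k in the M step (assumed variable names):
Sigma_k = Sigma_k + (1e-10 * trace(Sigma_k) / D) * eye(D);   % keep Sigma_k full rank
```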
Figure 1: Pseudocode for k-means, EM for Gaussian mixtures, mean-shift for the Gaussian kernel and connected-components for an $\epsilon$-ball graph (using a distance function $d(\cdot,\cdot)$). In all cases, the input is a dataset $\mathbf{x}_1,\dots,\mathbf{x}_N \in \mathbb{R}^D$ and a user parameter: number of clusters K for k-means and Gaussian mixtures with EM, bandwidth $\sigma > 0$ for mean-shift, and scale $\epsilon > 0$ for connected-components. For details omitted (...), see the lecture notes.
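Since the figure itself is not reproduced here, the following sketch illustrates the two non-parametric algorithms it describes: one mean-shift update with a Gaussian kernel, and connected-components on an $\epsilon$-ball graph using Euclidean distance and Matlab's graph/conncomp. All variable names are assumptions.

```matlab
% --- Mean-shift with a Gaussian kernel (illustrative sketch) ---
% X: N x D data, sigma: bandwidth, x: current point (1 x D).
% One mean-shift update: weighted mean of the data under Gaussian kernel weights.
w = exp(-pdist2(x, X).^2 / (2*sigma^2));     % 1 x N kernel weights
x = (w * X) / sum(w);                        % move x towards a density mode
% Iterating this update until x stops moving converges to a mode of the kernel
% density estimate; points converging to the same mode form one cluster.

% --- Connected-components on an eps-ball graph (illustrative sketch) ---
% Two points are joined by an edge if their Euclidean distance is <= eps_scale.
A = pdist2(X, X) <= eps_scale;               % adjacency matrix of the eps-ball graph
A(1:size(A,1)+1:end) = 0;                    % remove self-loops
G = graph(A);                                % build an undirected graph
labels = conncomp(G)';                       % cluster label = connected component
```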