Outline
Latent Variable Models
The Expectation-Maximization Procedure
Gaussian Mixture Models
K-Means Clustering
Kernel K-Means
1/24
Motivation
[Figure: PCA projections of the Iris, Boston, Diabetes, and Digits datasets]
Complex data cannot be modeled accurately by standard probability distributions (e.g. Gaussian) and require a more complex description, e.g. with latent variables.
2/24
Latent Variable Models
A latent variable model (LVM) is a description of the data distribution that makes use of a layer of abstraction (latent variables) to improve modeling capability.
[Diagram: classical description of the data via p(x|θ) vs. latent variable description via p(z|θ) and p(x|z,θ)]
Relation between the classical and the latent variable description:

p(x|\theta) = \sum_{z \in \mathcal{Z}} p(z|\theta) \, p(x|z,\theta)
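For concreteness, here is a minimal Python sketch evaluating this marginalization for a hypothetical two-component, one-dimensional mixture (all values are illustrative, not taken from the lecture):

```python
import numpy as np

# Hypothetical two-component model in 1-D (illustrative values only)
p_z = np.array([0.3, 0.7])       # p(z|theta): probabilities of the latent states
mean = np.array([-1.0, 2.0])     # mean of p(x|z,theta) for each z
std = np.array([0.5, 1.0])       # standard deviation for each z

def gauss(x, m, s):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def p_x(x):
    """Marginal p(x|theta) = sum_z p(z|theta) * p(x|z,theta)."""
    return np.sum(p_z * gauss(x, mean, std))

print(p_x(0.0))  # density of the mixture at x = 0
```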
3/24
Learning a Latent Variable Model
Assume a dataset D whose samples have been drawn i.i.d. In that case, the log-likelihood of the data according to the model can be written as:

\log p(\mathcal{D}|\theta) = \log \prod_{x \in \mathcal{D}} p(x|\theta) = \sum_{x \in \mathcal{D}} \log p(x|\theta)

and we wish to maximize that quantity. Injecting the latent variable model, the objective to optimize becomes

\sum_{x \in \mathcal{D}} \log \sum_{z \in \mathcal{Z}} p(z|\theta) \, p(x|z,\theta)

Problem: The maximum of log p(D|θ) cannot be found analytically.
4/24
Learning a Latent Variable Model
Strategy: If the function p(D|θ) cannot be easily optimized, find a lower-bound of that function that is easier to optimize.
5/24
Building a Lower-Bound
Jensen's inequality

For any λ in the d-dimensional simplex (i.e. λ ≥ 0 and λ⊤1 = 1) and any concave function f : R^d → R, we have

f\Big( \sum_{i=1}^{d} \lambda_i a_i \Big) \;\geq\; \sum_{i=1}^{d} \lambda_i f(a_i)

Application to the latent variable model:

Let λ be some probability distribution q(z|x) independent from θ.
Let a_i contain the remaining terms.
Because the latent variable objective Σ_{x∈D} log Σ_{z∈Z} p(z|θ) p(x|z,θ) does not contain such a distribution q(z|x) independent from θ, we create it!
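A quick numerical sanity check of the inequality for the concave function f = log, with a randomly drawn element of the simplex (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 5.0, size=5)        # arbitrary positive points
lam = rng.dirichlet(np.ones(5))          # random element of the simplex

lhs = np.log(np.dot(lam, a))             # f(sum_i lambda_i a_i)
rhs = np.dot(lam, np.log(a))             # sum_i lambda_i f(a_i)
assert lhs >= rhs                        # Jensen: f concave => lhs >= rhs
print(lhs, rhs)
```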
6/24
Building a Lower-Bound
Step 1: Add q(z|x) both to the numerator and the denominator:

\log p(\mathcal{D}|\theta) = \sum_{x \in \mathcal{D}} \log \sum_{z \in \mathcal{Z}} q(z|x) \, \frac{p(z|\theta) \, p(x|z,\theta)}{q(z|x)}

Step 2: Apply the Jensen inequality:

\geq \sum_{x \in \mathcal{D}} \sum_{z} q(z|x) \log \frac{p(z|\theta) \, p(x|z,\theta)}{q(z|x)}

and verify the bound:

= \sum_{x \in \mathcal{D}} \sum_{z} q(z|x) \log \frac{p(x|\theta) \, p(z|x,\theta)}{q(z|x)}

= \sum_{x \in \mathcal{D}} \sum_{z} q(z|x) \log p(x|\theta) \;-\; \sum_{x \in \mathcal{D}} \sum_{z} q(z|x) \log \frac{q(z|x)}{p(z|x,\theta)}

= \log p(\mathcal{D}|\theta) \;-\; \sum_{x \in \mathcal{D}} \underbrace{\mathrm{KL}\big(q(z|x) \,\|\, p(z|x,\theta)\big)}_{\geq\, 0}
7/24
Building a Lower-Bound
Question: How to ensure that

\mathrm{KL}\big(q(z|x) \,\|\, p(z|x,\theta)\big)

is small, so that maximizing the lower-bound over θ gives a good approximation of the true log-likelihood?
Chicken & egg problem:
If we use a simple q, e.g. uniformly distributed, we get a loose lower-bound from which we get a bad θ.
To get a tight lower-bound, we need to choose q(z|x) ≈ p(z|x, θ), which requires that we know θ.
8/24
The Expectation-Maximization (EM) Algorithm
Strategy: Start with some random solution θ and alternately estimate q and θ, until we converge.
9/24
The Expectation-Maximization (EM) Algorithm
\theta_0 \leftarrow \text{random}()

(E-step) \quad q_1(z|x) \leftarrow p(z|x, \theta_0)

(M-step) \quad \theta_1 \leftarrow \arg\max_\theta \sum_{x \in \mathcal{D}} \sum_{z} \log \frac{p(x, z|\theta)}{q_1(z|x)} \, q_1(z|x)

(E-step) \quad q_2(z|x) \leftarrow p(z|x, \theta_1)

(M-step) \quad \theta_2 \leftarrow \arg\max_\theta \sum_{x \in \mathcal{D}} \sum_{z} \log \frac{p(x, z|\theta)}{q_2(z|x)} \, q_2(z|x)

…
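A minimal sketch of this alternating scheme, assuming hypothetical model-specific helpers e_step and m_step (names and signatures are placeholders, not from the lecture):

```python
def expectation_maximization(X, theta, e_step, m_step, n_iter=100):
    """Generic EM loop: alternate the E-step and the M-step.

    e_step(X, theta) -> q      # posterior q(z|x) = p(z|x, theta) for each x
    m_step(X, q)     -> theta  # argmax_theta of sum_x sum_z q(z|x) log p(x,z|theta)
    (Both helpers are model-specific and assumed to be provided.)
    """
    for _ in range(n_iter):
        q = e_step(X, theta)   # E-step: tighten the lower bound
        theta = m_step(X, q)   # M-step: maximize the lower bound w.r.t. theta
    return theta, q
```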
10/24
The Expectation-Maximization (EM) Algorithm
Properties of the algorithm:
Block coordinate descent
Locally optimal step size
The algorithm lands in a local maximum of the function p(D|θ).
Advantages of EM compared to gradient descent:
no learning rate
no need to compute the gradients.
[Figure: EM as alternating optimization over q and θ, from (θ0, q1) to (θn, qn)]
11/24
Application: Cluster Analysis
[Figure: PCA projections of the Iris, Boston, Diabetes, and Digits datasets]
Question: Can we build a model of the cluster structure of the data?
12/24
The Gaussian Mixture Model (GMM)
The Gaussian Mixture Model is a latent variable model specialized for clustering.
[Diagram: latent variable description (general formulation) with p(z|θ) and p(x|z,θ) vs. the Gaussian mixture model with p(z = i|θ) and p(x|z = i, θ) for components i = 1, …, 4]
The latent variable z of a GMM is an integer between 1 and C. Each value of z indicates a cluster of the data.
For each cluster, we associate a different data-generating distribution (e.g. Gaussian with different means and covariances).
13/24
The Gaussian Mixture Model (GMM)
The GMM is formally defined by the two equations:
p(z = i|\theta) = \pi_i

p(x|z = i, \theta) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\Big( -\tfrac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i) \Big)

The parameters of the model to be learned are:
The mixing coefficients π_1, …, π_C, subject to π_i ≥ 0 and Σ_{i=1}^C π_i = 1.
The mean vectors μ_1, …, μ_C.
The covariances Σ_1, …, Σ_C, subject to positive semi-definiteness.
Various simplifications of the GMM model can be used in practice, e.g. for speed, statistical robustness, or simplicity of implementation:
diagonal/isotropic/tied/fixed covariance matrices, fixed mixing coefficients, etc.
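To make the generative process concrete, here is a minimal sampling sketch (the parameter values are made up for illustration): draw z from the mixing coefficients, then draw x from the corresponding Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative GMM parameters (C = 2 clusters in d = 2 dimensions)
pi = np.array([0.4, 0.6])                          # mixing coefficients
mu = np.array([[0.0, 0.0], [3.0, 3.0]])            # mean vectors
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])     # covariance matrices

def sample_gmm(n):
    """Draw n samples: z ~ p(z|theta), then x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)          # latent cluster indices
    x = np.array([rng.multivariate_normal(mu[i], Sigma[i]) for i in z])
    return x, z

X, Z = sample_gmm(500)
```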
14/24
EM for the GMM (simplified)
Consider the simplified GMM:
p(z = i|\theta) = \frac{1}{C}

p(x|z = i, \theta) = \frac{1}{(\pi/\gamma)^{d/2}} \exp\big( -\gamma \|x - \mu_i\|^2 \big)

E-step: (Apply Bayes' rule)

q(z = i|x) = \frac{p(z = i|\theta) \, p(x|z = i, \theta)}{p(x|\theta)} = \frac{\exp(-\gamma \|x - \mu_i\|^2)}{\sum_{j=1}^{C} \exp(-\gamma \|x - \mu_j\|^2)}

M-step: (Set the lower-bound gradient to zero)

\nabla_{\mu_i} \sum_{x \in \mathcal{D}} \sum_{j=1}^{C} \log\big( \exp(-\gamma \|x - \mu_j\|^2) \big) \, q(z = j|x) = 0
\quad\Rightarrow\quad
\mu_i = \frac{\sum_{x \in \mathcal{D}} q(z = i|x) \, x}{\sum_{x \in \mathcal{D}} q(z = i|x)}
15/24
Implementing GMMs
The (simplified) GMM model can be implemented easily in a few lines of code:
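A possible minimal NumPy implementation of EM for the simplified GMM above (uniform p(z), isotropic components with fixed scale γ; a sketch, not necessarily the lecture's code):

```python
import numpy as np

def gmm_em(X, C, gamma=1.0, n_iter=100, seed=0):
    """EM for the simplified GMM: uniform p(z), isotropic components with
    fixed scale gamma; only the means mu_1..mu_C are learned."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=C, replace=False)].astype(float)  # init from data

    for _ in range(n_iter):
        # E-step: q(z=i|x) is a softmax of -gamma * ||x - mu_i||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, C) squared distances
        logits = -gamma * d2
        logits -= logits.max(axis=1, keepdims=True)             # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)

        # M-step: mu_i = sum_x q(z=i|x) x / sum_x q(z=i|x)
        mu = (q.T @ X) / q.sum(axis=0)[:, None]

    return mu, q
```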
16/24
GMM Demo (1st try)
17/24
GMM Demo (2nd try)
18/24
Gaussian Mixture Model (Summary)
The number of clusters needs to be determined in advance.
EM converges to a local optimum after a few iterations.
The EM objective is non-convex, so initialization matters.
For a better result, run the algorithm multiple times and retain the best solution.
19/24
The K-Means Algorithm
Cluster assignment step:

c(x) = \arg\min_k \; \|x - \mu_k\|^2

Centroid update step:

\mu_k = \frac{\sum_{x\,:\,c(x)=k} x}{\sum_{x\,:\,c(x)=k} 1}
K-means can be interpreted as the limit case of EM for isotropic Gaussian distributions whose variance tends to zero.
K-means produces hard assignments (i.e. each data point belongs to a single cluster).
K-means has the same limitations as GMM (non-convex, predefined cluster shape).
K-means has only one parameter: the number of clusters.
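A minimal NumPy sketch of the two steps above (one possible implementation, not the lecture's reference code):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain k-means: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # initial centroids

    for _ in range(n_iter):
        # Assignment step: c(x) = argmin_k ||x - mu_k||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        c = d2.argmin(axis=1)
        # Update step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)

    return mu, c
```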
20/24
Kernelizing K-Means
Problem: Standard k-means requires clusters to have a round shape. It fails on "banana"-shaped datasets.
Idea: Replace x by a nonlinear feature map φ(x) and define μ_i = Σ_j α_ij φ(x_j), i.e. centroids become weighted combinations of data points.
1. Cluster assignment step:

c(x) = \arg\min_i \; \|\phi(x) - \mu_i\|^2 = \arg\max_i \; \Big[ 2 \sum_j \alpha_{ij} \, k(x, x_j) - \alpha_i^\top K \alpha_i \Big]

2. Centroid update step:

\mu_i = \arg\min_{\mu} \sum_{x\,:\,c(x)=i} \|\phi(x) - \mu\|^2
\quad\Rightarrow\quad
\alpha_i = K^{-1} \, \frac{\sum_{x\,:\,c(x)=i} k(X, x)}{\sum_{x\,:\,c(x)=i} 1}

[Figure: standard k-means vs. kernel k-means]
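A compact sketch of kernel k-means on a precomputed kernel matrix K; it uses the equivalent indicator form of the centroid update (α_ij = 1/N_i for points currently in cluster i, 0 otherwise) rather than explicitly applying K^{-1} (an illustrative implementation, not the lecture's):

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Kernel k-means on a precomputed kernel matrix K (N x N).

    Centroids are mu_i = sum_j alpha_ij phi(x_j), with alpha_i the normalized
    indicator vector of cluster i (equivalent to the K^{-1}-based update)."""
    N = K.shape[0]
    rng = np.random.default_rng(seed)
    c = rng.integers(n_clusters, size=N)                 # random initial assignment

    for _ in range(n_iter):
        # Centroid update step in alpha form
        alpha = np.zeros((n_clusters, N))
        for i in range(n_clusters):
            mask = (c == i)
            if mask.any():
                alpha[i, mask] = 1.0 / mask.sum()
        # Assignment step: argmin_i ||phi(x) - mu_i||^2
        #   = argmin_i ( alpha_i^T K alpha_i - 2 sum_j alpha_ij k(x, x_j) )
        quad = np.einsum('in,nm,im->i', alpha, K, alpha)  # alpha_i^T K alpha_i
        cross = K @ alpha.T                               # (N, C): sum_j alpha_ij k(x, x_j)
        c = (quad[None, :] - 2 * cross).argmin(axis=1)

    return c
```

With a linear kernel k(x, x') = x⊤x' this reduces to standard k-means; with e.g. an RBF kernel it can separate non-convex clusters such as the banana-shaped data above.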
21/24
Other Cluster Algorithms
Deep k-means
Hierarchical clustering
Divisive clustering
Agglomerative clustering
DBSCAN
Affinity Propagation
22/24
Beyond Clustering
Mixture model (today's lecture) vs. product of experts (next week's lecture)
23/24
Summary
Latent variable models are a powerful approach to modeling complex data distributions.
Expectation-maximization is a framework to efficiently optimize these models.
Latent variable models can be used for clustering, e.g. the Gaussian mixture model.
The well-known k-means algorithm can be seen as a limit case of the Gaussian mixture model.
K-means can be kernelized to produce nonlinear boundaries between clusters.
24/24