Outline
Covered content:
Products of Experts (PoE)
Restricted Boltzmann Machine (RBM)
Structure of an RBM
RBM learning algorithms
Application of RBMs
Reference:
Hinton, Geoffrey E. (2002). Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation 14(8): 1771–1800.
Beyond Clustering
Mixture model (last week's lecture) vs. product of experts (today's lecture).
Mixture model vs. Product of experts
Mixture Model: Mixture models are commonly used for clustering problems, and the probability is defined as a weighted sum of probability distributions:

p(x|\theta) = \sum_{k=1}^{C} \alpha_k \, p(x|\theta_k)

with \alpha_k denoting the mixing coefficients, subject to the constraints \sum_{k=1}^{C} \alpha_k = 1 and \alpha_k \geq 0.

Product of Experts (PoE): Products of experts are commonly used for learning distributed representations (e.g. unsupervised feature extraction) and are defined as a product of functions:

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \underbrace{g(x|\theta_j)}_{j\text{-th expert}}

where Z is a normalization term set to ensure that p(x|\theta) integrates to 1.
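To make the contrast concrete, here is a minimal sketch (an illustrative example of my own, not from the lecture) that combines the same two univariate Gaussian experts once as a mixture and once as a product of experts; the means, scales and mixing coefficients are assumed values:

```python
import numpy as np
from scipy.stats import norm

# Two univariate Gaussian "experts" (assumed toy parameters).
experts = [norm(loc=-1.0, scale=1.0), norm(loc=+1.0, scale=1.0)]
alphas = [0.5, 0.5]  # mixing coefficients (sum to 1)

def mixture_density(x):
    # Weighted sum of the expert densities.
    return sum(a * e.pdf(x) for a, e in zip(alphas, experts))

def poe_density_unnormalized(x):
    # Product of the expert densities; the normalizer Z is omitted here
    # because it is generally intractable for a PoE.
    return np.prod([e.pdf(x) for e in experts], axis=0)

x = np.linspace(-4, 4, 9)
print(mixture_density(x))            # broad, bimodal
print(poe_density_unnormalized(x))   # concentrated where all experts agree
```

The mixture spreads probability mass over the experts, whereas the product concentrates mass where all experts agree; this behavior is what the following examples develop.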
Example 1: Product of Gaussians
Consider the experts to be univariate Gaussians, each specialized on a particular input dimension:

g(x|\theta_i) = \frac{1}{(2\pi\sigma_i^2)^{1/2}} \exp\Big( -\frac{(x_i - \mu_i)^2}{2\sigma_i^2} \Big)

The product of experts (PoE) can be developed as:

p(x|\theta) = \prod_{i=1}^{d} g(x|\theta_i) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu) \Big)

with \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2), i.e. a multivariate Gaussian distribution without correlation between features.
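As a quick numerical check (a sketch with assumed parameter values, not part of the lecture), the product of the univariate experts indeed matches a diagonal-covariance multivariate Gaussian:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

mu = np.array([0.5, -1.0, 2.0])       # per-dimension means (assumed values)
sigma = np.array([1.0, 0.5, 2.0])     # per-dimension standard deviations
x = np.array([0.0, 0.3, 1.5])         # an arbitrary test point

# Product of univariate Gaussian experts.
product_of_experts = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Multivariate Gaussian with diagonal covariance.
multivariate = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sigma**2))

assert np.isclose(product_of_experts, multivariate)  # here Z = 1: the product is already normalized
```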
Example 2: Product of Gaussian Mixtures
Let the experts be the univariate mixtures:

g(x|\theta_i) = \sum_{k=1}^{C} \pi^{-1/2} \exp\big( -(x_i - \mu_{ik})^2 \big)

The product of experts (PoE) can be developed as:

p(x|\theta) = \frac{1}{Z} \prod_{i=1}^{d} \sum_{k=1}^{C} \pi^{-1/2} \exp\big( -(x_i - \mu_{ik})^2 \big)
= \frac{1}{Z} \sum_{k_1=1}^{C} \cdots \sum_{k_d=1}^{C} \underbrace{\prod_{i=1}^{d} \pi^{-1/2} \exp\big( -(x_i - \mu_{i k_i})^2 \big)}_{\text{multivariate Gaussian}}

i.e. a mixture of exponentially many (C^d) multivariate Gaussians. Therefore, a PoE can encode many variations with few parameters.
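The exponential blow-up can be checked numerically; the following sketch (own toy values, not from the slides) expands the product of two 2-component mixtures into its C^d = 4 Gaussian terms and verifies that both forms agree:

```python
import numpy as np
from itertools import product

# Assumed means mu[i, k] for dimension i and mixture component k.
mu = np.array([[-1.0, 1.0],
               [-2.0, 2.0]])
d, C = mu.shape
x = np.array([0.3, -0.7])      # arbitrary test point

def bump(xi, m):
    # One univariate Gaussian bump, matching the expert definition above.
    return np.pi ** -0.5 * np.exp(-(xi - m) ** 2)

# Product of per-dimension mixtures (up to the global normalizer Z).
product_form = np.prod([sum(bump(x[i], mu[i, k]) for k in range(C)) for i in range(d)])

# Expanded form: sum over all C**d combinations (k_1, ..., k_d) of multivariate Gaussian terms.
expanded_form = sum(np.prod([bump(x[i], ks[i]) for i in range(d)])
                    for ks in product(*[mu[i] for i in range(d)]))

assert np.isclose(product_form, expanded_form)
```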
Example 3: Product of t-Student Distributions
Define the experts to be t-Student distributions in some projected space (z_j = w_j^\top x):

g(x|\theta_j) = \frac{1}{\alpha_j + (w_j^\top x)^2}, \qquad \|w_j\| = 1

The resulting product of experts

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \frac{1}{\alpha_j + (w_j^\top x)^2},

with H the number of experts, produces a non-Gaussian multivariate distribution, which can be useful to model e.g. image or speech data. This PoE has connections to other analyses, e.g. independent component analysis (ICA).
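For illustration, the following sketch (own toy parameters, not the lecture's) evaluates the unnormalized product-of-t-Student density and shows its heavy, non-Gaussian tails:

```python
import numpy as np

# Two projection directions w_j (assumed unit vectors) and scale parameters alpha_j.
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # each row is one w_j, ||w_j|| = 1
alpha = np.array([1.0, 1.0])

def poe_unnormalized(x):
    z = W @ x                        # projections z_j = w_j^T x
    return np.prod(1.0 / (alpha + z ** 2))

print(poe_unnormalized(np.array([0.0, 0.0])))   # highest density at the origin
print(poe_unnormalized(np.array([3.0, 0.0])))   # decays polynomially (heavy tails), unlike a Gaussian
```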
The Restricted Boltzmann Machine
(Figure: bipartite graph connecting binary latent units to binary input units.)
The restricted Boltzmann machine (RBM) is a joint probability model defined over input features x \in \{0,1\}^d and latent variables h \in \{0,1\}^H:

p(x,h|\theta) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{d} \sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j \Big)
The parameter w_ij can be interpreted as the connection strength between input feature x_i and latent variable h_j: the larger w_ij, the more strongly x_i and h_j co-activate.
The Restricted Boltzmann Machine (PoE View)
Connection between RBM and PoE
The RBM, when marginalized over its hidden units, has the structure of a product of experts with g(x, \theta_j) = 1 + \exp(w_j^\top x + b_j).
Proof:

\begin{aligned}
p(x|\theta) &= \sum_{h \in \{0,1\}^H} p(x,h|\theta) \\
&= \sum_{h \in \{0,1\}^H} \frac{1}{Z} \exp\Big( \sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j \Big) \\
&= \frac{1}{Z} \sum_{h \in \{0,1\}^H} \prod_{j=1}^{H} \exp\big( (w_j^\top x + b_j)\, h_j \big) \\
&= \frac{1}{Z} \prod_{j=1}^{H} \sum_{h_j \in \{0,1\}} \exp\big( (w_j^\top x + b_j)\, h_j \big) \\
&= \frac{1}{Z} \prod_{j=1}^{H} \big( 1 + \exp(w_j^\top x + b_j) \big)
\end{aligned}
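This identity is easy to verify numerically; the sketch below (own toy example, not from the lecture) brute-forces the sum over all hidden configurations of a small randomly initialized RBM and compares it with the product-of-experts form:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d, H = 4, 3                      # small sizes so the sum over h is cheap
W = rng.normal(size=(d, H))      # weights w_ij (assumed random values)
b = rng.normal(size=H)           # hidden biases b_j
x = rng.integers(0, 2, size=d)   # an arbitrary binary input

# Left-hand side: explicit marginalization over all 2**H hidden configurations.
lhs = sum(np.exp(x @ W @ np.array(h) + b @ np.array(h))
          for h in product([0, 1], repeat=H))

# Right-hand side: product-of-experts form, prod_j (1 + exp(w_j^T x + b_j)).
rhs = np.prod(1.0 + np.exp(W.T @ x + b))

assert np.isclose(lhs, rhs)
```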
Interpreting the RBM Experts
The experts

g(x, \theta_j) = 1 + \exp(w_j^\top x + b_j)

forming the RBM implement two behaviors:

- w_j^\top x + b_j > 0, hence g(x, \theta_j) \gg 1: the example x is in the area of competence of the expert, and the latter speaks in favor of x.
- w_j^\top x + b_j < 0, hence g(x, \theta_j) \approx 1: the example x is outside the expert's area of competence, and the expert withdraws, i.e. multiplies by 1.

This double behavior is important to implement the nonlinearity.

RBM as a Neural Network

The product of experts (PoE) model

p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \big( 1 + \exp(w_j^\top x + b_j) \big)

can be rewritten as:

\log p(x|\theta) = \underbrace{\sum_{j=1}^{H} \underbrace{\log\big( 1 + \exp(w_j^\top x + b_j) \big)}_{\text{softplus}}}_{f(x,\theta)} - \log Z

where f(x,\theta) can be interpreted as a neural network.

Note: The neural network can predict which examples x are more probable relative to other examples, but not the actual probability score, because \log Z is difficult to compute.
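As a small illustration of this view (a sketch with assumed toy parameters, not the lecture's code), f(x, theta) is just a sum of softplus activations, and only differences of f between examples are meaningful:

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def f(x, W, b):
    # Unnormalized log-probability f(x, theta) = sum_j softplus(w_j^T x + b_j).
    return softplus(W.T @ x + b).sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # weights of a tiny RBM (assumed values)
b = rng.normal(size=4)               # hidden biases
x1 = rng.integers(0, 2, size=6)
x2 = rng.integers(0, 2, size=6)

# Only relative comparisons are meaningful, since log Z is unknown.
print(f(x1, W, b) - f(x2, W, b))     # > 0 means the model finds x1 more probable than x2
```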
RBM as a Neural Network

RBM trained on MNIST digits (only digits "3" with 100-125 foreground pixels). (Figure: the 2% least likely and the 2% most likely digits under the model.)

Observation: Digits with anomalies are predicted by the RBM to be less likely than clean digits.

RBM as a Neural Network

Universal approximation: The RBM is a universal approximator of data distributions in the binary input space \{0,1\}^d.

Proof by construction: Add a softplus neuron pointing to each corner of the hypercube. Scale the weight w_j according to the probability of each corner, and choose the bias accordingly.

Note: This construction requires exponentially many neurons. In practice, each expert j captures more than a single corner of the hypercube (i.e. learns general features).

RBM as a Neural Network

(Figure: learned RBM weights (w_j)_{j=1}^H next to k-means centroids.)

Observation: In the RBM, the experts (w_j)_j encode only part of the digit (e.g. a stroke). Each expert therefore contributes to making all data containing that particular stroke more likely. The experts that the RBM extracts can be used for transfer learning, e.g. reused to predict different digits/characters. K-means centroids are much more localized, and consequently less transferable.

Learning an RBM

Recap: mixture model (EM bound). The mixture model can be optimized by finding the optimum of a lower bound at each iteration:

\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{C} \alpha_k \, p(x_n|\theta_k) \;\geq\; \sum_{n=1}^{N} \sum_{k=1}^{C} q_k \log \frac{\alpha_k \, p(x_n|\theta_k)}{q_k}

where N is the number of data points, and (\alpha_k)_k and (q_k)_k are distributions over the C mixture elements.

Question: Can the same EM approach be used for products of experts?

\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \big( f(x_n,\theta) - \log Z \big)

Answer: No, the expression cannot be bounded in the same way as for the mixture model (it has a different structure).

Gradient of the RBM

Idea: Although EM is not applicable, we can still compute the gradient of the log-likelihood and perform gradient descent.

\begin{aligned}
\nabla_\theta \log p(\mathcal{D}|\theta) &= \nabla_\theta \sum_{n=1}^{N} \big( f(x_n,\theta) - \log Z \big) \\
&= \sum_{n=1}^{N} \nabla_\theta f(x_n,\theta) - N \, \nabla_\theta \log \sum_{x \in \{0,1\}^d} \exp\big( f(x,\theta) \big) \\
&= \sum_{n=1}^{N} \nabla_\theta f(x_n,\theta) - N \sum_{x \in \{0,1\}^d} \frac{\exp\big( f(x,\theta) \big)}{\sum_{x' \in \{0,1\}^d} \exp\big( f(x',\theta) \big)} \, \nabla_\theta f(x,\theta) \\
&= \sum_{n=1}^{N} \nabla_\theta f(x_n,\theta) - N \, \mathbb{E}_{x \sim p(x|\theta)}\big[ \nabla_\theta f(x,\theta) \big]
\end{aligned}

The gradient is a difference between a data-dependent term and a model-dependent term.

RBM Update Rule

Based on the gradient calculation above, we can build the update rule:

\theta \leftarrow \theta + \gamma \left( \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta f(x_n,\theta) - \mathbb{E}_{x \sim p(x|\theta)}\big[ \nabla_\theta f(x,\theta) \big] \right)

Interpretation:
- The first term makes the data more likely according to the model.
- The second term makes everything that the model considers likely, less likely.
- The training procedure terminates when we reach some equilibrium.

(Figure: the model density p(x) is raised near the data and lowered elsewhere.)

Computation of the RBM Update Rule

\theta \leftarrow \theta + \gamma \left( \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta f(x_n,\theta) - \mathbb{E}_{x \sim p(x|\theta)}\big[ \nabla_\theta f(x,\theta) \big] \right)

Observations:
- The first term is easy to compute (backprop in f(x,\theta), averaged over all data points).
- The second term is more tricky, because p(x|\theta) is defined over exponentially many states. An unbiased approximation of the expected gradient can be obtained by generating a sample \{x_1, \dots, x_m\} from the model distribution p(x|\theta).

Question: How do we sample from p(x|\theta)?

Idea: Switch back to the classical (i.e. non-PoE) view of the RBM, where we can use latent variables to ease sampling.

Recap: Classical View of the RBM

(Figure: bipartite graph connecting binary latent units to binary input units.)

The RBM is a probability model defined over input features x \in \{0,1\}^d and latent variables h \in \{0,1\}^H:

p(x,h|\theta) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{d} \sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j \Big)

The parameter w_ij can be interpreted as the connection strength between input feature x_i and latent variable h_j: the larger w_ij, the more strongly x_i and h_j co-activate.

Conditional Distributions in an RBM

Sampling p(x) and p(h) is hard; however, the hidden variables h can be sampled easily and independently when we condition on the state x, and similarly for x conditioned on h. Let z_j = \sum_i x_i w_{ij} + b_j. We proceed as:

p(h|x,\theta) = \frac{\frac{1}{Z} \exp\big( \sum_j z_j h_j \big)}{\sum_{h'} \frac{1}{Z} \exp\big( \sum_j z_j h'_j \big)} = \frac{\prod_j \exp(z_j h_j)}{\prod_j \sum_{h'_j} \exp(z_j h'_j)} = \prod_j \underbrace{\frac{\exp(z_j h_j)}{1 + \exp(z_j)}}_{p(h_j|x,\theta)}

and we observe that, conditioned on x, the latent variables (h_j)_j can be sampled easily (Bernoulli distributions) and independently.

By symmetry of the RBM model, we get a similar result for the visible units conditioned on the latent variables, i.e. p(x_i|h,\theta) = \exp(z_i x_i) / (1 + \exp(z_i)) with z_i = \sum_j w_{ij} h_j.

Block Gibbs Sampling

Observation: We know p(h_j|x,\theta) and p(x_i|h,\theta). But how do we sample from p(x|\theta), which we need to compute the RBM gradient?

Block Gibbs sampling: Start with some random input x^{(0)}, then sample alternately:

h^{(0)} \sim \big( p(h_j \mid x^{(0)}, \theta) \big)_{j=1}^{H}
x^{(1)} \sim \big( p(x_i \mid h^{(0)}, \theta) \big)_{i=1}^{d}
h^{(1)} \sim \big( p(h_j \mid x^{(1)}, \theta) \big)_{j=1}^{H}
x^{(2)} \sim \big( p(x_i \mid h^{(1)}, \theta) \big)_{i=1}^{d}
\vdots

until x^{(t)} converges in distribution to p(x|\theta).

Fast Approximations

In practice, Gibbs sampling can take a long time to converge. Instead, we can use fast approximations of converged Gibbs sampling. Common approximation strategies (a training sketch follows below):

- Contrastive divergence (CD-k): Start from an existing data point, and perform k steps of alternating Gibbs sampling.
- Persistent contrastive divergence (PCD): Start from the Gibbs sample x of the previous iteration of gradient descent, and perform one step of Gibbs sampling.
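Below is a minimal CD-1 training sketch (own illustrative code with assumed toy data, not the lecture's reference implementation). It combines the conditional distributions above with the PoE update rule, and only learns the weights W and hidden biases b of the model as defined earlier (no visible bias):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(X, W, b):
    # p(h_j = 1 | x) = sigm(x^T w_j + b_j), sampled independently per unit.
    P = sigmoid(X @ W + b)
    return P, (rng.random(P.shape) < P).astype(float)

def sample_x_given_h(H, W):
    # p(x_i = 1 | h) = sigm(sum_j w_ij h_j); no visible bias in the model above.
    P = sigmoid(H @ W.T)
    return (rng.random(P.shape) < P).astype(float)

def cd1_step(X, W, b, lr=0.1):
    # Positive phase: data-dependent statistics.
    P_h_data, H_data = sample_h_given_x(X, W, b)
    # Negative phase: one step of block Gibbs sampling starting from the data (CD-1).
    X_model = sample_x_given_h(H_data, W)
    P_h_model, _ = sample_h_given_x(X_model, W, b)
    # Update: data-dependent term minus model-dependent term.
    grad_W = (X.T @ P_h_data - X_model.T @ P_h_model) / len(X)
    grad_b = (P_h_data - P_h_model).mean(axis=0)
    return W + lr * grad_W, b + lr * grad_b

# Toy binary data (assumed): 100 examples with 6 features, 4 hidden units.
X = (rng.random((100, 6)) < 0.3).astype(float)
W = 0.01 * rng.normal(size=(6, 4))
b = np.zeros(4)
for _ in range(200):
    W, b = cd1_step(X, W, b)
```

A PCD variant would keep X_model across gradient steps instead of restarting from the data at every iteration.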
Application: RBM for Data Denoising

(Figure: a noisy digit before denoising and its reconstruction after denoising.)

Suppose you receive an example x_noisy and would like to denoise it. A reconstruction of that digit can be obtained by mapping to the latent variables and projecting back:

1. Projection on the latent variables: h = \mathrm{sigm}(W^\top x_{\text{noisy}} + b)
2. Backprojection on the input domain: x_{\text{rec}} = \mathrm{sigm}(W h)

Application: RBM for Representation Learning

The RBM model can be used as a neural network initialization to achieve lower generalization error on some classification task.

Step 1: Create a layer of features based on the learned RBM parameters (MNIST example):

\phi(x) = \begin{pmatrix} \mathrm{sigm}(w_1^\top x) \\ \vdots \\ \mathrm{sigm}(w_H^\top x) \end{pmatrix}

Step 2: Add a top layer that maps \phi(x) to the class probabilities, and train the resulting neural network end-to-end using gradient descent.

Source: Erhan et al. (2010), Why Does Unsupervised Pre-training Help Deep Learning?

Summary

- The product of experts is an unsupervised learning approach that is substantially different from mixture models.
- Products of experts such as the RBM are optimized by gradient descent (instead of EM).
- The RBM has several desirable properties: a simple gradient, and a block Gibbs sampler (quite fast).
- The RBM can be used for various tasks, e.g. probability modeling, data denoising, representation learning.