Outline

Covered content:
• Products of Experts (PoE)
• Restricted Boltzmann Machine (RBM)
• Structure of an RBM
• RBM learning algorithms
• Applications of RBMs

Reference:
• Hinton, Geoffrey E. (2002). "Training Products of Experts by Minimizing Contrastive Divergence". Neural Computation. 14 (8): 1771–1800.

Beyond Clustering

Figure: mixture model (last week's lecture) vs. product of experts (today's lecture).

Mixture model vs. Product of experts

Mixture Model: Mixture models are commonly used for clustering problems, and the probability is defined as a weighted sum of probability distributions:

$$p(x|\alpha,\theta) = \sum_{k=1}^{C} \alpha_k\, p(x|\theta_k)$$

with $\alpha_k$ denoting the mixing coefficients, subject to the constraints $\sum_{k=1}^{C} \alpha_k = 1$ and $\alpha_k \ge 0$.

Product of Experts (PoE): Products of experts are commonly used for learning distributed representations (e.g. unsupervised feature extraction) and are defined as a product of functions:

$$p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \underbrace{g(x|\theta_j)}_{j\text{th expert}}$$

where $Z$ is a normalization term set to ensure that $p(x|\theta)$ integrates to 1.
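To make the contrast concrete, here is a small numpy sketch (not from the lecture; the two Gaussian experts and their parameters are made up): the mixture spreads probability mass over both components, whereas the product, once renormalized by $Z$, concentrates mass only where every expert assigns high density.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]
mus, sigmas, alphas = [-2.0, 2.0], [1.0, 1.0], [0.5, 0.5]

# Mixture: weighted sum of the component densities (mass under both components).
p_mix = sum(a * gaussian(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

# PoE: product of the same densities, renormalized numerically (the 1/Z term);
# mass concentrates where *all* experts assign high density.
p_poe = gaussian(x, mus[0], sigmas[0]) * gaussian(x, mus[1], sigmas[1])
p_poe /= p_poe.sum() * dx
```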

Example 1: Product of Gaussians

Consider the experts to be univariate Gaussians, each specialized on a particular input dimension:

$$g(x|\theta_i) = \frac{1}{(2\pi\sigma_i^2)^{1/2}} \exp\!\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$

The product of experts (PoE) can be developed as:

$$p(x|\theta) = \prod_{i=1}^{d} g(x|\theta_i) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$

with $\Sigma = \mathrm{diag}(\sigma_1^2,\dots,\sigma_d^2)$, i.e. a multivariate Gaussian distribution without correlation between features.
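As a quick numerical sanity check of this identity (a sketch with made-up parameters, relying on scipy's standard Gaussian densities), the product of $d$ univariate experts matches the diagonal-covariance multivariate Gaussian at any point:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical parameters for d = 3 independent experts.
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([1.0, 0.5, 2.0])
x = np.array([0.3, -0.8, 1.5])

# Product of the univariate Gaussian experts.
poe = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Multivariate Gaussian with diagonal covariance Sigma = diag(sigma_i^2).
mvn = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sigma ** 2))

assert np.isclose(poe, mvn)   # here Z = 1: the product of densities is already normalized
```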

Example 2: Product of Gaussian Mixtures

Let the experts be the univariate mixtures:

$$g(x|\theta_i) = \sum_{k=1}^{C} \pi^{-1/2} \exp\!\left(-(x_i-\mu_{ik})^2\right)$$

The product of experts (PoE) can be developed as:

$$p(x|\theta) = \frac{1}{Z} \prod_{i=1}^{d} \sum_{k=1}^{C} \pi^{-1/2} \exp\!\left(-(x_i-\mu_{ik})^2\right) = \frac{1}{Z} \sum_{k_1=1}^{C} \cdots \sum_{k_d=1}^{C} \underbrace{\prod_{i=1}^{d} \pi^{-1/2} \exp\!\left(-(x_i-\mu_{ik_i})^2\right)}_{\text{multivariate Gaussian}}$$

i.e. a mixture of exponentially many ($C^d$) multivariate Gaussians. Therefore, a PoE can encode many variations with few parameters.
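The expansion into $C^d$ terms is just the distributive law. The following sketch (hypothetical parameters, with $d = 2$ and $C = 2$ so that all combinations can be enumerated) verifies numerically that the product of the mixture experts equals the sum over all $(k_1,\dots,k_d)$ combinations:

```python
import numpy as np
from itertools import product

C, d = 2, 2
mu = np.array([[-1.0, 1.0],    # mu[i, k]: mean of component k of expert i
               [ 0.0, 2.0]])
x = np.array([0.3, 1.2])

def expert(xi, mus_i):
    """Unnormalized univariate mixture expert g(x_i | theta_i)."""
    return np.sum(np.pi ** -0.5 * np.exp(-(xi - mus_i) ** 2))

# Product of the d mixture experts (up to the 1/Z factor).
poe = np.prod([expert(x[i], mu[i]) for i in range(d)])

# Expansion: sum over all C**d combinations (k_1, ..., k_d) of per-dimension products.
expanded = sum(
    np.prod([np.pi ** -0.5 * np.exp(-(x[i] - mu[i, k[i]]) ** 2) for i in range(d)])
    for k in product(range(C), repeat=d)
)
assert np.isclose(poe, expanded)
```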

Example 3: Product of t-Student Distributions

Define the experts to be t-Student distributions in some projected space ($z = w_j^\top x$):

$$g(x|\theta_j) = \frac{1}{\alpha_j + (w_j^\top x)^2}, \qquad \|w_j\| = 1$$

The resulting product of experts

$$p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \frac{1}{\alpha_j + (w_j^\top x)^2}$$

with $H$ the number of experts, produces a non-Gaussian multivariate distribution, which can be useful to model e.g. image or speech data. This PoE has connections to other analyses, e.g. independent component analysis (ICA).
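A minimal sketch of this density, up to the intractable $\log Z$ (with random unit-norm projection directions $w_j$ and placeholder values $\alpha_j = 1$), simply accumulates the log of each expert:

```python
import numpy as np

rng = np.random.default_rng(0)
H, d = 8, 16                                     # hypothetical numbers of experts / input dims
W = rng.normal(size=(H, d))                      # rows are the projection directions w_j
W /= np.linalg.norm(W, axis=1, keepdims=True)    # enforce ||w_j|| = 1
alpha = np.ones(H)                               # placeholder tail parameters alpha_j

def log_p_unnormalized(x):
    """log prod_j 1 / (alpha_j + (w_j^T x)^2), i.e. log p(x|theta) + log Z."""
    z = W @ x                                    # projections w_j^T x
    return -np.sum(np.log(alpha + z ** 2))

x = rng.normal(size=d)
print(log_p_unnormalized(x))
```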

The Restricted Boltzmann Machine

Figure: RBM with a layer of latent variables (0 1 0 0 1 1) connected to a layer of input variables (0 1 0 0 1).

The restricted Boltzmann machine (RBM) is a joint probability model defined over input features $x \in \{0,1\}^d$ and latent variables $h \in \{0,1\}^H$:

$$p(x,h|\theta) = \frac{1}{Z} \exp\!\left(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\right)$$

The parameter $w_{ij}$ can be interpreted as the connection strength between input feature $x_i$ and latent variable $h_j$: the larger $w_{ij}$, the more strongly $x_i$ and $h_j$ co-activate.
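The unnormalized joint probability is straightforward to evaluate. A small sketch (weights and biases are made up; the sizes match the 5-input / 6-latent configuration of the figure) reads:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 5, 6                         # input / latent sizes as in the figure
W = 0.1 * rng.normal(size=(d, H))   # connection strengths w_ij (arbitrary values)
b = np.zeros(H)                     # latent biases b_j

def unnormalized_joint(x, h):
    """exp( sum_ij x_i w_ij h_j + sum_j b_j h_j ), i.e. p(x, h | theta) up to 1/Z."""
    return np.exp(x @ W @ h + b @ h)

x = np.array([0, 1, 0, 0, 1])
h = np.array([0, 1, 0, 0, 1, 1])
print(unnormalized_joint(x, h))
```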

The Restricted Boltzmann Machine (PoE View)

Connection between RBM and PoE: The RBM, when marginalized over its hidden units, has the structure of a Product of Experts with $g(x,\theta_j) = 1 + \exp(w_j^\top x + b_j)$.

Proof:

$$p(x|\theta) = \sum_{h\in\{0,1\}^H} p(x,h|\theta)$$
$$= \frac{1}{Z} \sum_{h\in\{0,1\}^H} \exp\!\left(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\right)$$
$$= \frac{1}{Z} \sum_{h\in\{0,1\}^H} \prod_{j=1}^{H} \exp\!\left((w_j^\top x + b_j)\cdot h_j\right)$$
$$= \frac{1}{Z} \prod_{j=1}^{H} \sum_{h_j\in\{0,1\}} \exp\!\left((w_j^\top x + b_j)\cdot h_j\right)$$
$$= \frac{1}{Z} \prod_{j=1}^{H} \left(1 + \exp(w_j^\top x + b_j)\right)$$
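The key step of the proof, i.e. exchanging the sum over the $2^H$ hidden configurations with the product over experts, can be checked numerically for a small RBM (random parameters, $H$ small enough to enumerate all hidden states):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
d, H = 4, 3                        # small sizes so all 2**H hidden states can be enumerated
W = rng.normal(size=(d, H))
b = rng.normal(size=H)
x = rng.integers(0, 2, size=d)

# Brute-force marginalization: sum the unnormalized joint over all hidden states h.
brute = sum(np.exp(x @ W @ np.array(h) + b @ np.array(h))
            for h in product([0, 1], repeat=H))

# Closed form from the proof: prod_j (1 + exp(w_j^T x + b_j)).
closed = np.prod(1.0 + np.exp(W.T @ x + b))

assert np.isclose(brute, closed)
```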

Interpreting the RBM Experts

The experts

$$g(x,\theta_j) = 1 + \exp(w_j^\top x + b_j)$$

forming the RBM implement two behaviors:
• $w_j^\top x + b_j > 0$, i.e. $g(x;\theta_j) \gg 1$: the example $x$ is in the area of competence of the expert, and the latter speaks in favor of $x$.
• $w_j^\top x + b_j < 0$, i.e. $g(x;\theta_j) \approx 1$: the example $x$ is outside the expert's area of competence, and the expert 'withdraws', i.e. multiplies by 1.

This double behavior is important to implement the nonlinearity.

RBM as a Neural Network

The product of experts (PoE) model

$$p(x|\theta) = \frac{1}{Z} \prod_{j=1}^{H} \left(1 + \exp(w_j^\top x + b_j)\right)$$

can be rewritten as:

$$\log p(x|\theta) = \underbrace{\sum_{j=1}^{H} \underbrace{\log\!\left(1 + \exp(w_j^\top x + b_j)\right)}_{\text{softplus}}}_{f(x,\theta)} - \log Z$$

where $f(x,\theta)$ can be interpreted as a neural network.

Note: the neural network can predict which examples $x$ are more probable relative to other examples, but not the actual probability score, because $\log Z$ is difficult to compute.

RBM as a Neural Network

RBM trained on MNIST digits (only digits "3" with 100–125 foreground pixels).

Figure: the 2% least likely and the 2% most likely digits according to the RBM.

Observation:
• Digits with anomalies are predicted by the RBM to be less likely than clean digits.

RBM as a Neural Network

Universal approximation: The RBM is a universal approximator of data distributions in the binary input space $\{0,1\}^d$.

"Proof" by construction: add a softplus neuron pointing to each corner of the hypercube, scale the weight $w_j$ according to the probability of each corner, and choose the bias accordingly.

Note: this construction requires exponentially many neurons. In practice, each expert $\theta_j$ captures more than a single corner of the hypercube (i.e. it learns general features).

RBM as a Neural Network

Figure: RBM weights $(w_j)_{j=1}^{H}$ compared to K-means centroids.

Observations:
• In the RBM, the experts $(w_j)_j$ encode only part of the digit (e.g. a stroke). Each expert therefore contributes to making all data containing that particular stroke more likely.
• The experts that the RBM extracts can be used for transfer learning, e.g. reused to predict different digits/characters.
• K-means centroids are much more localized, and consequently less transferable.

Learning an RBM

Recap: mixture model (EM bound). The mixture model can be optimized by finding the optimum of a lower bound at each iteration:

$$\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{C} \alpha_k\, p(x_n|\theta_k)\,\frac{\beta_k}{\beta_k} \;\ge\; \sum_{n=1}^{N} \sum_{k=1}^{C} \beta_k \log\!\left(\frac{\alpha_k\, p(x_n|\theta_k)}{\beta_k}\right)$$

where $N$ is the number of data points, and $(\alpha_k)_k$ and $(\beta_k)_k$ are distributions over the $C$ mixture elements.

Question: Can the same EM approach be used for a Product of Experts?

$$\log p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \left(f(x_n,\theta) - \log Z\right)$$

Answer: No, the expression cannot be bounded in the same way as for the mixture model (it has a different structure).

Gradient of the RBM

Idea: Although EM is not applicable, we can still compute the gradient of the log-likelihood and perform gradient descent.

$$\frac{\partial}{\partial\theta} \log p(\mathcal{D}|\theta) = \frac{\partial}{\partial\theta} \sum_{n=1}^{N} \left(f(x_n,\theta) - \log Z\right)$$
$$= \frac{\partial}{\partial\theta} \sum_{n=1}^{N} \left(f(x_n,\theta) - \log\!\!\sum_{x\in\{0,1\}^d}\!\!\exp(f(x,\theta))\right)$$
$$= \sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) - N\,\frac{\sum_{x\in\{0,1\}^d} \exp(f(x,\theta))\,\frac{\partial}{\partial\theta} f(x,\theta)}{\sum_{x\in\{0,1\}^d} \exp(f(x,\theta))}$$
$$= \sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) - N\,\mathbb{E}_{x\sim p(x|\theta)}\!\left[\frac{\partial}{\partial\theta} f(x,\theta)\right]$$

The gradient is a difference between a data-dependent and a model-dependent term.
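For the data-dependent term, $f(x,\theta) = \sum_j \mathrm{softplus}(w_j^\top x + b_j)$ and its gradient have a simple closed form, since the derivative of the softplus is the sigmoid. The following sketch (arbitrary parameters, with a finite-difference sanity check) computes both:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 6, 4                                     # hypothetical sizes
W = rng.normal(size=(d, H))
b = rng.normal(size=H)

def f(x, W, b):
    """f(x, theta) = sum_j softplus(w_j^T x + b_j), i.e. log p(x|theta) + log Z."""
    return np.sum(np.logaddexp(0.0, x @ W + b)) # logaddexp(0, z) = log(1 + exp(z))

def grad_f(x, W, b):
    """Data-dependent gradient: d/dz softplus(z) = sigmoid(z)."""
    s = 1.0 / (1.0 + np.exp(-(x @ W + b)))      # sigmoid(w_j^T x + b_j) for each j
    return np.outer(x, s), s                    # gradients w.r.t. W and b

x = rng.integers(0, 2, size=d).astype(float)
dW, db = grad_f(x, W, b)

# Finite-difference check of one weight entry.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
assert np.isclose((f(x, W_pert, b) - f(x, W, b)) / eps, dW[0, 0], atol=1e-4)
```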
RBM Update Rule

Based on the gradient calculation above, we can build the update rule:

$$\theta \leftarrow \theta + \gamma \cdot \left(\frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) - \mathbb{E}_{x\sim p(x|\theta)}\!\left[\frac{\partial}{\partial\theta} f(x,\theta)\right]\right)$$

Interpretation:
• The first term makes the data more likely according to the model.
• The second term makes everything that the model considers likely, less likely.
• The training procedure terminates when we reach some equilibrium.

Figure: sketch of the model density p(x) being raised at the data points and lowered elsewhere.

Computation of the RBM Update Rule

$$\theta \leftarrow \theta + \gamma \cdot \left(\frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial\theta} f(x_n,\theta) - \mathbb{E}_{x\sim p(x|\theta)}\!\left[\frac{\partial}{\partial\theta} f(x,\theta)\right]\right)$$

Observations:
• The left-hand side is easy to compute (backprop in $f(x,\theta)$, averaged over all data points).
• The right-hand side is more tricky, because $p(x|\theta)$ is defined over exponentially many states. An unbiased approximation of the expected gradient can be obtained by generating a sample $\{x_1,\dots,x_m\}$ from the model distribution $p(x|\theta)$.

Question: How do we sample from $p(x|\theta)$?
• Idea: Switch back to the 'classical' (i.e. non-PoE) view of the RBM, where we can use the latent variables to ease sampling.

Recap: 'Classical' View of the RBM

Figure: RBM with latent variables (0 1 0 0 1 1) connected to input variables (0 1 0 0 1).

The RBM is a probability model defined over input features $x \in \{0,1\}^d$ and latent variables $h \in \{0,1\}^H$:

$$p(x,h|\theta) = \frac{1}{Z} \exp\!\left(\sum_{i=1}^{d}\sum_{j=1}^{H} x_i w_{ij} h_j + \sum_{j=1}^{H} b_j h_j\right)$$

The parameter $w_{ij}$ can be interpreted as the connection strength between input feature $x_i$ and latent variable $h_j$: the larger $w_{ij}$, the more strongly $x_i$ and $h_j$ co-activate.

Conditional Distributions in an RBM

Sampling $p(x)$ and $p(h)$ is hard; however, the hidden variables $h$ can be sampled easily and independently when we condition on the state $x$, and similarly for $x$ conditioned on $h$. Let $z_j = \sum_i x_i w_{ij} + b_j$. We proceed as:

$$p(h|x,\theta) = \frac{\frac{1}{Z}\exp\!\left(\sum_j z_j h_j\right)}{\sum_{h}\frac{1}{Z}\exp\!\left(\sum_j z_j h_j\right)} = \frac{\prod_j \exp(z_j h_j)}{\prod_j \sum_{h_j} \exp(z_j h_j)} = \prod_j \underbrace{\frac{\exp(z_j h_j)}{1+\exp(z_j)}}_{p(h_j|x,\theta)}$$

and we observe that, conditioned on $x$, the latent variables $(h_j)_j$ can be sampled easily (Bernoulli distributions) and independently.

By symmetry of the RBM model, we get a similar result for the visible units conditioned on the latent variables, i.e. $p(x_i|h,\theta) = \exp(z_i h_i)/(1+\exp(z_i))$ with $z_i = \sum_j w_{ij} h_j$.

Block Gibbs Sampling

Observation: We know $p(h_j|x,\theta)$ and $p(x_i|h,\theta)$. But how do we sample from $p(x|\theta)$, which we need to compute the RBM gradient?

Block Gibbs Sampling: Start with some random input $x^{(0)}$, then sample alternately:

$$h^{(0)} \sim \big(p(h_j \mid x^{(0)},\theta)\big)_{j=1}^{H}$$
$$x^{(1)} \sim \big(p(x_i \mid h^{(0)},\theta)\big)_{i=1}^{d}$$
$$h^{(1)} \sim \big(p(h_j \mid x^{(1)},\theta)\big)_{j=1}^{H}$$
$$x^{(2)} \sim \big(p(x_i \mid h^{(1)},\theta)\big)_{i=1}^{d}$$
$$\vdots$$

until $x^{(\cdot)}$ converges in distribution to $p(x|\theta)$.

Fast Approximations

In practice, Gibbs sampling can take a long time to converge. Instead, we can use fast approximations of converged Gibbs sampling.

Common approximation strategies:
• Contrastive divergence (CD-k): Start from an existing data point, and perform k steps of alternating Gibbs sampling.
• Persistent contrastive divergence (PCD): Start from the Gibbs sample x of the previous iteration of gradient descent, and perform one step of Gibbs sampling.

Application: RBM for Data Denoising

Suppose you receive an example $x_{\text{noisy}}$ and would like to denoise it. A reconstruction of that digit can be obtained by mapping to the latent variables and projecting back:

1. Projection on the latent variables: $h = \mathrm{sigm}(W x_{\text{noisy}} + b)$
2. Backprojection on the input domain: $x_{\text{rec}} = \mathrm{sigm}(W^\top h)$

Figure: an MNIST digit before and after denoising.

Application: RBM for Representation Learning

The RBM model can be used as a neural network initialization to achieve lower generalization error on some classification task (MNIST example).

• Step 1: Create a layer of features based on the learned RBM parameters:

$$\phi(x) = \begin{pmatrix} \mathrm{sigm}(w_1^\top x) \\ \vdots \\ \mathrm{sigm}(w_H^\top x) \end{pmatrix}$$

• Step 2: Add a top layer that maps $\phi(x)$ to the class probabilities, and train the resulting neural network end-to-end using gradient descent.

Source: Erhan et al. (2010), "Why Does Unsupervised Pre-training Help Deep Learning?"
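Putting the pieces together, here is a minimal numpy sketch (toy sizes and random binary data standing in for binarized digits; not the lecture's reference implementation) of an RBM trained with CD-1: it uses the block Gibbs conditionals $p(h_j|x)$ and $p(x_i|h)$ derived above, follows the slides' parameterization with weights $W$ and hidden biases $b$ only (no visible biases), and includes the denoising-style reconstruction:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BernoulliRBM:
    """Minimal binary RBM trained with CD-1 (illustrative sketch, not optimized)."""

    def __init__(self, d, H, lr=0.05):
        self.W = 0.01 * rng.normal(size=(d, H))   # connection strengths w_ij
        self.b = np.zeros(H)                      # hidden biases b_j
        self.lr = lr

    def sample_h(self, x):
        p = sigmoid(x @ self.W + self.b)          # p(h_j = 1 | x) = sigm(z_j)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_x(self, h):
        p = sigmoid(h @ self.W.T)                 # p(x_i = 1 | h) = sigm(z_i)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd1_step(self, x):
        """One contrastive-divergence (CD-1) update on a batch x of shape (N, d)."""
        ph0, h0 = self.sample_h(x)                # data-dependent term
        _, x1 = self.sample_x(h0)                 # one step of block Gibbs sampling
        ph1, _ = self.sample_h(x1)                # approximation of the model term
        n = x.shape[0]
        self.W += self.lr * (x.T @ ph0 - x1.T @ ph1) / n
        self.b += self.lr * (ph0.mean(axis=0) - ph1.mean(axis=0))

    def denoise(self, x_noisy):
        """Map to the latent variables and project back, as in the denoising slide."""
        h = sigmoid(x_noisy @ self.W + self.b)
        return sigmoid(h @ self.W.T)

# Toy usage on random binary vectors standing in for binarized digits.
X = (rng.random((200, 25)) < 0.3).astype(float)
rbm = BernoulliRBM(d=25, H=10)
for _ in range(50):
    rbm.cd1_step(X)
x_rec = rbm.denoise(X[0])
```

With real data one would train on mini-batches, monitor the reconstruction error, and possibly switch to PCD for a better approximation of the model-dependent term.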
Summary

• The Product of Experts is an unsupervised learning approach that is substantially different from mixture models.
• Products of experts such as the RBM are optimized by gradient descent (instead of EM).
• The RBM has several desirable properties: a simple gradient and a block Gibbs sampler (quite fast).
• The RBM can be used for various tasks, e.g. probability modeling, data denoising, and representation learning.
