lecture/12-em-annotated.pdf
lecture/13-poe-annotated.pdf
lecture/14-xai-annotated.pdf
lecture/lecture1-annotated.pdf
lecture/lecture10.pdf
lecture/lecture11.pdf
1/24
Outline
Latent Variable Models
The Expectation Maximization Procedure
Gaussian Mixture Models
K-Means Clustering
Kernel K-Means
2/24
Motivation
PCA of Iris dataset PCA of Boston dataset
PCA of Diabetes dataset PCA of Digits dataset
Complex data cannot be modeled accurately by standard probability distributions
(e.g. Gaussian) and require a more complex description, e.g. with latent variables.
3/24
Latent Variable Models
A latent variable model (LVM) is a description of the data distribution that makes
use of a layer of abstraction (latent variables) to improve modeling capability.
p(x|z,)
p(z|)
dataxdata
Classical
descripon
Latent variable
descripon
p(x|)
x
z
Relation between the classsical and latent variable description:
p( |) =
zZ
p(z |) p( |z , ).
4/24
Learning a Latent Variable Model
Assuming a dataset D where we assume that the samples have been drawn i.i.d..
In that case, the log-likelihood of the data according to the model can be written
as:
logP(D|) = log
D
p( |)
=
D
log p( |)
and we wish to maximize that quantity. Injecting the latent variable model, the
objective to optimize becomes
=
D
log
zZ
p(z |) p( |z , )
Problem: The maximum of log p(D|) cannot be found analytically.
5/24
Learning a Latent Variable Model
Strategy: If the function p(D|) cannot be easily optimized, find a lower-bound
of that function that is easier to optimize.
6/24
Building a Lower-Bound
Jensens inequality
For any element of the d-dimensional simplex (i.e. 0 and 1 = 1, and
any concave function f : Rd R, we have
f (
d
i=1
iai )
d
i=1
i f (ai )
Application to the latent variable model:
Let be some probability distribution q(z |) independent from .
Let i containing the remaining terms.
Because the latent variable model
D log
zZ p(z |) p( |z , ) does
not contain such distribution q(z |) independent from , create it!
7/24
Building a Lower-Bound
Step 1: Add q(z |) both on the numerator and denominator:
log p(D|) =
D
log
z
p(z |) p( |z , )
q(z |) q(z |)
Step 2: Applying the Jensen inequality
D
z
log
p(z |) p( |z , )
q(z |)
q(z |)
and verify the bound:
=
D
z
log
p( |) p(z | , )
q(z |)
q(z |)
=
D
z
log
p( |)
q(z |)
log P(D|)
D
z
log
q(z |)
p(z | , )
q(z |)
KL(q(z|) p(z|,)) 0
8/24
Building a Lower-Bound
Question: How to ensure that
KL(q(z |) p(z | , ))
is small, so that maximizing the lower-bound with gives a good approximation
of the true log-likelihood?
Chicken & egg problem:
If we use a simple q, e.g. uniformly distributed, we get a loose
lower-bound from which we get a bad .
To get a tight lower-bound, we need to choose q(z |) p(z | , ) which
requires that we know .
9/24
The Expectation-Maximization (EM) Algorithm
Strategy: Start with some random solution and alternately estimate q and ,
until we converge.
10/24
The Expectation-Maximization (EM) Algorithm
0 random()
q1(z |) p(z | , 0) (E-step)
1 argmax
D
z
log
p( , z |)
q1(z |)
q1(z |) (M-step)
q2(z |) p(z | , 1) (E-step)
2 argmax
D
z
log
p( , z |)
q2(z |)
q2(z |) (M-step)
11/24
The Expectation-Maximization (EM) Algorithm
q
(0,q1)
(n,qn)
Properties of the algorithm:
Block coordinate descent
Locally optimal step size
The algorithm lands in a local
minimum of the function p(D|).
Advantages of EM compared to gradient
descent:
no learning rate
no need to compute the gradients.
12/24
Application: Cluster Analysis
PCA of Iris dataset PCA of Boston dataset
PCA of Diabetes dataset PCA of Digits dataset
Question: Can we build a model of the cluster structure of the data?
13/24
The Gaussian Mixture Model (GMM)
The Gaussian Mixture Model is a latent variable model specialized for
clustering.
p(x|z,)
p(z|)
data
Latent variable
descripon
(general
formulaon)
x
z
p(x|z=i,)
p(z=i|)
data
Gaussian
mixture
model
x
i=1 i=2 i=3 i=4
The latent variable z of a GMM is an integer between 1 and C . Each
value of z indicates a cluster of the data.
For each cluster, we associate a different data-generating distribution
(e.g. Gaussian with different means and covariances).
14/24
The Gaussian Mixture Model (GMM)
The GMM is formally defined by the two equations:
p(z = i |) = i
p( |z = i , ) = (2)d/2 |i |1/2 exp
1
2
( i )
1
i (x i )
The parameters of the model to be learned are:
The mixing coefficients (i )Ci=1 subject to i 0 and
C
i=1
i = 1
The mean vectors (i )
C
i=1.
The covariances (i )Ci=1 subject to positive semi-definiteness.
Various simplifications of the GMM model can be used in practice, e.g. for
speed, statistical robustness, or simplicity of implementation.
diagonal/isotropic/tied/fixed covariance matrices,
fixed mixing coefficients, etc.
15/24
EM for the GMM (simplified)
Consider the simplified GMM:
p(z = i |) = 1
C
p( |z = i , ) = 1
(/)d/2
exp( i
2
)
E-step: (Apply Bayes rule)
q(z = i |) = p(z = i |) p( |z = i , )
p( |) =
exp( i2)C
i=1
exp( i2)
M-step: (Set lower-bound gradient to zero)
D
C
i=1
log(exp( i
2
)) q(z = i |) = 0
i =
D q(z = i |)
D q(z = i |)
16/24
Implementing GMMs
The (simplified) GMM model can be implemented easily in few lines of code:
17/24
GMM Demo (1st try)
18/24
GMM Demo (2nd try)
19/24
Gaussian Mixture Model (Summary)
Number of clusters need to be determined in advance.
EM converge to some local minima after a few iterations
EM is non-convex, initialization matters.
For a better result, run the algorithm multiple times and retain the best
solution.
20/24
The K-Means Algorithm
Cluster assignments step
c() = argmin
k
k
2
Centroid update step
k =
:c()=k
:c()=k
1
K-means can be interpreted as the limit case of EM for Gaussian
distributions with .
K-means produces hard-assignments (i.e. data points belong to a single
cluster).
K-means has the same limitations as GMM (non-convex, predefined
cluster shape).
K-means has only one parameter: the number of clusters.
21/24
Kernelizing K-Means
Problem: Standard k-means requires clusters to have round
shape. It fails on banana datasets.
Idea: Replace by the nonlinear feature map () and de-
fine i =
j
ij( j), i.e. centroids become weighted com-
binations of data points.
1. Cluster assignments step:
c() = argmin
i
() i
2
= argmax
i
2
j
ijk( , j) i Ki
2. Centroid update step:
i = argmin
i
:c()=i
() i
2
= K
1
:c()=i
k(X , )
:c()=i
1
Standard k-means
Kernel k-means
22/24
Other Cluster Algorithms
Deep k-means
Hierarchical clustering
Divisive clustering
Agglomerative clustering
DBScan
Affinity Propagation
23/24
Beyond Clustering
Mixture model
(todays lecture)
Product of experts
(next weeks lecture)
24/24
Summary
Latent variable models is a powerful approach to model complex data
distribution.
Expectation-maximization is a framework to efficiently optimize these
models.
Latent variable models can be used for clustering, e.g. the Gaussian
mixture model.
The well-known k-means algorithm can be seen as a limit case of the
Gaussian mixture model.
K-means can be kernelized to produce nonlinear boundaries between
clusters.
1/24
Outline
Covered content:
Products of Experts (PoE)
Restricted Boltzmann Machine (RBM)
Structure of an RBM
RBM learning algorithms
Application of RBMs
Reference:
Hinton, Geoffrey E. (2002). Training Products of Experts by Minimizing Contrastive
Divergence (PDF). Neural Computation. 14 (8): 17711800
2/24
Beyond Clustering
Mixture model
(last weeks lecture)
Product of experts
(todays lecture)
3/24
Mixture model vs. Product of experts
Mixture Model: Mixture models are commonly used for clustering problems and the
probability is defined as a weighted sum of probability distributions:
p(x|, ) =
C
k=1
k p(x|k)
with k denoting the mixing coefficients subject to the constraints
C
k=1 k = 1 and k 0.
Product of Experts (PoE): Products of experts are commonly used for learning dis-
tributed representations (e.g. unsupervised feature extraction) and are defined as a product
of functions.
p(x|) = 1Z
H
j=1
g(x|j)
jth
expert
where Z is a normalization term set to ensure that p(x|) integrates to 1.
4/24
Example 1: Product of Gaussians
Consider the experts to be univariate Gaussians spe-
cialized on a particular input dimension
g(x|i) =
1
(22i )
1/2
exp
(xi i)
2
22i
The product of experts (PoE) can be developed as:
p(x|) =
d
i=1
g(x|i)
=
1
(2)d/2||1/2
exp
1
2
(x )1(x )
with = diag(21 , . . . ,
2
d), i.e. a multivariate Gaussian
distribution without correlation between features.
5/24
Example 2: Product of Gaussian Mixtures
Let the experts be the univariate mixtures:
g(x|i) =
C
k=1
1/2 exp((xi ik)2)
The product of experts (PoE) can be developed as:
p(x|) = 1Z
d
i=1
C
k=1
1/2 exp((xi ik)2)
=
1
Z
C
k1=1
C
kd=1
d
i=1
1/2 exp((xi iki)2)
multivariate Gaussian
i.e. a mixture of exponentially many (Cd) multivariate
Gaussians. Therefore, a PoE can encode many varia-
tions with few parameters.
6/24
Example 3: Product of t-Student Distributions
Define experts to be t-Student distributions in some
projected space (z = wj x):
g(x|j) =
1
j + (w
j x)
2
wj = 1
The resulting product of experts
p(x|) = 1Z
H
j=1
1
j + (w
j x)
2
withH the number of experts, produces a non-Gaussian
multivariate distribution, which can be useful to model
e.g. image or speech data. This PoE has connections
to other analyses, e.g. independent component analysis
(ICA).
7/24
The Restricted Boltzmann Machine
input
latent
0 1 0 0 1
0 1 0 0 1 1
The restricted Boltzmann machine (RBM) is a joint probability model defined over input
features x {0, 1}d and latent variables h {0, 1}H .
p(x,h|) = 1Z exp
d
i=1
H
j=1
xiwijhj +
H
j=1
bjhj
The parameter wij can be interpreted as the connection strength between input feature xi
and latent variable hj . The larger wij the stronger xi and hj co-activate.
8/24
The Restricted Boltzmann Machine (PoE View)
Connection between RBM and PoE
The RBM, when marginalized over its hidden units, has the structure of a Product of
Experts with g(x, j) = (1 + exp(w
j x+ bj)).
Proof:
p(x|) =
h{0,1}d
p(x,h|)
=
h{0,1}d
1
Z
exp
d
i=1
H
j=1
xiwijhj +
H
j=1
bjhj
=
1
Z
h{0,1}d
H
j=1
exp((wj x+ bj) hj)
=
1
Z
H
j=1
hj{0,1}
exp((wj x+ bj) hj)
=
1
Z
H
j=1
(1 + exp(wj x+ bj))
9/24
Interpreting the RBM Experts
The experts
g(x, j) = 1 + exp(w
j x+ bj)
forming the RBM implement two behaviors:
wj x+ bj > 0 : g(x; j) 1: the example x is in
the area of competence of the expert and the
latter speaks in favor of x.
wj x+ bj < 0 : g(x; j) 1: the example x isoutside the experts area of competence, and theexpert withdraws, i.e. multiplies by 1.This double behavior is important to implement thenonlinearity.10/24RBM as a Neural NetworkThe product of expert (PoE) modelp(x|) = 1ZHj=1(1 + exp(wj x+ bj))can be rewritten as:log p(x|) =Hj=1log(1 + exp( softpluswj x+ bj)) f(x,) logZwhere f(x, ) can be interpreted as a neural network.xfsoftplusw(x)Note: The neural network can predict which examples x are more probable relative toother examples, but not the actual probability score, because logZ is difficult to compute.11/24RBM as a Neural NetworkRBM trained on MNIST digits (only digits 3 with 100-125 foreground pixels).2% least likely 2% most likely Observations: Digits with anomalies are predicted the RBM to be less likely than clean digits.12/24RBM as a Neural NetworkUniversal approximationThe RBM is an universal approximator of data distributions in the binary input space{0, 1}d.Proof by construction: Add a softplus neuron pointing to each corner of the hypercube.Scale the weight wj according to the probability of each corner, and choose the bias accord-ingly.Note: This construction requires exponentially many neurons. In practice, each expert jcaptures more than a single corner of the hypercube (i.e. learn general features).13/24RBM as a Neural NetworkRBM weights ((wj)Hj=1) K-Means centroidsObservation In the RBM, experts (wj)j encode only part of the digit (e.g. a stroke). The experttherefore contributes to make all data containing that particular stroke more likely. The experts that the RBM extracts can be used for transfer learning, e.g. reused topredict different digits/characters. K-means centroids are much more localized, and consequently less transferable.14/24Learning an RBMRecap: mixture model (EM bound)The mixture model can be optimized by finding the optimum of a lower-bound at eachiterationlog p(D|) =Nn=1logCk=1kp(xn|k)kk Nn=1Ck=1logkp(xn|k)kkwhere N is the number of data points, and (k)k and (k)k are distributions over the Cmixture elements.Question: Can the same EM approach be used for Product of Experts?log p(D|) =Nn=1f(xn, ) logZAnswer: No, the expression cannot be bounded in the same way as for the mixture model(it has a different structure).15/24Gradient of the RBMIdea: Although EM is not applicable, we can still compute the gradient of the log-likelihoodand perform gradient descent.log p(D|) = Nn=1f(xn, ) logZ=Nn=1f(xn, ) logx{0,1}dexp(f(x, ))=Nn=1 f(xn, )x{0,1}d exp(f(x, ))f(x, )x{0,1}d exp(f(x, ))=Nn=1f(xn, )NExp(x|) f(x, )The gradient is a difference between a data-dependent and a model-dependent term.16/24RBM Update RuleBased on the gradient calculation above, we can build the update rule: + 1NNn=1f(xn, ) Exp(x|) f(x, )Interpretation: The first term makes the data more likely according to the model. The second term makes everything that the model considers likely, less likely. The training procedure terminates when we reach some equilibrium.datap(x)data17/24Computation of the RBM update rule + 1NNn=1f(xn, ) Exp(x|) f(x, )Observations The left hand side is easy to compute (backprop in f(x, ), averaged over all datapoints). The right hand side is more tricky, because p(x|) is defined over exponentially manystates. An unbiased approximation of the expected gradient can be obtained bygenerating a sample {x1, . . . ,xm} from the model distribution p(x|).Question: How do we sample from p(x|)? Idea: Switch back to the classical (i.e. non-PoE) view of the RBM, where we can uselatent variables to ease sampling.18/24Recap: Classical View of the RBMinputlatent0 1 0 0 10 1 0 0 1 1The RBM is a probability model defined over input features x {0, 1}d andand latent variables h {0, 1}H .p(x,h|) = 1Z exp di=1Hj=1xiwijhj +Hj=1bjhjThe parameter wij can be interpreted as the connection strength betweeninput feature xi and latent variable hj . The larger wij the stronger xi and hjco-activate.19/24Conditional Distributions in an RBMSampling p(x) and p(h) is hard, however, hidden variables h can be sampled easily andindependently when we condition on the state x, and similarly for x conditioned on h. Letzj =i xiwij + bj . We proceed as:p(h|x, ) =1Z expjzjhjh1Z expjzjhj =jexp(zjhj)jhjexp(zjhj)=jexp(zjhj)1 + exp(zj) p(hj |x,)and we observe that, conditioned on x, the latent variables (hj)j can be sampled easily(Bernoulli distributions) and independently.By symmetry of the RBM model, we get a similar result for the visible units conditioned onthe latent variables, i.e. p(xi|h, ) = exp(zihi)/(1 + exp(zi)) with zi =j wijhj .20/24Block Gibbs SamplingObservation: We know p(hj |x, ) and p(xi|h, ).But how do we sample p(x|), which we need tocompute the RBM gradient?Block Gibbs Sampling: Start with some ran-dom input x(0), then sample alternately:h(0) p(hj |x(0), )Hj=1x(1) p(xi |h(0), )di=1h(1) p(hj |x(1), )Hj=1x(2) p(xi |h(1), )di=1…until x() converges in distribution to p(x|).21/24Fast ApproximationsIn practice, Gibbs sampling can take a long time toconverge. Instead, we can use fast approximationsof converged Gibbs sampling.Common approximation strategies: Contrastive divergence (CD-k): Start froman existing data point, and perform k stepsof alternate Gibbs sampling. Persistent contrastive divergence (PCD):Start from the Gibbs sample x in theprevious iteration of gradient descent, andperform one step of Gibbs sampling.22/24Application: RBM for Data DenoisingSuppose you receive an example xnoisy, and wouldlike to denoise it. A reconstruction of that digit canbe obtained by mapping to the latent variables andprojecting back:1. Projection on latent variablesh = sigm(Wxnoisy + b)2. Backprojection on the input domainxrec = sigm(Wh)Before denoisingAfter denoising23/24Application: RBM for Representation LearningThe RBM model can be used as a neural net-work initialization to achieve lower generaliza-tion error on some classification task. Step 1: Create a layer of features basedon the RBM learned parameters:(x) =sigm(w1 x)…sigm(wHx) Step 2: Add a top layer that maps(x) to the class probabilities, and trainthe resulting neural network end-to-endusing gradient descent.MNIST example:Source: Erhan et al. (2010) Why Does Unsuper-vised Pre-training Help Deep Learning?24/24Summary The Product of Experts is an unsupervised learning approach that is substantiallydifferent from Mixture Models. Product of experts such as the RBM are optimized by gradient descent (instead ofEM). The RBM has several desirable properties: Simple gradient, block Gibbs sampler(quite fast). The RBM can be used for various tasks, e.g. probability modeling, data denoising,representation learning.1/24Outline What is Explainable AI? Desiderata of an Explainable AI technique Uses of Explainable AI Methods for Explainable AI Activation Maximization Shapley Values Taylor Expansions Layer-wise Relevance Propagation2/24What is Explainable AI?Standard machine learning: The function f is typically considered to be a black-box whose parameters are learnedfrom the data using e.g. gradient descent. The objective to minimize encourages thepredictions f (x) to coincide with the ground truth on the training and test data.x1f(x)x2xdx…Machine learning + Explainable AI: We do not only look at the outcome f (x) of the prediction but also at the way theprediction is produced by the ML model, e.g. which features are used, how these featuresare combined, or to what input pattern the model responds the most.3/24What is Explainable AI?Example 1: Synthesize an input pattern that most strongly activates the output of the MLmodel associated to a particular class.Image source: Nguyen et al. (2016) Multifaceted Feature Visualization: Uncovering the Different Types ofFeatures Learned By Each Neuron in Deep Neural Networks4/24What is Explainable AI?Example 2: Highlight features that have contributed for a given data point to the ML prediction.image image image imageheatmap(bike)heatmap(person)heatmap(cat)heatmap(person)image image imageheatmap(train)heatmap(train)heatmap(dining table)Image source: Lapuschkin et al. (2016) Analyzing Classifiers: Fisher Vectors and Deep Neural Networks5/24What is Explainable AI?Example 3: Concept activation vectors (TCAV). Highlight the mid-level concepts that explain,for a given data point, the ML prediction.Source: Google Keynote19 (URL: https://www.youtube.com/watch?v=lyRPyRKHO8M&t=2279s)6/24Desiderata of an ExplanationIn practice, we would like the explanation technique to satisfy a numberof properties:1. Fidelity: The explanation should reflect the quantity beingexplained and not something else.2. Understandability: The explanation must be easilyunderstandable by its receiver.3. Sufficiency: The explanation should provide sufficientinformation on how the model came up with its prediction.4. Low Overhead: The explanation should not cause the predictionmodel to become less accurate or less efficient.5. Runtime Efficiency: Explanations should be computable inreasonable time.see also Swartout & Moore (1993), Explanation in Second Generation Expert Systems.imageheatmap (train)7/24Uses of an ExplanationVerify (and improve?) a ML model Verify that the model is based on featureswhich generalize well to examples outsidethe current data distribution (this cannotbe done with standard validationtechniques!). Reliance of the ML models on wrongfeatures is often encountered when thereare spurious correlation in the data. From the explanation, the modelstrustworthiness can be reevaluated, andthe flawed ML model can be potentiallyretrained based on the user feedback.8/24Uses of an ExplanationExample: The classifier is right for the wrong reasonsimage heatmap (horse)aer bic bir boa bot79.08 66.44 45.90 70.88 27.64bus car cat cha cow69.67 80.96 59.92 51.92 47.60din dog hor mot per58.06 42.28 80.45 69.34 85.10pot she sof tra tvm28.62 49.58 49.31 82.71 54.33average precision of the Fisher Vectormodel on the PascalVOC dataset In this example, the classifier accurately predicts the horse class, but based on the wrongfeatures (some copyright tag in the corner). This incorrect decision strategy cannot be detected by just looking at the test error.cf. Lapuschkin et al. (2019) Unmasking Clever Hans Predictors and Assessing What Machines Really Learn.Nature Communications9/24Uses of an ExplanationLearn something about the data(or about the system that produced the data) Step 1: Train a ML model that predictswell the data. Step 2: Apply XAI to the trained MLmodel to produce explanations of the MLdecision strategy. Step 3: Based on the XAI explanations,the user can compare his reasoning withthat of the ML model, and can potentiallyrefine his own domain knowledge.Image source: Thomas et al. (2019) AnalyzingNeuroimaging Data Through Recurrent DeepLearning Models10/24Part II: Methods of XAIPresented methods Activation maximization Shapley values Taylor expansion Layer-wise relevance propagationOther methods Surrogate models (LIME) Integrated gradients / expected gradients / SmoothGrad Influence functions . . .11/24Activation MaximizationAssume a trained a ML model (e.g. a neural network), and we would like to understand whatconcept is associated to some particular output neuron of the ML model, e.g. the output neuronthat codes for the class cat. Activation maximization proceeds in two steps: Step 1: Think of the ML model as a function of the input Step 2: Explain the function f by generating a maximally activating input pattern:x = arg maxxf (x, )12/24Activation MaximizationProblem: In most cases f (x) does not have single pointcorresponding to the maximum.E.g. in linear models, f (x) = wx + b, wecan keep moving the point x further along thedirection w , and the output continues to grow).Therefore, we would like to apply a preference for regularregions of the input domain, i.e.x = arg maxxf (x) + (x)In practice, the preference can be for data points withsmall norm (i.e. we set (x) = x2 so that pointswith large norm are penalized.)13/24Activation Maximization: Examplesf (x) = wx + b and (x) = x2 f (x) = max(x1, x2) and (x) = x214/24Activation Maximization: Probability ViewAssume the model produces a log-probability for class c :f (x) = log p(c |x)The input x that maximizes this function can be inter-preted as the point where the classifier is the most sureabout class c .Choose the regularizer (x) = log p(x), i.e. favor pointsthat are likely.The optimization problem becomes:x = arg maxxlog p(c |x) + log p(x)= arg maxxlog p(x|c)where x can now be interpreted as the most typical inputfor class c .15/24Attribution of a Prediction to Input FeaturesMLblackboxprediconinputaribuon1. The data x Rd is fed to the ML black-box and we get a prediction f (x) R.2. We explain the prediction by determining the contribution of each input feature.Key property of an explanation: conservation (di=1 i = f (x)).16/24Attribution: Shapley Values Framework originally proposed in the contextof game theory (Shapley 1951) for assigningpayoffs in a cooperative game, and recentlyapplied to ML models. Each input variable is viewed as a player, andthe function output as the profit realized bythe cooperating players.The Shapley values 1, . . . , d measuring the contribution of each feature are:i =S : i /S|S|!(d|S|1)!d!f (xS{i}) f (xS)where (xS)S are all possible subsets of features contained in the input x.17/24Attribution: Shapley ValuesRecall: i =S : i /S|S|!(d|S|1)!d! Sf (xS{i}) f (xS) SWorked-through example: Consider the function f (x) = x1 (x2 + x3). Calculate the contri-bution of each feature to the prediction f (1) = 1 (1 + 1) = 2.18/24Attribution: Taylor Expansions Many ML models f (x) are complex andnonlinear when taken globally but are simpleand linear when taken locally. The function can be approximated locally by some Taylor expansion:f (x) = f (x) +di=1[f (x)]i (xi xi) i+ . . . First-order terms i of the expansion can serve as an explanation. The explanation (i)i depends on the choice of root point x.19/24Attribution: Taylor ExpansionsExample: Attribute the prediction f (x) = wx with x Rd on the d input features.20/24Attribution: Taylor ExpansionsLimitations: Gradient information is too local-ized. Cannot handle saturation effects anddiscontinuities e.g. cannot explain thefunctionf (x) =di=1xi max(0, xi )at the point x = (2, 2).This limitation can be overcome by looking atthe structure of the model and decompose theproblem of explanation in multiple parts ( nextslide).21/24Attribution: Look at the Structure of the Modela4a3a5a6x1x2yout-1111-1-11-111wij vjObservation: The function implemented by a ML modelis typically a composition of simpleelementary functions. These functions are simpler to analyzethan the whole input-output function.Idea: Treat the problem of explanation aspropagating the prediction backward inthe input-output graph. The layer-wise relevance propagation(LRP) method implements this approachand can be used to explain ML models( next slide).22/24Attribution: The LRP Methoda4a3a5a6x1x2yout-1111-1-11-111wij vjExample: Consider yout to be the quantity toexplain:Rout yout Step 1: Propagate on the hidden layer6j=3 : Rj ajvj6j=3 ajvjRout Step 2: Propagate on the first layer2i=1 : Ri 6j=3xiwij2i=1 xiwijRjNote: Other propagation rules can be engineered, and choosing appropriate propagation rulesis important to ensure LRP works well for practical neural network architectures.23/24Attribution: The LRP MethodEffect of LRP rules on the explanation (e.g. class castle predicted by a VGG-16 neural network.)VGG-16 Network3x3 @ 643×3 @ 643×3 @ 1283×3 @ 1283×3 @ 2563×3 @ 2563×3 @ 2563×3 @ 5123×3 @ 5123×3 @ 5127×1 @ 40961×1 @ 40961×1 @ 10003×3 @ 5123×3 @ 5123×3 @ 512LRP-0LRP-LRP-LRP-0LRP-LRP-24/24Summary Explainable AI is an important addition to classical ML models (e.g. for validating a MLmodel or extracting knowledge from it). Many XAI methods have been developed, each of them, with their strengths andlimitations: Activation maximization can be used to understand what a ML model has learned,but is unsuitable for explaining an individual prediction f (x). Shapley value has strong theoretical foundations, but is computationally unfeasiblefor high-dimensional input data. Taylor expansions are simple and theoretically founded for simple models, but theexpansion does not extrapolate well in complex nonlinear models. LRP leverages the structure of the ML model to handle nonlinear decision functions,but requires to carefully choose the propagation rules.1/29Machine Learning 1, Course Outline Covered topics Bayes Decision Theory, Parameter Estimation Component/Discriminant Analyses (PCA, Fisher) Model Selection, Learning Theory SVMs, Regression, Neural Nets, Boosting, Decision Trees Clustering, Latent Variable Models Weekly homeworks Elective course (only for Machine Learning 1-X)2/29Machine Learning 1, Important DatesRegistering homework groups Request need to be sent before Friday 6 November 2020 at 14:00.First homework submission deadline Monday 9 November 2020 at 14:00.Exam registration Opens in January 2021. To be able to register, you need to firstpass the prerequisites (homeworks / elective) and then request anUbungsschein/Genehmigung.Exams March 2021, two sessions (you can choose the session you prefer),exact exam dates still to be determined.3/29Textbook ReferenceDuda et al. Pattern Classification, 2nd Edi-tion (2000) The first four lectures will be basedon this textbook. This week: Sections 2.1-2.6;2.94/29Example of a Machine Learning ClassifierInformation about the input (fishon the conveyor belt) is firstacquired through sensors (e.g.wage, camera, etc.)The data is preprocessed (e.g.cropping, rescaling, normaliza-tion).Then a finite number of featuresis extracted.Finally, a classifier can be learnedon these features.Image source: Duda et al. (2000)5/29Single-Feature Example (Length)6/29Two-Feature Example (Length + Lightness)7/29Bayesian Decision TheoryGoals: Merging empirical observations with prior knowledge about thetask (e.g. prior class probabilities). Incorporating distributional assumptions (e.g. Gaussian-generateddata, independence of features). Build optimal classifiers (in terms of classification accuracy or interms of costs).8/29Bayesian Decision TheoryNotation: 1,2, . . . Different classes (e.g. salmon, sea bass, …) x Rd vector of observations(e.g. x1 is the length and x2 is the lightness).we now add probabilities … P(j ): Probability of being of class j . p(x): Data density function (e.g. histogram) p(x |j ): Class-conditioned data density function (e.g. histogram). P(j | x): Probability of being of a certain class after observing x.9/29Bayes TheoremP(j | x) =p(x |j ) P(j )p(x)Image source: Duda et al. (2000)10/29Optimally Accurate ClassifierDecide1 if P(1 | x) > P(2 | x)
2 else.
Alternate formulations:
11/29
Multivariate Normal Distributions
P(x|j ) =
1
(2)ddet(j )
exp
1
2
(x j )1j (x j )
j is the mean (center of the
data distribution)
j is the covariance
(elongation of the data
distribution).
Image source: Duda et al. (2000)
12/29
Optimal Classifier for Gaussians (1 = 2)
Note: logP(x|j ) = d2 log 2
1
2
log det() 1
2
(x j )1(x j )
13/29
Optimal Classifier for Gaussians (1 = 2)
Image source:
Duda et al. (2000)
Decision boundary is linear and oriented by mean and covariance.
Offset is controlled by class prior probabilities.
14/29
Optimal Classifier for Gaussians (1 = 2)
When covariances 1 and 2
are not the same, the decision
boundary is quadric instead of
linear. Quadrics include circle,
ellipse, parabola, hyperbola,
and degenerate forms.
Image source: Duda et al. (2000)
15/29
Binary Data Distributions
Assume that our data is now binary: x {0, 1}d . With each dimension
generated independently according to some parameters p and q.
P(xi = 0 |1) = 1 pi P(xi = 0 |2) = 1 qi
P(xi = 1 |1) = pi P(xi = 1 |2) = qi
The probability of the whole multivariate observation can be written as:
P(x|1) =
d
i=1
pixi + (1 pi ) (1 xi )
Question: How to express the decision boundary
P(1|x) > P(2|x)
16/29
Optimal Classifier for Binary Data (1st try)
Recall: P(x|1) =
d
i=1
pixi + (1 pi ) (1 xi )
17/29
Optimal Classifier for Binary Data
Observation:
For binary data, the data likelihood
P(x|1) =
d
i=1
pixi + (1 pi ) (1 xi )
can be equivalently expressed as
P(x|1) =
d
i=1
p
xi
i (1 pi )
(1xi )
18/29
Optimal Classifier for Binary Data (2nd try)
Recall: P(x|1) =
d
i=1
p
xi
i (1 pi )(1xi )
19/29
From Binary to Multi-Class
The two-class problem: = (1,2)
decide
1 if P(1 | x) > P(2 | x)
2 else,
can be generalized to multiple classes.
Let = (1,2, . . . ,C ) be our C possible classes. The optimal decision
now takes the form:
decide i if j =i : P(i |x) > P(j |x)
or equivalently
decide i where i = argmax
j
P(j |x)
20/29
Example: Multi-Class Gaussian
Decision function:
decide i
i = argmax
j
P(j |x)
Observation:
The decision surface of a multi-class
classifier can be quite complex and
nonlinear.
Image source: Duda et al. (2000)
21/29
From Classification to Cost Minimization
Example: Suppose you would like to purchase a second-hand car. After
observing the car, you assess that it has a defect with probability
P(defect | x) = 0.1 P(no defect | x) = 0.9
Concretely, the decision you need to take is not to classify the whether
the car has a defect or not, but whether to buy the car or not.
For this, we need to evaluate the cost of each scenario, e.g.
cost ( buy | defect ) = 100.0 cost ( buy | no defect ) = 20.0
cost ( not buy | defect ) = 0.0 cost ( not buy | no defect ) = 0.0
and take the action with the lowest expected cost.
22/29
From Classification to Cost Minimization
Classification Accuracy Maximization:
classify as i where i = argmax
j
P(j |x)
Expected Cost Minimization: Let (i )i be the set of actions. The
expected cost of taking a certain action is given by
R(j |x) =
C
k=1
(j |k)P(k |x)
where (j ,k) is the cost of action j given the class k . We then
decide to
take action i where i = argmin
j
R(j |x)
23/29
Classification Accuracy as a Special Case
Recall: R(j |x) =
C
k=1 (j |k)P(k |x)
Show that the problem of maximum accuracy classification is a special
instance of expected cost minimization with a particular set of actions
(j )j and a particular cost function (j |k).
24/29
Analyzing the Two-Class Case
Recall: R(j |x) =
C
k=1 (j |k)P(k |x)
Show that in the two-action two-class case, changing the cost function
means shifting the decision boundary.
(Note: when differences of lambdas are negative,
we need to stop before applying the log)
25/29
Measuring Classification Error
So far, we have studied what the decision boundary should be in order to
predict optimally.
However, in certain cases, it is also important to determine what is the
expected error of the classifier (e.g. to determine whether the classifier
is good enough for practical use).
The expected error is the probability that the data is of a different class
than the one predicted:
P(Err | x) =
P(1 | x) if decide 2
P(2 | x) if decide 1
For a maximally accurate classifier, this reduces to
P(Err | x) = min{P(1 | x),P(2 | x)}
26/29
Measuring Classification Error
The expected error of this maximally accurate classifier is computed as
the integral of its error probability over the distribution p(x).
P(Err) =
x
P(Err | x)p(x)dx
=
x
min{P(1 | x),P(2 | x)}p(x)dx
This is also known as the Bayes error rate.
Generally, this integral cannot be solved analytically, because of the min
function. Error must instead be bounded or evaluated empirically.
27/29
Bounding the Error of a Classifier
Very basic bound
Observe for binary classification that P(1 | x) = 1 P(2 | x).
The error can be bounded as:
P(Err) =
x
min{P(1 | x),P(2 | x)}p(x)dx
x
0.5 p(x)dx = 0.5
The optimal classifier predicts the correct class at least 50% of the time.
28/29
Slightly Better Bound
Recall: P(j | x) =
p(x |j ) P(j )
p(x)
The error can be bounded as:
P(Err) =
x
min{P(1 | x),P(2 | x)}p(x)dx
=
x
min{p(x |1)P(1), p(x |2)P(2)}dx
x
max{p(x |1), p(x |2)}min{P(1),P(2)}dx
2min{P(1),P(2)}
Additional insight: The optimal classifier improves its accuracy when one
class prior probability is strongly dominant over to the other class.
29/29
Summary
Bayesian decision theory allows to build optimal classifiers based on
prior knowledge about the classes and the data distributions.
For certain simple data distributions, optimal classifiers are linear.
The framework is general enough to extend to multi-class
scenarios and user-defined cost functions.
Error of the optimal classifier is typically harder to compute than
the actual decision function, but it is still possible to compute
bounds on the error.
2020_TUB10
Wojciech Samek & Grgoire Montavon
ML1 Lecture 10: Kernel Ridge Regression 2
ML1 Lecture 10: Kernel Ridge Regression 3
ML1 Lecture 10: Kernel Ridge Regression 4
ML1 Lecture 10: Kernel Ridge Regression 5
ML1 Lecture 10: Kernel Ridge Regression 6
ML1 Lecture 10: Kernel Ridge Regression 7
ML1 Lecture 10: Kernel Ridge Regression 8
ML1 Lecture 10: Kernel Ridge Regression 9
ML1 Lecture 10: Kernel Ridge Regression 10
ML1 Lecture 10: Kernel Ridge Regression 11
ML1 Lecture 10: Kernel Ridge Regression 12
ML1 Lecture 10: Kernel Ridge Regression 13
ML1 Lecture 10: Kernel Ridge Regression 14
ML1 Lecture 10: Kernel Ridge Regression 15
ML1 Lecture 10: Kernel Ridge Regression 16
ML1 Lecture 10: Kernel Ridge Regression 17
ML1 Lecture 10: Kernel Ridge Regression 18
ML1 Lecture 10: Kernel Ridge Regression 19
ML1 Lecture 10: Kernel Ridge Regression 20
ML1 Lecture 10: Kernel Ridge Regression 21
ML1 Lecture 10: Kernel Ridge Regression 22
ML1 Lecture 10: Kernel Ridge Regression 23
ML1 Lecture 10: Kernel Ridge Regression 24
ML1 Lecture 10: Kernel Ridge Regression 25
ML1 Lecture 10: Kernel Ridge Regression 26
ML1 Lecture 10: Kernel Ridge Regression 27
ML1 Lecture 10: Kernel Ridge Regression 28
ML1 Lecture 10: Kernel Ridge Regression 29
ML1 Lecture 10: Kernel Ridge Regression 30
ML1 Lecture 10: Kernel Ridge Regression 31
ML1 Lecture 10: Kernel Ridge Regression 32
ML1 Lecture 10: Kernel Ridge Regression 33
ML1 Lecture 10: Kernel Ridge Regression 34
ML1 Lecture 10: Kernel Ridge Regression 35
ML1 Lecture 10: Kernel Ridge Regression 36
ML1 Lecture 10: Kernel Ridge Regression 37
ML1 Lecture 10: Kernel Ridge Regression 38
ML1 Lecture 10: Kernel Ridge Regression 39
ML1 Lecture 10: Kernel Ridge Regression 40
ML1 Lecture 10: Kernel Ridge Regression 41
ML1 Lecture 10: Kernel Ridge Regression 42
ML1 Lecture 10: Kernel Ridge Regression 43
ML1 Lecture 10: Kernel Ridge Regression 44
ML1 Lecture 10: Kernel Ridge Regression 45
Reviews
There are no reviews yet.