Textbook Reference
Duda et al. Pattern Classification, 2nd Edition (2000)
This week: Sections 3.1–3.5
Recap: Bayes Decision Theory
Recap:

Knowing the class priors and class-conditional data densities, we can infer posterior probabilities using Bayes' theorem:

P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}

From there, optimal classifiers can be built, e.g. the maximum accuracy classifier:

\arg\max_j P(\omega_j \mid x) = \arg\max_j \big[ \log p(x \mid \omega_j) + \log P(\omega_j) \big]

Question:

Can we assume that we know the data densities p(x \mid \omega_j) in advance, or should they be learned from the data?
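As a reminder of how this decision rule is applied once the densities are known, a minimal sketch in plain Python (the two 1-D Gaussian class densities and the priors below are hypothetical, assumed given):

```python
import math

# Hypothetical known quantities: two classes with 1-D Gaussian densities
priors = {1: 0.4, 2: 0.6}
params = {1: (0.0, 1.0), 2: (2.0, 1.5)}     # (mean, std) per class

def log_density(x, mean, std):
    # log of the univariate Gaussian density N(mean, std^2) at x
    return -math.log(std * math.sqrt(2 * math.pi)) - 0.5 * ((x - mean) / std) ** 2

def classify(x):
    # argmax_j  log p(x | omega_j) + log P(omega_j)
    scores = {j: log_density(x, *params[j]) + math.log(priors[j])
              for j in priors}
    return max(scores, key=scores.get)

print(classify(-0.5), classify(2.5))        # points near each class mean
```

Points close to a class mean (relative to its spread and prior) get assigned to that class.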
Learning p(x | j ): Histogram Approach
Idea:
Build a grid in input space, and for each class build a histogram that counts the number of observations that belong to each bin.
Problem:
The number of bins is s^d, where s is the number of steps along each dimension and d is the number of dimensions.
Many bins will have zero observations, not because the data probability is truly zero, but because finitely many examples have been observed.
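This sparsity is easy to see numerically. A minimal sketch in plain Python (the 2-D uniform data, grid resolution, and sample size are all made up for illustration):

```python
import random

random.seed(0)
s, d = 10, 2                       # s bins per dimension, d dimensions
# 50 samples of one hypothetical class, uniform in [0, 1)^d
data = [[random.random() for _ in range(d)] for _ in range(50)]

counts = {}                        # bin index tuple -> number of observations
for x in data:
    idx = tuple(min(int(v * s), s - 1) for v in x)
    counts[idx] = counts.get(idx, 0) + 1

total_bins = s ** d                # grows exponentially with d
empty_bins = total_bins - len(counts)
print(total_bins, empty_bins)      # with 50 points, at least 50 of 100 bins are empty
```

Already at d = 10 the same grid would have 10^10 bins, far more than any realistic dataset can populate.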
Learning p(x | j ): Model Approach
Idea:
Assume that p(x \mid \omega_j) has a known parametric form, e.g. N(\mu_j, \Sigma_j), where \mu_j, \Sigma_j are the parameters of the distribution.
Estimate the parameters of the distribution that best fit the few observations we have.
Advantage:
With the model-based approach, we need to estimate a finite and typically small number of model parameters and not the whole data distribution.
Maximum Likelihood Estimation
Goal: Let \{p(x \mid \theta),\ \theta \in \Theta\} be a set of density functions (e.g. Gaussian), where \theta denotes a parameter vector (e.g. mean / covariance). We would like to find the parameter \theta that is the most likely with respect to the data.

Maximum Likelihood (ML) Approach:

Assume that we have a dataset D = (x_1, \dots, x_N).

Assume that each example x_k \in \mathbb{R}^d in the dataset has been generated independently from the same density function p(x \mid \theta).

In that case, the joint density function can be written as:

p(D \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)

We also call this quantity the likelihood of \theta w.r.t. the dataset D.

The best parameter is then given by \hat{\theta} = \arg\max_\theta\, p(D \mid \theta).
Maximum Likelihood, Simple Gaussian Case
Assume the data density is modeled as a univariate Gaussian with unit variance and unknown mean \theta. For a given data point x_k, the density function can be written as:

p(x_k \mid \theta) = \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} (x_k - \theta)^2 \Big)

Using the independence assumption, the joint density becomes:

p(D \mid \theta) = \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} (x_k - \theta)^2 \Big)

Question: How to find the \theta that maximizes the function p(D \mid \theta)?
Finding the Maximum of a Function
For some function f of interest (e.g. the data likelihood) we would like to compute:
\hat{\theta} = \arg\max_\theta f(\theta)
When the function to optimize is continuously differentiable and concave, the maximum is found at the point where the gradient is zero.
\nabla f(\theta) = \begin{pmatrix} \partial f / \partial \theta_1 \\ \partial f / \partial \theta_2 \\ \vdots \\ \partial f / \partial \theta_h \end{pmatrix} = 0
Maximum Likelihood, Simple Gaussian Case
Observation: The function p(D \mid \theta) is not concave in \theta.

Idea: Transform the function in a way that (i) doesn't change the location of its maximum and (ii) makes it concave.

Applying the logarithm ensures the two properties above:

\log p(D \mid \theta) = \log \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\Big( -\frac{1}{2} (x_k - \theta)^2 \Big)
= \sum_{k=1}^{N} \Big[ -\frac{1}{2} \log(2\pi) - \frac{1}{2} (x_k - \theta)^2 \Big]
Maximum Likelihood, Simple Gaussian Case
Having found the log-likelihood w.r.t. D to be

\log p(D \mid \theta) = \sum_{k=1}^{N} \Big[ -\frac{1}{2} \log(2\pi) - \frac{1}{2} (x_k - \theta)^2 \Big],

the best parameter can then be found by solving \nabla_\theta \log p(D \mid \theta) = 0.
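Carrying out this last step explicitly (a short derivation; the result is consistent with the empirical-mean estimate quoted on the later ML vs. Bayes slide):

\frac{\partial}{\partial \theta} \log p(D \mid \theta) = \sum_{k=1}^{N} (x_k - \theta) = 0
\quad \Longrightarrow \quad
\hat{\theta} = \frac{1}{N} \sum_{k=1}^{N} x_k

i.e. the ML estimate of the unknown mean is the empirical mean of the data.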
Maximum Likelihood, Multivariate Case
The log-likelihood of a multivariate Gaussian distribution w.r.t. D is given by

\log p(D \mid \mu) = \sum_{k=1}^{N} \Big[ -\frac{1}{2} \log\big( (2\pi)^d \det(\Sigma) \big) - \frac{1}{2} (x_k - \mu)^\top \Sigma^{-1} (x_k - \mu) \Big]

Question: Assuming \Sigma is fixed, what is the optimal parameter vector \mu?
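As in the univariate case, the optimal \mu turns out to be the empirical mean. A small numerical sanity check in plain Python, under the simplifying assumption \Sigma = I and with a hypothetical 2-D dataset:

```python
import math

# Hypothetical 2-D dataset; Sigma is fixed to the identity for simplicity
D = [(0.5, 1.0), (1.5, 2.0), (1.0, 0.0), (2.0, 1.0)]
d = 2

def log_likelihood(mu):
    # log p(D | mu) for N(mu, I): sum_k [ -d/2 log(2 pi) - 1/2 ||x_k - mu||^2 ]
    return sum(-0.5 * d * math.log(2 * math.pi)
               - 0.5 * sum((x - m) ** 2 for x, m in zip(xk, mu))
               for xk in D)

# Empirical mean, the claimed maximizer
mu_hat = tuple(sum(x) / len(D) for x in zip(*D))

# Any perturbation of mu_hat strictly decreases the log-likelihood
for delta in [(0.1, 0.0), (0.0, -0.1), (0.2, 0.2)]:
    mu = tuple(m + e for m, e in zip(mu_hat, delta))
    assert log_likelihood(mu) < log_likelihood(mu_hat)
print(mu_hat)
```

Since the log-likelihood is strictly concave in \mu for fixed \Sigma, the zero-gradient point checked here is the unique maximum.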
Building a Classifier End-to-End with ML
Consider a labeled dataset containing two examples for the first class, and four examples for the second class. Points are in \mathbb{N}_0, and we assume they are generated from classes \omega_1, \omega_2 following geometric probability distributions with unknown parameters \theta_1 and \theta_2 respectively.

Known:   P(\omega_1) = 0.5,  P(\omega_2) = 0.5
Unknown: \theta_1, \theta_2 in
         P(x = k \mid \omega_1) = (1 - \theta_1)^k \theta_1
         P(x = k \mid \omega_2) = (1 - \theta_2)^k \theta_2
Known:   D_1 = \{0, 2\},  D_2 = \{0, 2, 0, 1\}
Building a Classifier End-to-End with ML
Recall:

P(\omega_1) = 0.5,  P(\omega_2) = 0.5
P(x = k \mid \omega_1) = (1 - \theta_1)^k \theta_1
P(x = k \mid \omega_2) = (1 - \theta_2)^k \theta_2
D_1 = \{0, 2\},  D_2 = \{0, 2, 0, 1\}
From ML to Bayes Classifiers
ML classifier: Class posteriors are given by:

P(\omega_i \mid x, \hat{\theta}) = \frac{p(x \mid \omega_i, \hat{\theta}_i)\, P(\omega_i)}{p(x \mid \hat{\theta})}

where \hat{\theta}_i is the maximum likelihood parameter for class \omega_i.

Bayes classifier: Class posteriors are computed by bypassing the intermediate computation of the parameter \theta, and instead conditioning directly on the data:

P(\omega_i \mid x, D) = \frac{p(x \mid D, \omega_i)\, p(D \mid \omega_i)\, P(\omega_i)}{p(x \mid D)\, p(D)}

where

p(x \mid D, \omega_i) = \int p(x \mid \theta_i, D, \omega_i)\, p(\theta_i \mid D, \omega_i)\, d\theta_i

and

p(\theta_i \mid D_i) = \frac{p(D_i \mid \theta_i)\, p(\theta_i)}{\int p(D_i \mid \theta_i)\, p(\theta_i)\, d\theta_i}.
From ML to Bayes Classifiers
Simplified Bayes classifier: Applying the Bayes rule internally, assuming that each class generates its dataset independently, assuming that priors are known, leveraging the fact that new data points are also generated i.i.d., and making the dependence on \omega_i implicit through the dataset D_i or parameter \theta_i, we arrive after some steps at the simplified form:

P(\omega_i \mid x, D) = \frac{p(x \mid D_i)\, P(\omega_i)}{p(x \mid D)}   (1)

where

p(x \mid D_i) = \int p(x \mid \theta_i)\, p(\theta_i \mid D_i)\, d\theta_i   (2)

and

p(\theta_i \mid D_i) = \frac{p(D_i \mid \theta_i)\, p(\theta_i)}{\int p(D_i \mid \theta_i)\, p(\theta_i)\, d\theta_i}.   (3)
We will use this simplified form in the following slides and also in the homeworks.
Building a Classifier End-to-End with Bayes
Recall: we consider the following data generating process
Known:   P(\omega_1) = 0.5,  P(\omega_2) = 0.5
Unknown: \theta_1, \theta_2 in
         P(x = k \mid \omega_1) = (1 - \theta_1)^k \theta_1
         P(x = k \mid \omega_2) = (1 - \theta_2)^k \theta_2
Known:   D_1 = \{0, 2\},  D_2 = \{0, 2, 0, 1\}
Building a Classifier End-to-End with Bayes
Recall:

P(\omega_1) = 0.5,  P(\omega_2) = 0.5
P(x = k \mid \omega_1) = (1 - \theta_1)^k \theta_1
P(x = k \mid \omega_2) = (1 - \theta_2)^k \theta_2
D_1 = \{0, 2\},  D_2 = \{0, 2, 0, 1\}

and we further set:

p(\theta_1) = U(0, 1),  p(\theta_2) = U(0, 1)

i.e. uniform priors over the parameters.
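The Bayes computation on these slides can be carried out in closed form. With a uniform prior, the posterior p(\theta_i \mid D_i) \propto \theta_i^{N_i} (1 - \theta_i)^{S_i} is a Beta(N_i + 1, S_i + 1) density (N_i examples with sum S_i), so the integral in equation (2) reduces to a ratio of Beta functions. A sketch in plain Python (the Beta-function route is a derivation consistent with, but not stated on, the slides):

```python
import math

D1, D2 = [0, 2], [0, 2, 0, 1]
priors = {1: 0.5, 2: 0.5}

def beta_fn(a, b):
    # Euler Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def predictive(k, data):
    # p(x = k | D) = int_0^1 (1-t)^k t * Beta(t; N+1, S+1) dt
    #              = B(N+2, S+k+1) / B(N+1, S+1)
    n, s = len(data), sum(data)
    return beta_fn(n + 2, s + k + 1) / beta_fn(n + 1, s + 1)

def classify(k):
    # argmax_i  p(x = k | D_i) P(omega_i)   (equation (1); p(x | D) cancels)
    scores = {1: predictive(k, D1) * priors[1],
              2: predictive(k, D2) * priors[2]}
    return max(scores, key=scores.get)

print(predictive(0, D1), predictive(0, D2))   # approx. 1/2 and 5/9
print([classify(k) for k in range(4)])
```

A quick check shows the Bayes decision at x = 1 differs from the ML plug-in decision, in line with the observations on the next slide.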
ML vs. Bayes Classifiers
Observations:
ML and Bayes classifiers do not always produce the same decisions
Bayes classifiers are influenced by the prior distribution and are consequently less sensitive to the data.
Bayes classifiers will tend to favor the outcome that is supported by a larger amount of data.
ML vs. Bayes: Gaussian Case
Consider the simple data density p(x \mid \theta) = N(\theta, \sigma^2) with unknown parameter \theta. We would like to compare the ML and Bayes approaches to estimating the parameter \theta from some dataset D = \{x_1, \dots, x_n\}.

ML approach:

The maximum likelihood estimate is given by \hat{\theta} = \frac{1}{n} \sum_{k=1}^{n} x_k, i.e. the empirical mean (cf. previous slides).

Bayes approach:

Assuming some prior distribution p(\theta) = N(\theta_0, \sigma_0^2), the posterior distribution can be computed as:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \alpha \prod_{k=1}^{n} p(x_k \mid \theta)\, p(\theta)

where \alpha is a normalizing factor.
ML vs. Bayes: Gaussian Case
Bayes approach (continued):
The posterior distribution can be expanded as:

p(\theta \mid D) \propto \exp\Big( -\frac{1}{2} \Big( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \Big) \theta^2 + \Big( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\theta_0}{\sigma_0^2} \Big) \theta \Big)

which corresponds to a new Gaussian distribution N(\theta_n, \sigma_n^2) with parameters

\theta_n = \frac{\sigma_0^2}{\sigma^2/n + \sigma_0^2}\, \hat{\theta} + \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}\, \theta_0
\qquad \text{and} \qquad
\sigma_n^2 = \frac{\sigma_0^2\, \sigma^2/n}{\sigma^2/n + \sigma_0^2}

where \hat{\theta} is the empirical mean. We observe that (1) the mean estimate \theta_n is pulled towards the prior mean \theta_0 when little data is available, and (2) the mean estimate comes with a variance term \sigma_n^2 that can be useful for confidence estimation.
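These closed-form posterior parameters can be verified numerically against a brute-force grid computation of the posterior. A sketch in plain Python (the data, \sigma, \theta_0, \sigma_0 values below are hypothetical):

```python
import math

# Hypothetical setup: data density N(theta, sigma^2), prior N(theta_0, sigma_0^2)
sigma, theta0, sigma0 = 1.0, 0.0, 2.0
D = [1.2, 0.8, 1.5, 0.9, 1.1]
n = len(D)
theta_hat = sum(D) / n                        # empirical mean (ML estimate)

# Closed-form posterior parameters from the slide
w = sigma0**2 / (sigma**2 / n + sigma0**2)
theta_n = w * theta_hat + (1 - w) * theta0
var_n = (sigma0**2 * sigma**2 / n) / (sigma**2 / n + sigma0**2)

# Brute-force check: posterior mean by numerical integration over a grid
grid = [i * 1e-3 - 10.0 for i in range(20001)]            # theta in [-10, 10]

def unnorm_posterior(t):
    log_lik = sum(-0.5 * (x - t) ** 2 / sigma**2 for x in D)
    log_prior = -0.5 * (t - theta0) ** 2 / sigma0**2
    return math.exp(log_lik + log_prior)

weights = [unnorm_posterior(t) for t in grid]
Z = sum(weights)
mean_num = sum(t * p for t, p in zip(grid, weights)) / Z
print(theta_n, var_n, mean_num)               # closed form agrees with the grid
```

Note how theta_n interpolates between the empirical mean and the prior mean: the weight w tends to 1 as n grows and to 0 as the dataset shrinks.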
Summary
In practice, the parameters of the class-conditional distributions are not known and must be inferred from some finite dataset.
Two main approaches for learning these parameters: (1) maximum likelihood estimation and (2) Bayesian inference.
Bayesian inference is often more difficult than maximum likelihood estimation (both analytically and computationally), because it requires integration.
Bayesian inference readily incorporates interesting functionalities (inclusion of priors, construction of confidence estimates).