

Introduction to Machine Learning: The EM Algorithm
Prof. Kutty

Generative models


Gaussian Mixture Model (GMM)

Mixture of Gaussians
image source: Bishop 2006

MLE of GMM with known labels: Example
Data: 2.1, 0, 3.5, −1, 1.5, 2.5, −0.5, 0.05, 1, −2, 0, 1, −2, 1.1, −0.5, −0.03

Log-Likelihood for GMMs with known labels
p(S_n) = \prod_{i=1}^{n} p(\bar{x}^{(i)}, z^{(i)}) = \prod_{i=1}^{n} p(\bar{x}^{(i)} \mid z^{(i)})\, p(z^{(i)}) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left[ \gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2) \right]^{\delta(j|i)}

Maximum log-likelihood objective:

\ln p(S_n) = \ln \prod_{i=1}^{n} \prod_{j=1}^{k} \left[ \gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2) \right]^{\delta(j|i)} = \sum_{i=1}^{n} \sum_{j=1}^{k} \delta(j|i)\, \ln\!\left( \gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2) \right)

where \delta(j|i) = 1 if example i has label j (i.e., z^{(i)} = j) and 0 otherwise.

Gaussian Mixture Model (GMM) Model Parameters
How many independent model parameters in a mixture of 4 spherical Gaussians?
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Recall the pdf of a spherical Gaussian:

P(\bar{x} \mid \bar{\mu}^{(j)}, \sigma_j^2) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left( -\frac{\|\bar{x} - \bar{\mu}^{(j)}\|^2}{2\sigma_j^2} \right)

Gaussian Mixture Model (GMM) Model Parameters
How many independent model parameters in a mixture of 4 spherical Gaussians?
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
P(\bar{x} \mid \bar{\mu}^{(j)}, \sigma_j^2) = \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left( -\frac{\|\bar{x} - \bar{\mu}^{(j)}\|^2}{2\sigma_j^2} \right)

\bar{\theta} = [\gamma_1, \ldots, \gamma_k,\ \bar{\mu}^{(1)}, \ldots, \bar{\mu}^{(k)},\ \sigma_1^2, \ldots, \sigma_k^2]
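For the in-class question, the count follows directly from \bar{\theta} above (for d-dimensional data; d is left symbolic since the slides do not fix it):

k\,d \;(\text{means}) \;+\; k \;(\text{variances}) \;+\; (k-1) \;(\text{mixing weights, which sum to } 1) \;=\; k(d+2) - 1

so a mixture of k = 4 spherical Gaussians has 4d + 7 independent parameters.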

MLE for GMMs with known labels
Maximum log likelihood objective
\sum_{i=1}^{n} \sum_{j=1}^{k} \delta(j|i)\, \ln\!\left( \gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2) \right)

MLE solution (given "cluster labels"):

\hat{n}_j = \sum_{i=1}^{n} \delta(j|i)          number of points assigned to cluster j

\hat{\gamma}_j = \hat{n}_j / n          fraction of points assigned to cluster j

\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} \delta(j|i)\, \bar{x}^{(i)}          mean of points in cluster j

\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^{n} \delta(j|i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2          spread in cluster j
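A minimal sketch of these closed-form updates in NumPy, assuming the data and hard labels are given as arrays (the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def gmm_mle_known_labels(X, z, k):
    """Closed-form MLE for a mixture of k spherical Gaussians with known labels.

    X: (n, d) data matrix; z: (n,) integer labels in {0, ..., k-1}.
    Returns mixing weights gamma (k,), means mu (k, d), variances sigma2 (k,).
    """
    n, d = X.shape
    gamma, mu, sigma2 = np.zeros(k), np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        Xj = X[z == j]                                      # points assigned to cluster j
        n_j = len(Xj)                                       # \hat{n}_j
        gamma[j] = n_j / n                                  # fraction of points in cluster j
        mu[j] = Xj.mean(axis=0)                             # mean of points in cluster j
        sigma2[j] = ((Xj - mu[j]) ** 2).sum() / (d * n_j)   # spherical spread in cluster j
    return gamma, mu, sigma2
```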

MLE for GMMs with unknown labels

Parameters of GMMs
Data: 2.1, 0, 3.5, −1, 1.5, 2.5, −0.5, 0.05, 1, −2, 0, 1, −2, 1.1, −0.5, −0.03

Learning the Model Parameters
p(S_n) = \prod_{i=1}^{n} p(\bar{x}^{(i)}) = \prod_{i=1}^{n} \sum_{j=1}^{k} p(\bar{x}^{(i)}, z^{(i)} = j)
 = \prod_{i=1}^{n} \sum_{j=1}^{k} p(\bar{x}^{(i)} \mid z^{(i)} = j)\, p(z^{(i)} = j)
 = \prod_{i=1}^{n} \sum_{j=1}^{k} \gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)

Given the training data, find the model parameters that maximize the log-likelihood:

\ln p(S_n) = \ln \prod_{i=1}^{n} \sum_{j=1}^{k} \gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2) = \sum_{i=1}^{n} \ln \sum_{j=1}^{k} \gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)
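A sketch of evaluating this log-likelihood numerically, using scipy's logsumexp for stability (function and variable names are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(X, gamma, mu, sigma2):
    """Log-likelihood of X under a mixture of spherical Gaussians.

    X: (n, d); gamma: (k,) mixing weights; mu: (k, d) means; sigma2: (k,) variances.
    """
    n, d = X.shape
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)     # (n, k)
    log_norm = -0.5 * d * np.log(2 * np.pi * sigma2)                  # log of (2 pi sigma_j^2)^(-d/2)
    log_joint = np.log(gamma) + log_norm - sq_dist / (2 * sigma2)     # log gamma_j N(x^(i) | mu^(j), sigma_j^2)
    return logsumexp(log_joint, axis=1).sum()                         # sum_i log sum_j
```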

Expectation Maximization for GMMs

Expectation Maximization for GMMs: overview
Iterate until convergence
– E step: use current estimate of mixture model to softly
assign examples to clusters
– M step: re-estimate each cluster model separately based on the points assigned to it (similar to the “known label” case)
(a chicken-and-egg problem: we need the model parameters to estimate the soft cluster assignments, and the cluster assignments to estimate the model parameters)

Expectation Maximization for GMMs
E-step: fix \bar{\theta} = [\gamma_1, \ldots, \gamma_k, \bar{\mu}^{(1)}, \ldots, \bar{\mu}^{(k)}, \sigma_1^2, \ldots, \sigma_k^2] and softly assign points to clusters according to the posterior probability

p(j|i) = \frac{\gamma_j\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)}{\sum_{t=1}^{k} \gamma_t\, N(\bar{x}^{(i)} \mid \bar{\mu}^{(t)}, \sigma_t^2)}

e.g. for the "blue" cluster and datapoint 3, the numerator is (prior of the blue cluster) × (pdf of the blue Gaussian evaluated at \bar{x}^{(3)}).

p(j|i) is the "soft" cluster assignment: given a datapoint \bar{x}^{(i)}, what is the probability that cluster j generated it? It is analogous to \delta(j|i); note that \sum_{j=1}^{k} p(j|i) = 1.
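A sketch of this E-step in NumPy (reusing the spherical-Gaussian density from above; names are illustrative):

```python
import numpy as np

def e_step(X, gamma, mu, sigma2):
    """Soft assignments p(j|i) for a mixture of spherical Gaussians.

    Returns an (n, k) matrix of responsibilities whose rows sum to 1.
    """
    n, d = X.shape
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)               # (n, k)
    density = (2 * np.pi * sigma2) ** (-d / 2) * np.exp(-sq_dist / (2 * sigma2))
    joint = gamma * density                          # gamma_j * N(x^(i) | mu^(j), sigma_j^2)
    return joint / joint.sum(axis=1, keepdims=True)  # normalize over clusters j
```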

Expectation Maximization for GMMs E-step: Example
E-step: softly assign points to clusters according to current guess of
model parameters
p(j|i) = \frac{\gamma_j\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)}{\sum_{t=1}^{k} \gamma_t\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(t)}, \sigma_t^2)}
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Datapoint: x^{(1)} = 0

Cluster    prior \gamma_j    mean    variance
1          0.5               0       1
2          0.3               1       1
3          0.2               3       4

P(x^{(1)} \mid \mu^{(j)}, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left( -\frac{(x^{(1)} - \mu^{(j)})^2}{2\sigma_j^2} \right)

Which is the likeliest cluster for datapoint x^{(1)}?

Expectation Maximization for GMMs E-step: Example
E-step: softly assign points to clusters according to current guess of
model parameters Example
p(j|i) = \frac{\gamma_j\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)}{\sum_{t=1}^{k} \gamma_t\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(t)}, \sigma_t^2)}

Datapoint: x^{(1)} = 0

Cluster    prior \gamma_j    mean    variance    \gamma_j \cdot P(x^{(1)} \mid \mu^{(j)}, \sigma_j^2)
1          0.5               0       1           0.5 * 0.39894
2          0.3               1       1           0.3 * 0.24197
3          0.2               3       4           0.2 * 0.06476

where P(x^{(1)} \mid \mu^{(j)}, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left( -\frac{(x^{(1)} - \mu^{(j)})^2}{2\sigma_j^2} \right)
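Completing the normalization from the numbers above (the final ratios are not printed on the slide):

p(1|1) = \frac{0.5 \cdot 0.39894}{0.5 \cdot 0.39894 + 0.3 \cdot 0.24197 + 0.2 \cdot 0.06476} = \frac{0.1995}{0.2850} \approx 0.70, \qquad p(2|1) \approx 0.25, \qquad p(3|1) \approx 0.05

so cluster 1 is the likeliest cluster for datapoint x^{(1)}.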

Expectation Maximization for GMMs
• M-Step: optimizes each cluster separately given p(j|i)

\hat{n}_j = \sum_{i=1}^{n} p(j|i) \qquad \hat{\gamma}_j = \frac{\hat{n}_j}{n} \qquad \hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \bar{x}^{(i)} \qquad \hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2

Expectation Maximization for GMMs:
M step (note correspondence with known labels)
If you knew the "soft" cluster assignments p(j|i), you could compute the MLE parameters \bar{\theta} as follows; the formulas mirror the MLE for a GMM with known labels, with the hard assignment \delta(j|i) replaced by the soft assignment p(j|i):

\hat{n}_j = \sum_{i=1}^{n} p(j|i)          effective number of points assigned to cluster j

\hat{\gamma}_j = \frac{\hat{n}_j}{n}          "fraction" of points assigned to cluster j

\hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \bar{x}^{(i)}          weighted mean of points in cluster j

\hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2          weighted spread in cluster j
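A sketch of this M-step in NumPy, taking the (n, k) responsibility matrix produced by the E-step sketch above (names are illustrative):

```python
import numpy as np

def m_step(X, resp):
    """Re-estimate spherical-GMM parameters from soft assignments.

    X: (n, d) data; resp: (n, k) responsibilities p(j|i).
    Returns gamma (k,), mu (k, d), sigma2 (k,).
    """
    n, d = X.shape
    n_hat = resp.sum(axis=0)                             # effective counts \hat{n}_j
    gamma = n_hat / n                                    # mixing weights
    mu = (resp.T @ X) / n_hat[:, None]                   # weighted means, (k, d)
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # (n, k)
    sigma2 = (resp * sq_dist).sum(axis=0) / (d * n_hat)  # weighted spread
    return gamma, mu, sigma2
```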

Expectation Maximization for GMMs M-step: Example
• M-Step: optimizes each cluster separately given p(j|i)
\hat{n}_j = \sum_{i=1}^{n} p(j|i) \qquad \hat{\gamma}_j = \frac{\hat{n}_j}{n} \qquad \hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \bar{x}^{(i)} \qquad \hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2
use this link for in-class exercises
https://forms.gle/jqAdK1sSMhcx6zDHA
Compute \hat{n}_1, \hat{\gamma}_1 and \hat{\bar{\mu}}^{(1)} for the datapoints below, with soft assignments to cluster 1 of p(1|1) = 0.2, p(1|2) = 0.1, p(1|3) = 0.4, p(1|4) = 0.7, p(1|5) = 0.8.

Datapoints:
\bar{x}^{(1)} = [0, 1]^T
\bar{x}^{(2)} = [2, 1]^T
\bar{x}^{(3)} = [1, 1]^T
\bar{x}^{(4)} = [0, 2]^T
\bar{x}^{(5)} = [2, 2]^T

Expectation Maximization for GMMs M-step: Example
• M-Step: optimizes each cluster separately given p(j|i)
\hat{n}_j = \sum_{i=1}^{n} p(j|i) \qquad \hat{\gamma}_j = \frac{\hat{n}_j}{n} \qquad \hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \bar{x}^{(i)} \qquad \hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2

Datapoints: \bar{x}^{(1)} = [0,1]^T, \bar{x}^{(2)} = [2,1]^T, \bar{x}^{(3)} = [1,1]^T, \bar{x}^{(4)} = [0,2]^T, \bar{x}^{(5)} = [2,2]^T
Soft assignments to cluster 1: p(1|1) = 0.2, p(1|2) = 0.1, p(1|3) = 0.4, p(1|4) = 0.7, p(1|5) = 0.8

\hat{n}_1 = 0.2 + 0.1 + 0.4 + 0.7 + 0.8 = 2.2

\hat{\gamma}_1 = \frac{\hat{n}_1}{n} = \frac{2.2}{5} = 0.44

\hat{\bar{\mu}}^{(1)} = \frac{1}{\hat{n}_1} \sum_{i=1}^{5} p(1|i)\, \bar{x}^{(i)} = \frac{1}{2.2} \left( 0.2\,[0,1]^T + 0.1\,[2,1]^T + 0.4\,[1,1]^T + 0.7\,[0,2]^T + 0.8\,[2,2]^T \right)

Similarly, compute \hat{\sigma}_1^2 = \frac{1}{d\,\hat{n}_1} \sum_{i=1}^{5} p(1|i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(1)}\|^2.
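Carrying the arithmetic through (the final vector is not printed on the slide):

\hat{\bar{\mu}}^{(1)} = \frac{1}{2.2}\,[\,0 + 0.2 + 0.4 + 0 + 1.6,\;\; 0.2 + 0.1 + 0.4 + 1.4 + 1.6\,]^T = \frac{1}{2.2}\,[2.2,\; 3.7]^T \approx [1,\; 1.68]^T.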

Expectation Maximization for GMMs
EM algorithm for GMM: initialize parameters
• E-step: softly assign points to clusters according to posterior
• M-Step: optimizes each cluster separately given p(j|i)

E-step:  p(j|i) = \frac{\gamma_j\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(j)}, \sigma_j^2)}{\sum_{t=1}^{k} \gamma_t\, P(\bar{x}^{(i)} \mid \bar{\mu}^{(t)}, \sigma_t^2)}

M-step:  \hat{n}_j = \sum_{i=1}^{n} p(j|i), \quad \hat{\gamma}_j = \frac{\hat{n}_j}{n}, \quad \hat{\bar{\mu}}^{(j)} = \frac{1}{\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \bar{x}^{(i)}, \quad \hat{\sigma}_j^2 = \frac{1}{d\,\hat{n}_j} \sum_{i=1}^{n} p(j|i)\, \|\bar{x}^{(i)} - \hat{\bar{\mu}}^{(j)}\|^2

Iterate until convergence.
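Putting the pieces together, a minimal EM loop for a spherical GMM. This combines the e_step, m_step and gmm_log_likelihood sketches above; the initialization and stopping rule are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def em_gmm(X, k, n_iters=100, tol=1e-6, seed=0):
    """EM for a mixture of k spherical Gaussians (sketch; relies on the helpers defined above)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initialize: uniform weights, k random data points as means, unit variances
    gamma = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    sigma2 = np.ones(k)
    prev_ll = -np.inf
    for _ in range(n_iters):
        resp = e_step(X, gamma, mu, sigma2)        # E-step: soft assignments p(j|i)
        gamma, mu, sigma2 = m_step(X, resp)        # M-step: re-estimate parameters
        ll = gmm_log_likelihood(X, gamma, mu, sigma2)
        if ll - prev_ll < tol:                     # stop when the log-likelihood stops improving
            break
        prev_ll = ll
    return gamma, mu, sigma2
```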

Expectation Maximization
This example used spherical Gaussians; the same algorithm extends to mixtures of general multivariate normal distributions (with full covariance matrices in place of the spherical variances).

Model Selection: how to pick k?
Bayesian Information Criterion (BIC)
BIC(D; \hat{\bar{\theta}}) = \underbrace{l(D; \hat{\bar{\theta}})}_{\text{log-likelihood}} \;-\; \underbrace{\tfrac{1}{2}\,\#\text{params}(\bar{\theta}) \cdot \ln n}_{\text{model complexity penalty}}, \qquad n = \text{number of training data points}

Here we'd want to maximize the BIC. It is sometimes defined as the negative of the above; in such cases, we want to minimize it.
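A sketch of using this criterion to pick k for a spherical GMM, assuming the em_gmm and gmm_log_likelihood sketches above and the parameter count k(d + 2) − 1 from the excerpt below:

```python
import numpy as np

def bic_spherical_gmm(X, k):
    """BIC score (to be maximized) for a k-component spherical GMM fit by EM."""
    n, d = X.shape
    gamma, mu, sigma2 = em_gmm(X, k)                 # fit by EM (sketch above)
    ll = gmm_log_likelihood(X, gamma, mu, sigma2)    # l(D; theta_hat)
    n_params = k * (d + 2) - 1                       # k*d means + k variances + (k-1) mixing weights
    return ll - 0.5 * n_params * np.log(n)

# choose the k with the highest BIC, e.g. over k = 1, ..., 6:
# best_k = max(range(1, 7), key=lambda k: bic_spherical_gmm(X, k))
```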

Model Selection for Mixtures: Bayesian Information Criterion (BIC) (excerpt from Jaakkola's notes)

... where the penalty term approximates the advantage that we would expect to get from larger k regardless of the data. The BIC score is easy to evaluate for each k as we already get l(D; \hat{\theta}) from the EM algorithm. All that we need in addition is to evaluate the penalty term. A mixture of k spherical Gaussians in d dimensions has exactly k(d + 2) - 1 parameters. Figure 1 below shows that the resulting BIC score indeed has the highest value for the correct 3-component mixture.

BIC is an asymptotic (large n) approximation to a statistically more well-founded criterion known as the Bayesian score. As such, BIC will select the right model (under certain regularity conditions) when n is (very) large relative to d. However, we will adopt the BIC criterion here even for smaller n due to its simplicity.

[Figure 1: 2, 3, and 4 component mixtures (G = 2, 3, 4) estimated for the same data. The corresponding log-likelihoods and BIC scores are shown below each plot: BIC(D; \hat{\theta}) = −131.16, −118.93, and −121.78 respectively, so the 3-component mixture scores highest. From Jaakkola.]

Bayesian Networks: Applications
Alexiou, A., et al. (2017). A Bayesian Model for the Prediction and Early Diagnosis of Alzheimer's Disease.

Bayesian Networks by Example
nodes: variables
directed edges: dependencies (here X_1 \to X_3 and X_2 \to X_3)
X_1 is a parent of X_3; X_3 is a child of X_1
intuitively, read an edge as "influences"
the conditional probability table for X_3 lists Pr(X_3 = T \mid x_1, x_2) and Pr(X_3 = F \mid x_1, x_2) for each setting of its parents; with binary parents it has 2 \times 2 = 4 rows
joint probability distribution: Pr(X_1, X_2, X_3) = Pr(X_1)\, Pr(X_2)\, Pr(X_3 \mid X_1, X_2)

Pr(X_3 = T \mid x_1, x_2) and Pr(X_3 = F \mid x_1, x_2), one row per combination of the parents' values (2 \times 2 = 4 rows).
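A tiny illustration of how this factorization is used to compute a joint probability. The CPT numbers below are made up for illustration; they are not values from the lecture:

```python
# Hypothetical CPTs for binary variables X1, X2, X3 (illustrative values only)
p_x1 = {True: 0.3, False: 0.7}
p_x2 = {True: 0.6, False: 0.4}
p_x3_true_given = {                     # Pr(X3 = T | x1, x2): 2 x 2 = 4 rows
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

def joint(x1, x2, x3):
    """Pr(X1 = x1, X2 = x2, X3 = x3) = Pr(X1) Pr(X2) Pr(X3 | X1, X2)."""
    p3 = p_x3_true_given[(x1, x2)] if x3 else 1.0 - p_x3_true_given[(x1, x2)]
    return p_x1[x1] * p_x2[x2] * p3

print(joint(True, False, True))         # 0.3 * 0.4 * 0.5 = 0.06
```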

Two notions of Independence
Marginal independence
Pr(X_1, X_2) = Pr(X_1)\, Pr(X_2), written X_1 \perp X_2

Conditional independence

Pr(X_1, X_2 \mid X_3) = Pr(X_1 \mid X_3)\, Pr(X_2 \mid X_3), written X_1 \perp X_2 \mid X_3

Alternatively, Pr(X_1 \mid X_2, X_3) = Pr(X_1 \mid X_3)
Bayesian Networks provide us a way to determine these via the dependency graph

d-separation: Inferring independence

Inferring independence properties
P ⊥ J | S?
Step 1: keep only “ancestral” graph of the variables of interest

Inferring independence properties
P ⊥ J | S?
Step 2: connect nodes with common child and change graph to undirected
* if multiple parents connect pairwise

Inferring independence properties
If all paths between variables of interest go through a particular node, then the variables are independent given that node
intuitively, we can say that that node "blocks" the influence of the first variable on the second
P ⊥ J | S?
If there is no path between variables of interest, then they are marginally independent
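A sketch of this ancestral-graph procedure as code. The helper below is hypothetical (not from the lecture); the graph is given as a dict mapping each node to its list of parents:

```python
from collections import deque

def independent_given(parents, xs, ys, zs):
    """Check whether X ⊥ Y | Z using the ancestral-graph / moralization procedure.

    parents: dict node -> list of parent nodes; xs, ys, zs: sets of node names.
    Returns True if every path from xs to ys in the moralized ancestral graph
    passes through zs, or if no such path exists.
    """
    # Step 1: keep only the "ancestral" graph of the variables of interest
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, []))
    # Step 2: connect parents with a common child (pairwise) and make the graph undirected
    adj = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in parents.get(child, []) if p in keep]
        for p in ps:
            adj[child].add(p); adj[p].add(child)
        for a in range(len(ps)):
            for b in range(a + 1, len(ps)):
                adj[ps[a]].add(ps[b]); adj[ps[b]].add(ps[a])
    # Step 3: if removing the conditioning nodes zs disconnects xs from ys, independence holds
    seen, frontier = set(xs), deque(xs)
    while frontier:
        v = frontier.popleft()
        if v in ys:
            return False                 # found a path that is not blocked by zs
        for u in adj[v]:
            if u not in seen and u not in zs:
                seen.add(u)
                frontier.append(u)
    return True

# Hypothetical graph for the question "P ⊥ J | S?": P -> J, S -> J, L -> S
# parents = {"J": ["P", "S"], "S": ["L"], "P": [], "L": []}
# independent_given(parents, {"P"}, {"J"}, {"S"})   # -> False (direct edge P -> J)
```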
