Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 4, 2015
Today:
Graphical models
Bayes Nets:
EM
Mixture of Gaussians clustering
Learning Bayes Net structure (Chow-Liu)
Readings:
Bishop chapter 8
Mitchell chapter 6
Learning of Bayes Nets
Four categories of learning problems
Graph structure may be known/unknown
Variable values may be fully observed / partly unobserved
Easy case: learn parameters when graph structure is known, and data is fully observed
Interesting case: graph known, data partly observed
Gruesome case: graph structure unknown, data partly unobserved
EM Algorithm Informally
EM is a general procedure for learning from partly observed data. Given observed variables X, unobserved Z (here X={F,A,H,N}, Z={S})
Begin with an arbitrary choice for parameters θ
Iterate until convergence:
E Step: estimate the values of unobserved Z, using θ
M Step: use observed values plus E-step estimates to derive a better θ
Guaranteed to find a local maximum. Each iteration increases P(X|θ)
EM Algorithm Precisely
EM is a general procedure for learning from partly observed data. Given observed variables X, unobserved Z (here X={F,A,H,N}, Z={S}). Define
Q(θ'|θ) = E[ log P(X,Z|θ') | X, θ ]
Iterate until convergence:
E Step: Use X and current θ to calculate P(Z|X,θ)
M Step: Replace current θ by θ ← argmax_{θ'} Q(θ'|θ)
Guaranteed to find a local maximum. Each iteration increases P(X|θ)
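The "each iteration increases P(X|θ)" claim follows from a standard one-line argument; the sketch below uses the usual Q-function notation (notation mine, not copied from the slide):

```latex
% Why an EM iteration cannot decrease the likelihood (standard argument; notation assumed).
\[
Q(\theta' \mid \theta) \;:=\; \mathbb{E}_{Z \sim P(Z \mid X, \theta)}\bigl[\log P(X, Z \mid \theta')\bigr],
\qquad
\log P(X \mid \theta') - \log P(X \mid \theta) \;\ge\; Q(\theta' \mid \theta) - Q(\theta \mid \theta).
\]
\[
\text{So } \theta^{(t+1)} = \arg\max_{\theta'} Q(\theta' \mid \theta^{(t)})
\;\Rightarrow\;
P(X \mid \theta^{(t+1)}) \;\ge\; P(X \mid \theta^{(t)}).
\]
```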
E Step: Use X, θ, to Calculate P(Z|X,θ)
observed X={F,A,H,N}, unobserved Z={S}
[Bayes net figure: Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose]
How? Bayes net inference problem.
let's use p(a,b) as shorthand for p(A=a, B=b)
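As a concrete illustration of that inference step, here is a minimal Python sketch that computes P(S=1 | f, a, h, n) by enumerating the Sinus variable; the CPT values are hypothetical, chosen only for illustration:

```python
# Minimal sketch: E-step inference P(S | F,A,H,N) by enumeration.
# The CPT values below are made up for illustration.
P_F = {1: 0.05, 0: 0.95}                       # P(Flu)
P_A = {1: 0.20, 0: 0.80}                       # P(Allergy)
P_S_given_FA = {                               # P(Sinus=1 | Flu, Allergy)
    (1, 1): 0.95, (1, 0): 0.80, (0, 1): 0.60, (0, 0): 0.05}
P_H_given_S = {1: 0.70, 0: 0.10}               # P(Headache=1 | Sinus)
P_N_given_S = {1: 0.80, 0: 0.05}               # P(Nose=1 | Sinus)

def joint(f, a, s, h, n):
    """P(F=f, A=a, S=s, H=h, N=n) from the factored Bayes net."""
    ps = P_S_given_FA[(f, a)] if s == 1 else 1 - P_S_given_FA[(f, a)]
    ph = P_H_given_S[s] if h == 1 else 1 - P_H_given_S[s]
    pn = P_N_given_S[s] if n == 1 else 1 - P_N_given_S[s]
    return P_F[f] * P_A[a] * ps * ph * pn

def p_sinus_given_obs(f, a, h, n):
    """E-step quantity: P(S=1 | f,a,h,n) = joint(s=1) / sum_s joint(s)."""
    num = joint(f, a, 1, h, n)
    return num / (num + joint(f, a, 0, h, n))

print(p_sinus_given_obs(f=1, a=0, h=1, n=1))   # about 0.998 with these CPTs
```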
EM and estimating θ
observed X = {F,A,H,N}, unobserved Z={S}
[Bayes net figure: Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose]
E step: Calculate P(Zk|Xk; θ) for each training example, k
M step: update all relevant parameters. For example:
θ_{S=1|F=i,A=j} = Σ_{k: f_k=i, a_k=j} P(S_k=1 | f_k, a_k, h_k, n_k; θ) / #D{F=i, A=j}
Recall the MLE (with S observed) was:
θ_{S=1|F=i,A=j} = #D{S=1, F=i, A=j} / #D{F=i, A=j}
EM and estimating θ
[Bayes net figure: Flu and Allergy are parents of Sinus; Sinus is the parent of Headache and Nose]
More generally: Given observed set X, unobserved set Z of boolean values
E step: Calculate, for each training example k, the expected value of each unobserved variable
M step:
Calculate θ estimates similar to MLE, but replacing each count by its expected count (a sketch follows below)
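A minimal sketch of one such expected-count update in Python, under the assumptions above (boolean variables, the Flu/Allergy/Sinus network); `examples` and `p_s_given_obs` are hypothetical stand-ins for the training data and the E-step inference routine (e.g., the enumeration sketch earlier):

```python
# One EM-style M-step update for theta_{S=1 | F=i, A=j}:
# replace the MLE count #{S=1, F=i, A=j} by its expected value under P(S | x, theta).

def m_step_theta_S(examples, p_s_given_obs, i, j):
    """theta_{S=1 | F=i, A=j} from expected counts."""
    expected_count = 0.0   # E[ #{S=1, F=i, A=j} ]
    denom = 0.0            # #{F=i, A=j} (F and A are observed, so this is a real count)
    for (f, a, h, n) in examples:
        if f == i and a == j:
            expected_count += p_s_given_obs(f, a, h, n)   # P(S_k=1 | x_k, theta)
            denom += 1.0
    return expected_count / denom if denom > 0 else 0.5   # fall back to a neutral guess

# Usage sketch: examples = [(1, 0, 1, 1), (0, 0, 0, 1), ...]  # observed (F,A,H,N) tuples
# theta = m_step_theta_S(examples, p_s_given_obs, i=1, j=0)
```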
Using Unlabeled Data to Help Train Naive Bayes Classifier
Learn P(Y|X)
[Naive Bayes net figure: Y is the parent of X1, X2, X3, X4]

Y  X1  X2  X3  X4
1  0   0   1   1
0  0   1   0   0
0  0   0   1   0
?  0   1   1   0
?  0   1   0   1
EM and estimating θ
Given observed set X, unobserved set Y of boolean values
E step: Calculate, for each training example k, the expected value of each unobserved variable Y:  E[Y_k] = P(Y_k=1 | X_k; θ)
M step: Calculate estimates similar to MLE, but replacing each count by its expected count
MLE would be:  θ_{Y=1} = #D{Y=1} / |D|,  θ_{Xi=1|Y=j} = #D{Xi=1, Y=j} / #D{Y=j}  (EM replaces each of these counts by its expected count from the E step; see the sketch below)
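Here is a minimal Python sketch of this semi-supervised EM for naive Bayes on the tiny table above; the initialization, iteration count, and absence of smoothing are simplifications for illustration, not the exact procedure from the lecture:

```python
# Sketch of EM for semi-supervised naive Bayes on the small table above.
# Boolean Y and X1..X4; rows with Y == None are unlabeled. Smoothing is omitted
# to keep the expected-count idea visible (a real implementation would add it).
data = [
    (1,    [0, 0, 1, 1]),
    (0,    [0, 1, 0, 0]),
    (0,    [0, 0, 1, 0]),
    (None, [0, 1, 1, 0]),
    (None, [0, 1, 0, 1]),
]

def e_step(row_x, prior, cond):
    """P(Y=1 | x) under the current naive Bayes parameters."""
    p1, p0 = prior, 1.0 - prior
    for i, xi in enumerate(row_x):
        p1 *= cond[1][i] if xi == 1 else 1.0 - cond[1][i]
        p0 *= cond[0][i] if xi == 1 else 1.0 - cond[0][i]
    return p1 / (p1 + p0)

def em(data, iters=20):
    prior = 0.5                                  # P(Y=1)
    cond = {y: [0.5] * 4 for y in (0, 1)}        # P(Xi=1 | Y=y)
    for _ in range(iters):
        # E step: expected label for every row (observed labels count as 0/1).
        q = [y if y is not None else e_step(x, prior, cond) for y, x in data]
        # M step: MLE-style updates with counts replaced by expected counts.
        n1 = sum(q)
        prior = n1 / len(data)
        for i in range(4):
            cond[1][i] = sum(qk * x[i] for qk, (_, x) in zip(q, data)) / n1
            cond[0][i] = sum((1 - qk) * x[i] for qk, (_, x) in zip(q, data)) / (len(data) - n1)
    return prior, cond

print(em(data))
```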
Experimental Evaluation
Newsgroup postings
20 newsgroups, 1000/group
Web page classification
student, faculty, course, project
4199 web pages
Reuters newswire articles
12,902 articles
90 topic categories
From [Nigam et al., 2000]
20 Newsgroups
word w ranked by P(w|Y=course) / P(w|Y≠course)
Using one labeled example per class
20 Newsgroups
Unsupervised clustering
Just an extreme case of EM, with zero labeled examples
Clustering
Given a set of data points, group them
Unsupervised learning
Which patients are similar? (or which earthquakes, customers, faces, web pages, ...)
Mixture Distributions
Model the joint distribution as a mixture of multiple distributions.
Use a discrete-valued random variable Z to indicate which distribution is being used for each random draw
So:  P(X) = Σ_i P(Z=i) P(X | Z=i)
Mixture of Gaussians:
Assume each data point X = <X1, ..., Xn> is generated by:
1. randomly choose Gaussian i, according to P(Z=i)
2. randomly generate a data point according to N(μ_i, Σ_i)  (a sampling sketch follows below)
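A minimal Python (numpy) sketch of this two-step generative process; the mixture weights, means, and shared variance below are made up:

```python
# Sketch of the mixture-of-Gaussians generative process described above.
# Mixture weights, means, and the shared spherical variance are made up.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])                   # P(Z=i) for i = 0, 1
mu = np.array([[0.0, 0.0], [4.0, 4.0]])     # mean of each Gaussian (2-D points)
sigma = 1.0                                 # shared standard deviation

def sample_point():
    z = rng.choice(len(pi), p=pi)           # 1. choose Gaussian i according to P(Z=i)
    x = rng.normal(loc=mu[z], scale=sigma)  # 2. draw x ~ N(mu_i, sigma^2 I)
    return z, x

for z, x in (sample_point() for _ in range(5)):
    print(z, x)
```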
Mixture of Gaussians
EM for Mixture of Gaussian Clustering
Let's simplify to make this easier:
1. assume X = <X1 ... Xn>, and the Xi are conditionally independent given Z
2. assume only 2 clusters (values of Z), and P(X | Z=j) = Π_i N(X_i; μ_ji, σ)
3. assume σ known; π_1 ... π_K and μ_1i ... μ_Ki unknown
Observed: X = <X1, X2, X3, X4>;  unobserved: Z
[Graphical model figure: Z is the parent of X1, X2, X3, X4]
EM
Given observed variables X, unobserved Z
Define Q(θ'|θ) = E[ log P(X,Z|θ') | X, θ ]
where the expectation over Z is taken using P(Z|X,θ)
[Graphical model figure: Z is the parent of X1, X2, X3, X4]
Iterate until convergence:
E Step: Calculate P(Z(n)|X(n),θ) for each example X(n). Use this to construct Q(θ'|θ)
M Step: Replace current θ by θ ← argmax_{θ'} Q(θ'|θ)
EM E Step
Calculate P(Z(n)|X(n),θ) for each observed example X(n) = <x1(n), x2(n), x3(n), x4(n)>; apply Bayes rule:
P(z(n)=j | x(n), θ) ∝ π_j Π_i N(x_i(n); μ_ji, σ)
[Graphical model figure: Z is the parent of X1, X2, X3, X4]
First consider the update for π
EM M Step
Maximizing Q(θ'|θ) over π, the terms involving μ have no influence, giving
π ← (1/N) Σ_n E[z(n)],   where E[z(n)] = P(z(n)=1 | x(n), θ) is the probability that z=1 for the nth example
[Graphical model figure: Z is the parent of X1, X2, X3, X4]
Now consider the update for μ_ji
EM M Step
The π term in Q(θ'|θ) has no influence on the μ_ji update, giving
μ_ji ← Σ_n E[z_j(n)] x_i(n) / Σ_n E[z_j(n)],   where E[z_j(n)] = P(z(n)=j | x(n), θ)
[Graphical model figure: Z is the parent of X1, X2, X3, X4]
Compare the above to the MLE if Z were observable:
μ_ji = Σ_n δ(z(n)=j) x_i(n) / Σ_n δ(z(n)=j)
EM putting it together
Given observed variables X, unobserved Z
Define Q(θ'|θ) = E[ log P(X,Z|θ') | X, θ ]
where the expectation over Z is taken using P(Z|X,θ)
[Graphical model figure: Z is the parent of X1, X2, X3, X4]
Iterate until convergence:
E Step: For each observed example X(n), calculate P(Z(n)|X(n),)
M Step: Update π and μ_ji using the expected values from the E step, replacing counts by expected counts (a full sketch follows below)
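Putting the E and M steps together for the simplified setting above (2 clusters, known shared σ, unknown π and means), here is a minimal numpy sketch; the toy data and initialization are assumptions for illustration:

```python
# EM for a 2-cluster mixture of spherical Gaussians with known sigma.
# pi = P(Z=1); mu[j] is the mean vector of cluster j. Toy data are made up.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])    # toy 2-D data
sigma = 1.0

def em_gmm(X, sigma, iters=50):
    N, d = X.shape
    pi = 0.5
    mu = np.array([X[0], X[-1]], dtype=float)           # init means from two data points
    for _ in range(iters):
        # E step: responsibility r[n] = P(Z(n)=1 | X(n), theta)
        def lik(m):   # unnormalized N(x; m, sigma^2 I); constant factors cancel
            return np.exp(-np.sum((X - m) ** 2, axis=1) / (2 * sigma ** 2))
        p1, p0 = pi * lik(mu[1]), (1 - pi) * lik(mu[0])
        r = p1 / (p1 + p0)
        # M step: update pi and the means using expected counts
        pi = r.mean()
        mu[1] = (r[:, None] * X).sum(axis=0) / r.sum()
        mu[0] = ((1 - r)[:, None] * X).sum(axis=0) / (1 - r).sum()
    return pi, mu

print(em_gmm(X, sigma))
```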
Mixture of Gaussians applet
Go to: http://www.socr.ucla.edu/htmls/SOCR_Charts.html then go to Line Charts → SOCR EM Mixture Chart
Try it with 2 Gaussian mixture components ("kernels")
Try it with 4
What you should know about EM
For learning from partly unobserved data
MLE of θ = argmax_θ log P(data | θ)
EM estimate: θ = argmax_θ E_{Z|X,θ}[ log P(X, Z | θ) ]
where X is the observed part of the data, Z is unobserved
Nice case is Bayes net of boolean vars:
M step is like MLE, with unobserved values replaced by their expected values, given the other observed values
EM for training Bayes networks
Can also develop MAP version of EM
Can also derive your own EM algorithm for your own problem:
write out the expression for E[ log P(X, Z | θ) ]
E step: for each training example Xk, calculate P(Zk | Xk, θ)
M step: choose new θ to maximize this expected log-likelihood
Learning Bayes Net Structure
How can we learn Bayes Net graph structure?
In general case, open problem
can require lots of data (else high risk of overfitting)
can use Bayesian methods to constrain search
One key result:
Chow-Liu algorithm: finds the best tree-structured network
What's "best"?
suppose P(X) is the true distribution, T(X) is our tree-structured network, where X = <X1, ..., Xn>
Chow-Liu minimizes the Kullback-Leibler divergence:
KL(P || T) = Σ_x P(x) log [ P(x) / T(x) ]
Chow-Liu Algorithm
Key result: To minimize KL(P || T), it suffices to find the tree network T that maximizes the sum of mutual informations over its edges
Mutual information for an edge between variables A and B:
I(A,B) = Σ_{a,b} P(a,b) log [ P(a,b) / (P(a) P(b)) ]
This works because, for tree networks with nodes X1 ... Xn:
KL(P || T) = - Σ_{(i,j) ∈ edges(T)} I(X_i, X_j) + Σ_i H(X_i) - H(X1, ..., Xn)
and only the first term depends on which tree we choose (derivation sketched below).
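A short derivation of that identity (standard argument; notation assumed, with pa(i) denoting the parent of X_i in T, and T's conditionals taken to be the corresponding conditionals of P):

```latex
% Why maximizing edge-wise mutual information minimizes KL(P || T) over trees.
\begin{align*}
\log T(x) &= \sum_i \log P(x_i \mid x_{\mathrm{pa}(i)})
  = \sum_i \log P(x_i)
    + \sum_{(i,\mathrm{pa}(i)) \in T} \log \frac{P(x_i, x_{\mathrm{pa}(i)})}{P(x_i)\,P(x_{\mathrm{pa}(i)})}, \\
\mathrm{KL}(P \,\|\, T) &= -H_P(X_1,\dots,X_n) - \mathbb{E}_P[\log T(X)]
  = \sum_i H_P(X_i) - H_P(X_1,\dots,X_n)
    - \sum_{(i,j) \in \mathrm{edges}(T)} I(X_i, X_j).
\end{align*}
% Only the last sum depends on which tree we pick, so minimizing KL(P || T)
% is the same as maximizing the total mutual information over the tree's edges.
```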
Chow-Liu Algorithm
1. for each pair of vars A,B, use the data to estimate P(A,B), P(A), P(B)
2. for each pair of vars A,B, calculate the mutual information I(A,B)
3. calculate the maximum spanning tree over the set of variables, using edge weights I(A,B)
(given N vars, this costs only O(N²) time)
4. add arrows to the edges to form a directed acyclic graph (e.g., pick any node as root and direct edges away from it)
5. learn the CPDs for this graph (a sketch of steps 1–3 follows below)
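A minimal Python sketch of steps 1–3 for boolean variables; the raw relative-frequency estimates and Kruskal-style spanning tree here are implementation choices for illustration, not necessarily what the lecture used:

```python
# Sketch of the Chow-Liu steps for boolean variables in a data set
# (rows = examples, columns = variables). Probability estimates are raw
# relative frequencies; a real implementation would smooth them.
import math
from itertools import combinations, product

def mutual_information(data, a, b):
    """I(A,B) from empirical P(A,B), P(A), P(B)."""
    n = len(data)
    mi = 0.0
    for va, vb in product((0, 1), repeat=2):
        p_ab = sum(1 for row in data if row[a] == va and row[b] == vb) / n
        p_a = sum(1 for row in data if row[a] == va) / n
        p_b = sum(1 for row in data if row[b] == vb) / n
        if p_ab > 0:
            mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data, num_vars):
    """Return maximum-spanning-tree edges under mutual-information weights."""
    edges = sorted(((mutual_information(data, a, b), a, b)
                    for a, b in combinations(range(num_vars), 2)), reverse=True)
    parent = list(range(num_vars))              # union-find for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, a, b in edges:                       # greedily add heaviest edges
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b, w))
    return tree                                 # then orient edges away from any chosen root

# Usage sketch with made-up boolean data over 4 variables:
data = [(0, 0, 1, 1), (1, 1, 1, 0), (0, 0, 0, 1), (1, 1, 0, 0), (0, 1, 1, 1)]
print(chow_liu_tree(data, num_vars=4))
```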
Chow-Liu algorithm example
Greedy Algorithm to find Max-Spanning Tree
[Figure: example graph with mutual-information edge weights; the greedy algorithm repeatedly adds the highest-weight edge that does not create a cycle]
[courtesy A. Singh, C. Guestrin]
Bayes Nets What You Should Know
Representation
Bayes nets represent the joint distribution as a DAG + conditional probability distributions
D-separation lets us decode conditional independence assumptions
Inference
NP-hard in general
For some graphs, closed form inference is feasible
Approximate methods too, e.g., Monte Carlo methods, ...
Learning
Easy for known graph, fully observed data (MLEs, MAP est.)
EM for partly observed data, known graph
Learning graph structure: Chow-Liu for tree-structured networks
Hardest when graph unknown, data incompletely observed