Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 21, 2015
Today:
Bayes Rule
Estimating parameters
MLE, MAP
some of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing, Carlos Guestrin. Thanks!
Readings:
Probability review
Bishop Ch. 1 thru 1.2.3; Bishop Ch. 2 thru 2.2
Andrew Moore's online tutorial
Announcements
Class is using Piazza for questions/discussions about homeworks, etc.
see class website for Piazza address
http://www.cs.cmu.edu/~ninamf/courses/601sp15/
Recitations Thursdays 7-8pm, Wean 5409; videos for future recitations (class website)
HW1 was accepted until Sunday 5pm for full credit
HW2 out today on class website, due in 1 week
HW3 will involve programming (in Octave)
P(A|B) = P(B|A) * P(A) / P(B)
we call P(A) the prior and P(A|B) the posterior
Bayes rule
"...by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter ... necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning."
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
Other Forms of Bayes Rule
P(A|B) = P(B|A) P(A) / P(B)

P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

P(A|B ∧ X) = P(B|A ∧ X) P(A ∧ X) / P(B ∧ X)
Applying Bayes Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]
A = you have the flu, B = you just coughed
Assume:
P(A) = 0.05 P(B|A) = 0.80 P(B| ~A) = 0.20
what is P(flu | cough) = P(A|B)?
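Working the numbers through the expanded form of Bayes rule above (a worked answer to the question posed on the slide):

\[
P(A \mid B) = \frac{0.80 \times 0.05}{0.80 \times 0.05 + 0.20 \times 0.95} = \frac{0.04}{0.23} \approx 0.17
\]

So even after a cough the posterior probability of flu is only about 0.17, because the prior P(A) = 0.05 is small.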
what does all this have to do with function approximation?
instead of F: X → Y, learn P(Y | X)
The Joint Distribution

Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variables → 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those probabilities must sum to 1.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

[Figure: area diagram of the joint distribution over A, B, C]

[A. Moore]
Using the Joint Distribution
Once you have the joint distribution (JD), you can ask for the probability of any logical expression E involving these variables:

P(E) = Σ_{rows matching E} P(row)
[A. Moore]
Using the Joint
P(Poor ∧ Male) = 0.4654

P(E) = Σ_{rows matching E} P(row)
[A. Moore]
Using the Joint
P(Poor) = 0.7604

P(E) = Σ_{rows matching E} P(row)
[A. Moore]
Inference with the Joint
P(E1 | E2) = P(E1 ∧ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)

P(Male | Poor) = 0.4654 / 0.7604 = 0.612
[A. Moore]
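A minimal sketch (not from the slides) of how the two formulas above can be evaluated in code, using the A, B, C joint distribution table given earlier; the function names prob and cond_prob and the example events are illustrative choices.

# joint[(a, b, c)] = P(A=a, B=b, C=c), copied from the truth table above
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    # P(E): sum P(row) over rows matching E; 'event' is a predicate on rows
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    # P(E1 | E2) = P(E1 and E2) / P(E2)
    return prob(lambda row: e1(row) and e2(row)) / prob(e2)

print(prob(lambda row: row[0] == 1))                                # P(A=1) = 0.50
print(cond_prob(lambda row: row[0] == 1, lambda row: row[1] == 1))  # P(A=1 | B=1) = 0.35/0.50 = 0.70

The Poor/Male numbers on the slides come from a larger joint distribution that appears only as a figure in the original slides.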
Learning and the Joint Distribution
Suppose we want to learn the function f: <G, H> → W
Solution: learn the joint distribution from data, then calculate P(W | G, H), e.g., P(W = rich | G = female, H = 40.5-) = ...
[A. Moore]
sounds like the solution to learning F: X → Y, or P(Y | X). Are we done?
Main problem: learning P(Y|X) can require more data than we have
consider learning Joint Dist. with 100 attributes
# of rows in this table?
# of people on earth?
fraction of rows with 0 training examples? (see the rough arithmetic below)
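A rough back-of-the-envelope answer to the three questions above (not on the slides; the world population figure is an approximation):

rows = 2 ** 100       # rows in a joint table over 100 Boolean attributes
people = 7e9          # roughly the number of people on earth
print(rows)           # about 1.3 x 10^30 rows
print(people / rows)  # about 5.5e-21, so essentially every row has 0 training examples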
What to do?
1. Be smart about how we estimate probabilities from sparse data
maximum likelihood estimates
maximum a posteriori estimates
2. Be smart about how to represent joint distributions
Bayes networks, graphical models
1. Be smart about how we estimate probabilities
Estimating Probability of Heads
Estimating θ = P(X=1)
Test A: 100 flips: 51 Heads (X=1), 49 Tails (X=0)
Test B: 3 flips: 2 Heads (X=1), 1 Tail (X=0)
Estimating θ = P(X=1)
Case C (online learning):
keep flipping, want a single learning algorithm that gives a reasonable estimate after each flip (see the sketch below)
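One natural sketch of such an algorithm (not from the slides, and anticipating the MAP/Beta-prior ideas later in the lecture): keep running counts of heads and tails and report a smoothed estimate after every flip. The smoothing counts beta1 and beta0 are illustrative hyperparameters.

def online_estimate(flips, beta1=1, beta0=1):
    # Yield an estimate of theta = P(X=1) after each observed flip (1 = heads, 0 = tails).
    # With beta1 = beta0 = 0 this is the maximum likelihood estimate; positive values
    # act like imaginary prior flips, which keeps the estimate sensible after few flips.
    n1 = n0 = 0
    for x in flips:
        if x == 1:
            n1 += 1
        else:
            n0 += 1
        yield (n1 + beta1) / (n1 + n0 + beta1 + beta0)

print(list(online_estimate([1, 0, 1, 1])))  # one estimate after each of the 4 flips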
Principles for Estimating Probabilities
Principle 1 (maximum likelihood):
choose parameters θ that maximize P(data | θ), e.g., θ_MLE = argmax_θ P(data | θ)
Principle 2 (maximum a posteriori prob.):
choose parameters θ that maximize P(θ | data), e.g., θ_MAP = argmax_θ P(θ | data)
Maximum Likelihood Estimation
P(X=1) = θ,  P(X=0) = (1-θ)
Data D:
Flips produce data D with α1 heads (X=1) and α0 tails (X=0)
flips are independent, identically distributed 1s and 0s (Bernoulli)
α1 and α0 are counts that summarize these outcomes (Binomial)
Maximum Likelihood Estimate for θ
[C. Guestrin]
hint: work with the log-likelihood ln P(D | θ), which has the same maximizer, and set its derivative with respect to θ to zero
Summary: Maximum Likelihood Estimate
P(X=1) = θ,  P(X=0) = 1-θ  (Bernoulli)
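The derivation on the original slides appears only as images; a standard reconstruction using the counts α1 (heads) and α0 (tails) defined above:

\[
P(D \mid \theta) = \theta^{\alpha_1}(1-\theta)^{\alpha_0},
\qquad
\ln P(D \mid \theta) = \alpha_1 \ln\theta + \alpha_0 \ln(1-\theta)
\]
\[
\frac{\partial}{\partial \theta} \ln P(D \mid \theta)
  = \frac{\alpha_1}{\theta} - \frac{\alpha_0}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{MLE} = \frac{\alpha_1}{\alpha_1 + \alpha_0}
\]

For Test A this gives 51/100 = 0.51; for Test B it gives 2/3 ≈ 0.67.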
Principles for Estimating Probabilities
Principle 1 (maximum likelihood):
choose parameters θ that maximize P(data | θ)
Principle 2 (maximum a posteriori prob.): choose parameters θ that maximize
P(θ | data) = P(data | θ) P(θ) / P(data)
Beta prior distribution P(θ)
[C. Guestrin]
and the MAP estimate is therefore:
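The Beta prior and the resulting MAP estimate also appear only as images on the slides; a standard reconstruction, assuming a Beta(β1, β0) prior over θ (the slides' exact hyperparameter symbols may differ):

\[
P(\theta) = \mathrm{Beta}(\beta_1, \beta_0)
  = \frac{\theta^{\beta_1 - 1}(1-\theta)^{\beta_0 - 1}}{B(\beta_1, \beta_0)}
\]
\[
P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)
  \propto \theta^{\alpha_1 + \beta_1 - 1}(1-\theta)^{\alpha_0 + \beta_0 - 1},
\quad\text{i.e., } P(\theta \mid D) = \mathrm{Beta}(\alpha_1 + \beta_1,\ \alpha_0 + \beta_0)
\]
\[
\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D)
  = \frac{\alpha_1 + \beta_1 - 1}{\alpha_1 + \beta_1 + \alpha_0 + \beta_0 - 2}
\]

The prior acts like β1 - 1 extra imagined heads and β0 - 1 extra imagined tails added to the observed counts.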
Some terminology
Likelihood function: P(data | θ)
Prior: P(θ)
Posterior: P(θ | data)
Conjugate prior: P(θ) is the conjugate prior for likelihood function P(data | θ) if the forms of P(θ) and P(θ | data) are the same (e.g., the Beta prior is conjugate to the binomial likelihood used above).
You should know
Probability basics
random variables, conditional probs,
Bayes rule
Joint probability distributions
calculating probabilities from the joint distribution
Estimating parameters from data
maximum likelihood estimates
maximum a posteriori estimates
distributions: binomial, Beta, Dirichlet; conjugate priors
Extra slides
Independent Events
Definition: two events A and B are independent if P(A ^ B)=P(A)*P(B)
Intuition: knowing A tells us nothing about the value of B (and vice versa)
[Figure: picture of A independent of B]
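A quick numeric instance of the definition (added for concreteness, not on the slide): let A = "first of two fair, independent coin flips is heads" and B = "second flip is heads". Then

\[
P(A) = P(B) = 0.5, \qquad P(A \wedge B) = 0.25 = P(A)\, P(B)
\]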
Expected values
Given a discrete random variable X, the expected value of X, written E[X], is
E[X] = Σ_i x_i P(X = x_i)
Example:
X P(X)
0 0.3
1 0.2
2 0.5
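Working the example through the definition above:

\[
E[X] = 0(0.3) + 1(0.2) + 2(0.5) = 1.2
\]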
Expected values
Given a discrete random variable X, the expected value of X, written E[X], is
E[X] = Σ_i x_i P(X = x_i)
We can also talk about the expected value of functions of X:
E[f(X)] = Σ_i f(x_i) P(X = x_i)
Covariance
Given two discrete random variables X and Y, we define the covariance of X and Y as
Cov(X, Y) = E[ (X - E[X]) (Y - E[Y]) ]
e.g., X=gender, Y=playsFootball or X=gender, Y=leftHanded