Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 26, 2015
Today:
BayesClassifiers
ConditionalIndependence NaiveBayes
Readings: Mitchell:
Naive Bayes and Logistic Regression
(available on class website) 
Two Principles for Estimating Parameters
MaximumLikelihoodEstimate(MLE):choosethat maximizes probability of observed data
MaximumaPosteriori(MAP)estimate:choosethat is most probable given prior probability and the data 
Maximum Likelihood Estimate
X=1 X=0 P(X=1) = 
P(X=0) = 1- (Bernoulli) 
Maximum A Posteriori (MAP) Estimate
X=1 X=0 
Lets learn classifiers by learning P(Y|X)
Consider Y=Wealth, X=
 Gender HrsWorked P(rich | G,HW) P(poor | G,HW) 
 F <40.5 .09 .91 F >40.5 .21 .79 
 M <40.5 .23 .77 M >40.5 .38 .62
How many parameters must we estimate?
Suppose X =
 where Xi and Y are boolean RVs 
 To estimate P(Y| X1, X2,  Xn) 
 If we have 30 boolean Xis: P(Y | X1, X2,  X30)
Bayes Rule
Which is shorthand for:
Equivalently: 
Can we reduce params using Bayes Rule?
Suppose X =
 where Xi and Y are boolean RVs 
 How many parameters to define P(X1, Xn | Y)? 
 How many parameters to define P(Y)?
Naive Bayes
Naive Bayes assumes
i.e., that Xi and Xj are conditionally independent given Y, for all i=j 
Conditional Independence
Definition: X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z
Which we often write E.g., 
Naive Bayes uses assumption that the Xi are conditionally independent, given Y. E.g.,
Given this assumption, then: 
Naive Bayes uses assumption that the Xi are conditionally independent, given Y. E.g.,
Given this assumption, then:
in general: 
Naive Bayes uses assumption that the Xi are conditionally independent, given Y. E.g.,
Given this assumption, then:
in general:
How many parameters to describe P(X1Xn|Y)? P(Y)? Withoutconditionalindepassumption?
Withconditionalindepassumption? 
Naive Bayes in a Nutshell
Bayes rule:
Assuming conditional independence among Xis:
So, to pick most probable Y for Xnew = < X1, …, Xn > 
Naive Bayes Algorithm  discrete Xi
Train Naive Bayes (examples) for each* value yk
estimate
for each* value xij of each attribute Xi
estimate Classify (Xnew)
* probabilities must sum to 1, so need estimate only n-1 of these 
Estimating Parameters: Y, Xi discrete-valued Maximum likelihood estimates (MLEs):
Number of items in dataset D for which Y=yk 
Example: Live in Sq Hill? P(S|G,D,B)
S=1 iff live in Squirrel Hill D=1 iff Drive or carpool to CMU
G=1 iff shop at SH Giant Eagle B=1 iff Birthday is before July 1
What probability parameters must we estimate? 
Example: Live in Sq Hill? P(S|G,D,E)
S=1 iff live in Squirrel Hill
G=1 iff shop at SH Giant Eagle
P(S=1) :
P(D=1 | S=1) :
P(D=1 | S=0) :
P(G=1 | S=1) :
P(G=1 | S=0) : P(G=0 | S=0) : P(B=1 | S=1) :
P(B=1 | S=0) :
D=1 iff Drive or Carpool to CMU B=1 iff Birthday is before July 1
P(S=0) : P(D=0 | S=1) : P(D=0 | S=0) : P(G=0 | S=1) :
P(B=0 | S=1) : P(B=0 | S=0) : 
Naive Bayes: Subtlety #1
Often the Xi are not really conditionally independent
We use Naive Bayes in many cases anyway, and it often works pretty well
often the right classification, even when not the right probability (see [Domingos&Pazzani, 1996])
What is effect on estimated P(Y|X)?  Extremecase:whatifweaddtwocopies:Xi =Xk 
Extremecase:whatifweaddtwocopies:Xi =Xk
Naive Bayes: Subtlety #2
If unlucky, our MLE estimate for P(Xi | Y) might be zero. (for example, Xi = birthdate. Xi = Jan_25_1992)
Why worry about just one parameter out of many?
What can be done to address this? 
Estimating Parameters
MaximumLikelihoodEstimate(MLE):choosethat maximizes probability of observed data
MaximumaPosteriori(MAP)estimate:choosethat is most probable given prior probability and the data 
Estimating Parameters: Y, Xi discrete-valued Maximum likelihood estimates:
MAP estimates (Beta, Dirichlet priors):
Only difference: imaginary examples 
Learning to classify text documents
Classifywhichemailsarespam?
Classifywhichemailspromiseanattachment?
Classifywhichwebpagesarestudenthomepages?
How shall we represent text documents for Naive Bayes? 
Baseline: Bag of Words Approach
aardvark 0 about 2 all 2 Africa 1 apple 0 anxious 0 
gas 1 
oil 1 
Zaire 0 
Learning to classify document: P(Y|X) the Bag of Words model
Y discrete valued. e.g., Spam or not
X=
 Xi is a random variable describing the word at position i in the document 
 possible values for Xi : any word wk in English 
 Document=bagofwords:thevectorofcountsforall wks 
 like #heads, #tails, but we have many more than 2 values 
 assume word probabilities are position independent (i.i.d. rolls of a 50,000-sided die)
Naive Bayes Algorithm  discrete Xi
TrainNaiveBayes(examples) for each value yk
estimate
for each value xj of each attribute Xi
estimate Classify(Xnew)
prob that word xj appears in position i, given Y=yk
* Additional assumption: word probabilities are position independent 
MAP estimates for bag of words
Map estimate for multinomial
What s should we choose? 
For code and data, see
www.cs.cmu.edu/~tom/mlbook.html
click on Software and Data 
What you should know:
Training and using classifiers based on Bayes rule
Conditional independence What it is
Why its important
Naive Bayes What it is
Why we use it so much
Training using MLE, MAP estimates
Discrete variables and continuous (Gaussian) 
Questions:
HowcanweextendNaiveBayesifjust2oftheXis are dependent?
WhatdoesthedecisionsurfaceofaNaiveBayes classifier look like?
WhaterrorwilltheclassifierachieveifNaiveBayes assumption is satisfied and we have infinite training data?
CanyouuseNaiveBayesforacombinationof discrete and real-valued Xi? 

![[SOLVED] CS algorithm Machine Learning 10-601](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip.jpg)

![[SOLVED] Naughty Receiver - Reliable Data Transfer](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip-1200x1200.jpg)
 
 
 
Reviews
There are no reviews yet.