Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 28, 2015
Today:
Naive Bayes
discrete-valued Xis
Document classification
Gaussian Naive Bayes
real-valued Xis
Brain image classification
Readings: Required:
Mitchell: Naive Bayes and Logistic Regression
(available on class website)
Optional
Bishop 1.2.4
Bishop 4.2
Recently:
Bayes classifiers to learn P(Y|X)
MLE and MAP estimates for parameters of P
Conditional independence
Naive Bayes makes Bayesian learning practical
Next:
Text classification
Naive Bayes and continuous variables Xi:
Gaussian Naive Bayes classifier
Learn P(Y|X) directly
Logistic regression, Regularization, Gradient ascent
Naive Bayes or Logistic Regression?
Generative vs. Discriminative classifiers
Naive Bayes in a Nutshell
Bayes rule:
  P(Y=yk | X1, ..., Xn) = P(Y=yk) P(X1, ..., Xn | Y=yk) / Σj P(Y=yj) P(X1, ..., Xn | Y=yj)

Assuming conditional independence among the Xi:
  P(Y=yk | X1, ..., Xn) = P(Y=yk) Πi P(Xi | Y=yk) / Σj P(Y=yj) Πi P(Xi | Y=yj)

So, the classification rule for Xnew = < X1, ..., Xn > is:
  Ynew ← arg max_yk  P(Y=yk) Πi P(Xi_new | Y=yk)
Example: Live in Sq Hill? P(S|G,D,B)
S=1 iff live in Squirrel Hill
G=1 iff shop at SH Giant Eagle
D=1 iff Drive or Carpool to CMU
B=1 iff Birthday is before July 1

Parameters to estimate:
P(S=1) :          P(S=0) :
P(D=1 | S=1) :    P(D=0 | S=1) :
P(D=1 | S=0) :    P(D=0 | S=0) :
P(G=1 | S=1) :    P(G=0 | S=1) :
P(G=1 | S=0) :    P(G=0 | S=0) :
P(B=1 | S=1) :    P(B=0 | S=1) :
P(B=1 | S=0) :    P(B=0 | S=0) :

Tom: D=1, G=0, B=0

P(S=1 | D=1, G=0, B=0) =
  P(S=1) P(D=1|S=1) P(G=0|S=1) P(B=0|S=1)
  / [ P(S=1) P(D=1|S=1) P(G=0|S=1) P(B=0|S=1) + P(S=0) P(D=1|S=0) P(G=0|S=0) P(B=0|S=0) ]
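The probability estimates above are left blank on the slide. As a minimal illustrative sketch (not from the slides), the computation can be carried out as follows, using made-up parameter values:

```python
# Hypothetical parameter estimates (the slide leaves these blank; values below are made up)
p_s1 = 0.30                      # P(S=1): prior prob. of living in Squirrel Hill
p_d1_given = {1: 0.70, 0: 0.80}  # P(D=1 | S=s)
p_g1_given = {1: 0.60, 0: 0.10}  # P(G=1 | S=s)
p_b1_given = {1: 0.50, 0: 0.50}  # P(B=1 | S=s)

def joint(s, d, g, b):
    """Naive Bayes joint: P(S=s) * P(D=d|S=s) * P(G=g|S=s) * P(B=b|S=s)."""
    prior = p_s1 if s == 1 else 1 - p_s1
    pd = p_d1_given[s] if d == 1 else 1 - p_d1_given[s]
    pg = p_g1_given[s] if g == 1 else 1 - p_g1_given[s]
    pb = p_b1_given[s] if b == 1 else 1 - p_b1_given[s]
    return prior * pd * pg * pb

# Tom: D=1, G=0, B=0
numerator = joint(1, 1, 0, 0)
denominator = joint(1, 1, 0, 0) + joint(0, 1, 0, 0)
print("P(S=1 | D=1, G=0, B=0) =", numerator / denominator)
```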
Another way to view Naive Bayes (Boolean Y):

  P(Y=1) Πi P(Xi | Y=1)  /  P(Y=0) Πi P(Xi | Y=0)

Decision rule: is this quantity greater or less than 1?
Naive Bayes: classifying text documents
Classify which emails are spam?
Classify which emails promise an attachment?
How shall we represent text documents for Naive Bayes?
Learning to classify documents: P(Y|X)
Y discrete valued. e.g., Spam or not
X = < X1, X2, ..., Xn >
Xi is a random variable describing ...
Answer 1: Xi is boolean, 1 if word i is in document, else 0
e.g., Xpleased = 1
Issues?
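As a small sketch of Answer 1 (not from the slides), one way to build this boolean representation in Python, assuming a small hypothetical vocabulary:

```python
# Hypothetical vocabulary; in practice this would be the full word list (e.g., ~50,000 words)
vocabulary = ["am", "attachment", "i", "pleased", "spam", "viagra"]

def boolean_features(document: str) -> dict:
    """Xi = 1 if word i appears anywhere in the document, else 0."""
    words = set(document.lower().split())
    return {word: int(word in words) for word in vocabulary}

features = boolean_features("I am pleased to send the attachment")
print(features["pleased"])   # X_pleased = 1
print(features["viagra"])    # X_viagra  = 0
```

One issue this representation raises: it ignores how many times a word occurs and where, which motivates the word-position / bag-of-words view that follows.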
Answer 2:
Xi represents the ith word position in the document
X1 = "I", X2 = "am", X3 = "pleased"
and, let's assume the Xi are iid (independent, identically distributed)
Learning to classify documents: P(Y|X), the Bag of Words model
Y discrete valued. e.g., Spam or not
X=
Xi are iid random variables. Each represents the word at its position i in the document
Generating a document according to this distribution = rolling a 50,000-sided die, once for each word position in the document
The observed counts for each word follow a ??? distribution
Multinomial Distribution
Multinomial Bag of Words
aardvark   0
about      2
all        2
Africa     1
apple      0
anxious    0
...
gas        1
...
oil        1
...
Zaire      0
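As a minimal sketch (not from the slides), the bag-of-words counts and the corresponding multinomial likelihood can be computed as follows; the vocabulary and word probabilities below are made up for illustration:

```python
from collections import Counter
from math import factorial, prod

# Hypothetical vocabulary and per-word probabilities theta_w = P(word w | Y=yk); must sum to 1
vocabulary = ["about", "africa", "all", "gas", "oil", "zaire"]
theta = {"about": 0.3, "africa": 0.1, "all": 0.3, "gas": 0.1, "oil": 0.1, "zaire": 0.1}

def word_counts(document: str) -> Counter:
    """Bag of words: count how often each vocabulary word occurs, ignoring position."""
    words = [w for w in document.lower().split() if w in vocabulary]
    return Counter(words)

def multinomial_likelihood(counts: Counter) -> float:
    """P(counts | theta) = n! / (prod_w c_w!) * prod_w theta_w^c_w, with n = total word count."""
    n = sum(counts.values())
    coef = factorial(n) / prod(factorial(c) for c in counts.values())
    return coef * prod(theta[w] ** c for w, c in counts.items())

counts = word_counts("oil gas all about all about Africa")
print(counts)
print(multinomial_likelihood(counts))
```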
MAP estimates for bag of words
MAP estimate for multinomial
What prior values should we choose?
Naive Bayes Algorithm (discrete Xi)

Train_Naive_Bayes(examples)
  for each value yk
    estimate πk ≡ P(Y = yk)
    for each value xij of each attribute Xi
      estimate θijk ≡ P(Xi = xij | Y = yk)
        (prob that word xij appears in position i, given Y = yk)*

Classify(Xnew)
  Ynew ← arg max_yk  πk Πi P(Xi_new | Y = yk)

* Additional assumption: word probabilities are position independent
For code and data, see
www.cs.cmu.edu/~tom/mlbook.html
click on Software and Data
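The pointer above is to the book's own code; the following is only a minimal sketch of the algorithm under the assumptions above, using simple add-one ("hallucinated example") smoothing for the MAP estimates, with names of my own choosing:

```python
from collections import defaultdict
from math import log

def train_naive_bayes(examples, smoothing=1.0):
    """examples: list of (list_of_words, label) pairs.
    Smoothed (MAP-style) estimate: P(w|y) = (count(w,y) + smoothing) / (total_words(y) + smoothing*|V|)."""
    vocab = {w for words, _ in examples for w in words}
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for words, y in examples:
        class_counts[y] += 1
        for w in words:
            word_counts[y][w] += 1
    priors = {y: c / len(examples) for y, c in class_counts.items()}
    word_probs = {}
    for y in class_counts:
        total = sum(word_counts[y].values())
        word_probs[y] = {w: (word_counts[y][w] + smoothing) / (total + smoothing * len(vocab))
                         for w in vocab}
    return priors, word_probs

def classify(words, priors, word_probs):
    """Return arg max_y [ log P(y) + sum_i log P(word_i | y) ]; words outside the vocabulary are skipped."""
    def score(y):
        return log(priors[y]) + sum(log(word_probs[y][w]) for w in words if w in word_probs[y])
    return max(priors, key=score)

examples = [("buy cheap oil now".split(), "spam"),
            ("meeting about the project".split(), "ham")]
priors, word_probs = train_naive_bayes(examples)
print(classify("cheap oil".split(), priors, word_probs))  # -> 'spam'
```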
What if we have continuous Xi?
E.g., image classification: Xi is the real-valued ith pixel
Naive Bayes requires P(Xi | Y=yk), but Xi is real (continuous)
Common approach: assume P(Xi | Y=yk) follows a Normal (Gaussian) distribution
Gaussian Distribution (also called Normal)

  p(x) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) )

p(x) is a probability density function, whose integral (not sum) is 1
What if we have continuous Xi?
Gaussian Naive Bayes (GNB): assume
  P(Xi = x | Y = yk) = N(x; μik, σik), a Gaussian with class-conditional mean μik and standard deviation σik

Sometimes assume the variance is independent of Y (i.e., σi), or independent of Xi (i.e., σk), or both (i.e., σ)
Gaussian Naive Bayes Algorithm: continuous Xi (but still discrete Y)

Train_Naive_Bayes(examples)
  for each value yk
    estimate* πk ≡ P(Y = yk)
    for each attribute Xi
      estimate the class conditional mean μik and variance σik²

Classify(Xnew)
  Ynew ← arg max_yk  πk Πi N(Xi_new; μik, σik)

* these probabilities must sum to 1, so we need to estimate only n-1 of them
Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates:
  μik = ( Σj Xij δ(Yj = yk) ) / ( Σj δ(Yj = yk) )
  σik² = ( Σj (Xij - μik)² δ(Yj = yk) ) / ( Σj δ(Yj = yk) )

where j indexes the jth training example, i the ith feature, k the kth class, and δ(Yj = yk) = 1 if (Yj = yk), else 0

How many parameters must we estimate for Gaussian Naive Bayes if Y has k possible values, and X = < X1, ..., Xn >?
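A minimal sketch (not from the slides) of Gaussian Naive Bayes training and classification using the MLE mean and variance above; all names are my own:

```python
from math import log, pi

def train_gnb(X, y):
    """X: list of real-valued feature vectors, y: list of class labels.
    Estimates pi_k = P(Y=yk) and the class-conditional mean/variance of each feature."""
    params = {}
    for k in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == k]
        prior = len(rows) / len(X)
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows)
                     for col, m in zip(zip(*rows), means)]
        params[k] = (prior, means, variances)
    return params

def gaussian_log_pdf(x, mu, var):
    """log N(x; mu, sigma^2)."""
    return -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)

def classify_gnb(x, params):
    """arg max_k [ log pi_k + sum_i log N(x_i; mu_ik, sigma_ik^2) ]."""
    def score(k):
        prior, means, variances = params[k]
        return log(prior) + sum(gaussian_log_pdf(xi, m, v)
                                for xi, m, v in zip(x, means, variances))
    return max(params, key=score)

X = [[1.0, 2.1], [0.9, 1.9], [3.0, 3.9], [3.2, 4.1]]
y = ["a", "a", "b", "b"]
params = train_gnb(X, y)
print(classify_gnb([1.1, 2.0], params))  # -> 'a'
```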
GNB Example: Classify a person's cognitive state, based on a brain image
  reading a sentence or viewing a picture?
  reading the word describing a Tool or a Building?
  answering the question, or getting confused?
Mean activations over all training examples for Y=bottle
Y is the mental state (reading house or bottle) Xi are the voxel activities,
this is a plot of the μi's defining P(Xi | Y=bottle)
[color scale: fMRI activation, from below average to average to high]
Classification task: is person viewing a tool or building?
[Figure: per-participant classification accuracy (scale 0 to 1) for participants p1-p12; accuracies marked where statistically significant, p<0.05]

Where is information encoded in the brain?
Accuracies of cubical 27-voxel classifiers centered at each significant voxel [0.7-0.8]

Naive Bayes: What you should know
Designing classifiers based on Bayes rule
Conditional independence
  What it is
  Why it's important
Naive Bayes assumption and its consequences
  Which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y)), and why this matters
How to train Naive Bayes classifiers
  MLE and MAP estimates
  with discrete and/or continuous inputs Xi

Questions to think about:
  Can you use Naive Bayes for a combination of discrete and real-valued Xi?
  How can we easily model just 2 of n attributes as dependent?
  What does the decision surface of a Naive Bayes classifier look like?
  How would you select a subset of the Xi's?

Simple Picture for GNB for P(Y|X1)