Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
February 2, 2015
Today:
Logistic regression
Generative/Discriminative classifiers
Readings (see class website):
Required: Mitchell, Naive Bayes and Logistic Regression
Optional: Ng & Jordan
Announcements
HW3 due Wednesday Feb 4
HW4 will be handed out next Monday Feb 9
new reading available:
Estimating Probabilities: MLE and MAP (Mitchell)
see Lecture tab of class website
required reading for today:
Naive Bayes and Logistic Regression (Mitchell)
Gaussian Naive Bayes Big Picture
Example: Y = PlayBasketball (boolean), X1 = Height, X2 = MLgrade; assume P(Y=1) = 0.5
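To make the big picture concrete, here is a minimal sketch of how a Gaussian Naive Bayes model would score this example. Every parameter value below is made up for illustration only; nothing here is taken from the slides.

```python
from scipy.stats import norm

# Hypothetical class-conditional Gaussians P(Xi | Y=yk) = N(mu_ik, sigma_ik)
# and prior P(Y=1) = 0.5; all numbers are illustrative, not from the lecture.
params = {
    1: {"height": (72.0, 4.0), "ml_grade": (85.0, 8.0)},  # Y=1: plays basketball
    0: {"height": (67.0, 4.0), "ml_grade": (88.0, 8.0)},  # Y=0: does not
}
prior = {1: 0.5, 0: 0.5}

def p_plays(height, ml_grade):
    """P(Y=1 | X) via Bayes rule with the Naive Bayes factorization of P(X|Y)."""
    joint = {}
    for y in (0, 1):
        mu_h, s_h = params[y]["height"]
        mu_g, s_g = params[y]["ml_grade"]
        joint[y] = prior[y] * norm.pdf(height, mu_h, s_h) * norm.pdf(ml_grade, mu_g, s_g)
    return joint[1] / (joint[0] + joint[1])

print(p_plays(70, 86))  # posterior probability that PlayBasketball = 1
```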
Logistic Regression
Idea:
Naive Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
Why not learn P(Y|X) directly?
Consider learning f: X → Y, where
X is a vector of real-valued features, < X1 … Xn >
Y is boolean
assume all Xi are conditionally independent given Y
model P(Xi | Y = yk) as Gaussian N(μik, σi)
model P(Y) as Bernoulli(π)
What does that imply about the form of P(Y|X)?
Derive form for P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi
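The worked derivation is not reproduced in this text dump; in outline, the standard steps (following the required Mitchell reading) are:

```latex
\begin{aligned}
P(Y=1\mid X)
  &= \frac{P(Y=1)\,P(X\mid Y=1)}{P(Y=1)\,P(X\mid Y=1)+P(Y=0)\,P(X\mid Y=0)}\\[2pt]
  &= \frac{1}{1+\exp\!\Big(\ln\tfrac{1-\pi}{\pi}
        +\sum_i \ln\tfrac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}\Big)}
   = \frac{1}{1+\exp\!\big(w_0+\textstyle\sum_i w_i X_i\big)},\\[6pt]
\text{with}\quad
  w_i &= \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2},\qquad
  w_0 = \ln\frac{1-\pi}{\pi}+\sum_i\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2},
\end{aligned}
```

where the last step plugs in P(Xi | Y=yk) = N(μik, σi) with the shared variance σi, so that each log-ratio term is linear in Xi.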
Very convenient!
P(Y=1 | X=<X1, …, Xn>) = 1 / (1 + exp(w0 + Σi wi Xi))
implies
P(Y=0 | X=<X1, …, Xn>) = exp(w0 + Σi wi Xi) / (1 + exp(w0 + Σi wi Xi))
implies
P(Y=0|X) / P(Y=1|X) = exp(w0 + Σi wi Xi)
implies
ln [ P(Y=0|X) / P(Y=1|X) ] = w0 + Σi wi Xi
→ linear classification rule!
Logistic function
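The logistic-function plot from this slide is not reproduced here. As a small stand-in, a sketch of the logistic (sigmoid) function and the classification rule it induces, keeping the sign convention of the GNB-derived form above:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The GNB-derived form above is P(Y=1|X) = 1 / (1 + exp(w0 + sum_i wi*Xi)),
# i.e. sigmoid(-(w0 + w.x)); its decision boundary is the hyperplane w0 + w.x = 0.
def p_y1_given_x(x, w, w0):
    return sigmoid(-(w0 + np.dot(w, x)))

def classify(x, w, w0):
    # Predict Y=1 exactly when P(Y=1|X) > 0.5, i.e. when w0 + w.x < 0 (a linear rule).
    return int(w0 + np.dot(w, x) < 0)
```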
Logistic regression more generally
Logistic regression when Y not boolean (but still discrete-valued).
Now y ∈ {y1, …, yR}: learn R-1 sets of weights.
For k < R:
P(Y=yk|X) = exp(wk0 + Σi wki Xi) / (1 + Σj=1..R-1 exp(wj0 + Σi wji Xi))
For k = R (the reference class):
P(Y=yR|X) = 1 / (1 + Σj=1..R-1 exp(wj0 + Σi wji Xi))
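A minimal sketch of this R-class form (R-1 weight vectors, with the last class y_R acting as the reference class that gets the leftover probability mass); the function name and array shapes are assumptions for illustration:

```python
import numpy as np

def multiclass_probs(x, W, w0):
    """P(Y=y_k | X=x) for k = 1..R, given R-1 weight vectors.

    W  : array of shape (R-1, n), one weight vector per non-reference class
    w0 : array of shape (R-1,), the corresponding intercepts
    The last class y_R is the reference class and has no weights of its own.
    """
    scores = np.exp(w0 + W @ x)        # exp(w_k0 + sum_i w_ki x_i) for k < R
    denom = 1.0 + scores.sum()         # shared normalizer
    return np.append(scores / denom,   # P(Y=y_k | X) for k = 1..R-1
                     1.0 / denom)      # P(Y=y_R | X), the leftover class
```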
Training data D = { <X^1, Y^1>, …, <X^L, Y^L> }
Data likelihood = Πl P(X^l, Y^l | W)
Data conditional likelihood = Πl P(Y^l | X^l, W)
Expressing Conditional Log Likelihood
Maximizing Conditional Log Likelihood
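Written out, the expression these two slides build up (with Y^l ∈ {0,1}, and assuming the sign convention P(Y^l=1|X^l,W) = exp(w0 + Σi wi Xi^l) / (1 + exp(w0 + Σi wi Xi^l)); with the opposite convention the roles of Y^l and 1−Y^l swap):

```latex
\begin{aligned}
\ell(W) &= \sum_l \ln P(Y^l \mid X^l, W)
         = \sum_l \Big[ Y^l \ln \hat P(Y^l{=}1\mid X^l,W)
                      + (1-Y^l)\ln \hat P(Y^l{=}0\mid X^l,W) \Big]\\[2pt]
        &= \sum_l \Big[ Y^l \big(w_0+\textstyle\sum_i w_i X_i^l\big)
                      - \ln\big(1+\exp(w_0+\textstyle\sum_i w_i X_i^l)\big) \Big]
\end{aligned}
```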
Good news: l(W) is concave function of W
Bad news: no closed-form solution to maximize l(W)
Gradient Descent:
Batch gradient: use error over entire training set D
Do until satisfied:
1. Compute the gradient ∇ED[W]
2. Update the vector of parameters: W ← W − η ∇ED[W]

Stochastic gradient: use error over single examples
Do until satisfied:
1. Choose (with replacement) a random training example <X^l, Y^l>
2. Compute the gradient just for <X^l, Y^l>: ∇El[W]
3. Update the vector of parameters: W ← W − η ∇El[W]

Stochastic approximates Batch arbitrarily closely as η → 0
Stochastic can be much faster when D is very large
Intermediate approach: use error over subsets of D
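A minimal sketch contrasting the two update schemes, applied to logistic-regression MCLE (descent on the negative conditional log likelihood is the same as ascent on ℓ(W)). It assumes the common convention P(Y=1|x) = 1/(1+exp(−w·x)) with the intercept folded into w; the step size η and iteration counts are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, eta=0.1, n_iters=1000):
    """Batch MCLE for logistic regression with P(Y=1|x) = sigmoid(w . x).
    X: (L, n) matrix whose first column is all 1s (intercept); y: (L,) labels in {0,1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(X @ w)          # P(Y=1 | x_l, w) for every training example
        grad = X.T @ (y - p1)        # gradient of the conditional log likelihood
        w += (eta / len(y)) * grad   # step uphill; averaging keeps eta's scale stable
    return w

def stochastic_gradient_ascent(X, y, eta=0.1, n_iters=10000, seed=0):
    """Same objective, but each step uses one randomly chosen example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        l = rng.integers(len(y))         # choose (with replacement) a random example
        p1 = sigmoid(X[l] @ w)
        w += eta * (y[l] - p1) * X[l]    # single-example gradient step
    return w
```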
Maximize Conditional Log Likelihood: Gradient Ascent

∂l(W)/∂wi = Σl Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )

Gradient ascent algorithm: iterate until change < ε
For all i, repeat: wi ← wi + η Σl Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )

That's all for M(C)LE. How about MAP?
One common approach is to define priors on W: Normal distribution, zero mean, identity covariance
Helps avoid very large weights and overfitting
MAP estimate: let's assume Gaussian prior W ~ N(0, σI)

MLE vs MAP
Maximum conditional likelihood estimate: W ← arg maxW Πl P(Y^l | X^l, W)
Maximum a posteriori estimate with prior W ~ N(0, σI): W ← arg maxW P(W) Πl P(Y^l | X^l, W)

MAP estimates and Regularization
Maximum a posteriori estimate with prior W ~ N(0, σI): the ln P(W) term is called a "regularization" term
helps reduce overfitting
keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
used very frequently in Logistic Regression
(a regularized-update sketch follows "The Bottom Line" slide below)

The Bottom Line

Consider learning f: X → Y, where
X is a vector of real-valued features, < X1 … Xn >
Y is boolean
assume all Xi are conditionally independent given Y
model P(Xi | Y = yk) as Gaussian N(μik, σi)
model P(Y) as Bernoulli(π)
Then P(Y|X) is of this form, and we can directly estimate W
Furthermore, the same holds if the Xi are boolean. Try proving that to yourself.
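As promised above, a sketch of the MAP/regularized variant. Under a zero-mean Gaussian prior on W, the objective gains a ln P(W) term proportional to −‖W‖², so each gradient step also shrinks the weights toward zero. The sign convention, intercept handling, and step-size choices match the earlier batch sketch; `lam` is an assumed name for the penalty strength (roughly 1/σ² for prior covariance σ²I, up to the averaging constant):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gradient_ascent(X, y, lam=0.1, eta=0.1, n_iters=1000):
    """MAP logistic regression with a zero-mean Gaussian prior on the weights.
    Identical to the MCLE batch loop except for the `- lam * w` shrinkage term."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(X @ w)
        grad = X.T @ (y - p1) - lam * w   # d/dw [ log likelihood + log prior ]
        w += (eta / len(y)) * grad        # larger lam keeps weights nearer to zero
    return w
```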
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: XY, or P(Y|X)
Generative classifiers (e.g., Naive Bayes)
Assume some functional form for P(X|Y), P(X)
Estimate parameters of P(X|Y), P(X) directly from training data
Use Bayes rule to calculate P(Y|X= xi)
Discriminative classifiers (e.g., Logistic regression)
Assume some functional form for P(Y|X)
Estimate parameters of P(Y|X) directly from training data
Use Naive Bayes or Logistic Regression?
Consider
Restrictiveness of modeling assumptions
Rate of convergence (in amount of training data) toward asymptotic hypothesis
Naive Bayes vs Logistic Regression
Consider Y boolean, Xi continuous, X = < X1 … Xn >
Number of parameters: NB: 4n +1
LR: n+1
Estimation method:
NB parameter estimates are uncoupled
LR parameter estimates are coupled
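To see the "uncoupled vs. coupled" point concretely, here is a sketch of GNB parameter estimation: each of the 4n+1 parameters is a separate closed-form statistic of the data, whereas LR's n+1 weights have no closed form and must be fit jointly (e.g., by the gradient loop sketched earlier). Function name and shapes are illustrative assumptions:

```python
import numpy as np

def fit_gnb(X, y):
    """Uncoupled GNB estimates for boolean y and continuous X of shape (L, n):
    2n means + 2n std devs + 1 prior = 4n + 1 parameters, each a closed-form statistic."""
    pi = y.mean()                                               # estimate of P(Y=1)
    mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])     # mu_ik, shape (2, n)
    sigma = np.array([X[y == k].std(axis=0) for k in (0, 1)])   # sigma_ik, shape (2, n)
    return pi, mu, sigma
# LR's weights, by contrast, are coupled through the shared normalizer in P(Y|X),
# so they are estimated jointly by iterative optimization rather than one at a time.
```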
G.Naive Bayes vs. Logistic Regression
[Ng & Jordan, 2002]
Recall the two assumptions used in deriving the form of LR from GNB:
1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μik, σi), not N(μik, σik)
Consider three learning methods:
GNB (assumption 1 only): decision surface can be non-linear
GNB2 (assumptions 1 and 2): decision surface linear
LR: decision surface linear, trained without assumption 1
Which method works better if we have infinite training data, and
Both (1) and (2) are satisfied: LR = GNB2 = GNB
(1) is satisfied, but not (2): GNB > GNB2, GNB > LR, LR > GNB2
Neither (1) nor (2) is satisfied: GNB > GNB2, LR > GNB2, LR > GNB
So, GNB requires n = O(log d) training examples to approach its asymptotic error, but LR requires n = O(d), where d is the number of features
Some experiments from UCI data sets
[Ng & Jordan, 2002]
Naive Bayes vs. Logistic Regression
The bottom line:
GNB2 and LR both use linear decision surfaces, GNB need not
Given infinite data, LR is better or equal to GNB2 because training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).
But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error
And GNB is both more biased (assumption 1) and less biased (no assumption 2) than LR, so either might outperform the other
What you should know:
Logistic regression
Functional form follows from Naive Bayes assumptions
For Gaussian Naive Bayes assuming variance σi,k = σi
For discrete-valued Naive Bayes too
But training procedure picks parameters without making conditional independence assumption
MLE training: pick W to maximize P(Y | X, W)
MAP training: pick W to maximize P(W | X, Y)
regularization helps reduce overfitting
Gradient ascent/descent
General approach when closed-form solutions unavailable
Generative vs. Discriminative classifiers
Bias vs. variance tradeoff
extra slides
What is the minimum possible error?
Best case:
conditional independence assumption is satisfied
we know P(Y), P(X|Y) perfectly (e.g., infinite training data)
Questions to think about:
Can you use Naive Bayes for a combination of discrete and real-valued Xi?
How can we easily model the assumption that just 2 of the n attributes are dependent?
What does the decision surface of a Naive Bayes classifier look like?
How would you select a subset of Xis?