Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
February 2, 2015
Today:
•Logistic regression
•Generative/Discriminative classifiers
Readings (see class website):
Required:
• Mitchell: “Naïve Bayes and Logistic Regression”
Optional:
• Ng & Jordan, 2002

Announcements
•HW3 due Wednesday Feb 4
•HW4 will be handed out next Monday Feb 9
•new reading available:
–Estimating Probabilities: MLE and MAP (Mitchell)
–see Lecture tab of class website
•required reading for today:
–Naïve Bayes and Logistic Regression (Mitchell)

Gaussian Naïve Bayes – Big Picture
Example: Y = PlayBasketball (boolean), X1 = Height, X2 = MLgrade; assume P(Y=1) = 0.5. Classify a new X by choosing the value of Y that maximizes P(Y) Πi P(Xi | Y).
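As a rough sketch of this picture in code, the following assumes hypothetical per-class Gaussian parameters for Height and MLgrade (the numbers are illustrative only, not from the lecture) and applies Bayes rule under the conditional-independence assumption:

```python
import math

# Hypothetical Gaussian Naive Bayes parameters (illustrative only):
# for each feature and class, a (mean, standard deviation) pair.
PRIOR_Y1 = 0.5                      # P(Y=1), as assumed on the slide
PARAMS = {
    1: {"height": (185.0, 8.0), "ml_grade": (88.0, 6.0)},   # Y = 1 (plays basketball)
    0: {"height": (172.0, 9.0), "ml_grade": (80.0, 10.0)},  # Y = 0
}

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posterior_y1(height, ml_grade):
    """P(Y=1 | X) via Bayes rule, multiplying per-feature likelihoods
    (the conditional-independence assumption)."""
    def joint(y):
        prior = PRIOR_Y1 if y == 1 else 1.0 - PRIOR_Y1
        mu_h, sd_h = PARAMS[y]["height"]
        mu_g, sd_g = PARAMS[y]["ml_grade"]
        return prior * gaussian_pdf(height, mu_h, sd_h) * gaussian_pdf(ml_grade, mu_g, sd_g)

    j1, j0 = joint(1), joint(0)
    return j1 / (j1 + j0)

# Probability in (0,1); classify as Y=1 whenever it exceeds 0.5.
print(posterior_y1(height=180.0, ml_grade=85.0))
```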

Logistic Regression
Idea:
•Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
•Why not learn P(Y|X) directly?

• Consider learning f: X → Y, where
 • X is a vector of real-valued features, < X1 … Xn >
 • Y is boolean
• Assume all Xi are conditionally independent given Y
• Model P(Xi | Y = yk) as Gaussian N(μik, σi)
• Model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?

Derive form for P(Y|X) for Gaussian P(Xi|Y=yk) assuming σik = σi
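One way to carry out this derivation, under the assumptions listed above (writing π = P(Y=1)), is sketched below:

```latex
\begin{align*}
P(Y{=}1\mid X)
  &= \frac{P(Y{=}1)\prod_i P(X_i\mid Y{=}1)}
          {P(Y{=}1)\prod_i P(X_i\mid Y{=}1) + P(Y{=}0)\prod_i P(X_i\mid Y{=}0)} \\
  &= \frac{1}{1+\exp\!\Big(\ln\tfrac{1-\pi}{\pi}
        + \sum_i \ln\tfrac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}\Big)}
  \qquad\text{(Bayes rule + conditional independence)} \\[4pt]
\ln\frac{P(X_i\mid Y{=}0)}{P(X_i\mid Y{=}1)}
  &= \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i
     + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}
  \qquad\text{(shared variance } \sigma_{ik}=\sigma_i\text{)} \\[4pt]
\Rightarrow\quad
P(Y{=}1\mid X) &= \frac{1}{1+\exp\big(-(w_0+\sum_i w_i X_i)\big)},
\quad w_i=\frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2},
\quad w_0=\ln\frac{\pi}{1-\pi}+\sum_i\frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}
\end{align*}
```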

Very convenient!

P(Y=1|X) = 1 / (1 + exp(−(w0 + Σi wi Xi)))

implies

P(Y=0|X) = exp(−(w0 + Σi wi Xi)) / (1 + exp(−(w0 + Σi wi Xi)))

implies

P(Y=1|X) / P(Y=0|X) = exp(w0 + Σi wi Xi)

implies

ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σi wi Xi

linear classification rule! (classify Y=1 exactly when w0 + Σi wi Xi > 0)
Logistic function
The logistic (sigmoid) function σ(z) = 1/(1 + e^−z) maps any real z into (0,1); with z = w0 + Σi wi Xi this is exactly the form of P(Y=1|X) above.

Logistic regression more generally
• Logistic regression when Y is not boolean (but still discrete-valued).
• Now y ∈ {y1 … yR}: learn R−1 sets of weights, one for each class k < R, where
 for k < R: P(Y=yk|X) = exp(wk0 + Σi wki Xi) / (1 + Σj<R exp(wj0 + Σi wji Xi))
 for k = R: P(Y=yR|X) = 1 / (1 + Σj<R exp(wj0 + Σi wji Xi))
• The weights are chosen to maximize the conditional likelihood of the training data:
 • Training data D = { <X^1, Y^1>, … , <X^L, Y^L> }
 • Data likelihood = Πl P(X^l, Y^l | W)
 • Data conditional likelihood = Πl P(Y^l | X^l, W)
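A small sketch of this R-class parameterization, assuming the weight vectors are already given (the function name and the numbers in the example call are illustrative):

```python
import math

def multiclass_lr_probs(x, weight_sets):
    """P(Y=y_k | x) for k = 1..R, given R-1 weight vectors.

    x           : list of n feature values
    weight_sets : list of R-1 lists, each [w_k0, w_k1, ..., w_kn]
    Class R is the reference class, with an implicit score of 0.
    (No overflow guard; a real implementation would subtract the max score.)
    """
    scores = [w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) for w in weight_sets]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]   # classes 1 .. R-1
    probs.append(1.0 / denom)                       # reference class R
    return probs                                     # sums to 1

# Illustrative call: 3 classes, 2 features, so R-1 = 2 weight vectors.
print(multiclass_lr_probs([1.0, 2.0], [[0.1, 0.5, -0.2], [-0.3, 0.0, 0.4]]))
```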

Expressing Conditional Log Likelihood
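With P(Y=1|X) in the logistic form above, the conditional log likelihood of the training data expands to:

```latex
\begin{align*}
\ell(W) &= \sum_l Y^l \ln P(Y^l{=}1\mid X^l,W) + (1-Y^l)\,\ln P(Y^l{=}0\mid X^l,W) \\
        &= \sum_l Y^l\Big(w_0+\sum_i w_i X_i^l\Big)
           - \ln\Big(1+\exp\big(w_0+\sum_i w_i X_i^l\big)\Big)
\end{align*}
```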

Maximizing Conditional Log Likelihood
Good news: l(W) is concave function of W
Bad news: no closed-form solution to maximize l(W)

Gradient Descent:

Batch gradient: use error over the entire training set D
Do until satisfied:
 1. Compute the gradient ∇E_D[w] = [ ∂E_D[w]/∂w0, … , ∂E_D[w]/∂wn ]
 2. Update the vector of parameters: w ← w − η ∇E_D[w]

Stochastic gradient: use error over single examples
Do until satisfied:
 1. Choose (with replacement) a random training example l
 2. Compute the gradient just for that example: ∇E_l[w]
 3. Update the vector of parameters: w ← w − η ∇E_l[w]

Stochastic approximates Batch arbitrarily closely as η → 0
Stochastic can be much faster when D is very large
Intermediate approach: use error over subsets (mini-batches) of D
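A minimal sketch contrasting the three update schemes, using a toy squared-error objective on a one-parameter linear model; the step size eta, iteration counts, and toy data are illustrative assumptions, not values from the lecture:

```python
import random

# Toy data for illustration: y is roughly 2*x (not from the lecture).
DATA = [(x, 2.0 * x + random.uniform(-0.1, 0.1)) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

def grad_single(w, example):
    """Gradient of the squared error 0.5*(w*x - y)^2 for one example."""
    x, y = example
    return (w * x - y) * x

def batch_gd(w=0.0, eta=0.05, n_steps=200):
    """Batch: each step uses the gradient of the error over all of D."""
    for _ in range(n_steps):
        w -= eta * sum(grad_single(w, ex) for ex in DATA)
    return w

def stochastic_gd(w=0.0, eta=0.05, n_steps=1000):
    """Stochastic: each step uses a single randomly chosen example."""
    for _ in range(n_steps):
        w -= eta * grad_single(w, random.choice(DATA))   # with replacement
    return w

def minibatch_gd(w=0.0, eta=0.05, n_steps=500, batch_size=2):
    """Intermediate approach: each step uses a small random subset of D."""
    for _ in range(n_steps):
        batch = random.sample(DATA, batch_size)
        w -= eta * sum(grad_single(w, ex) for ex in batch)
    return w

print(batch_gd(), stochastic_gd(), minibatch_gd())   # all end up near 2.0
```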

Maximize Conditional Log Likelihood: Gradient Ascent

∂ℓ(W)/∂wi = Σl Xi^l ( Y^l − P(Y^l = 1 | X^l, W) )   (treat X0^l = 1 so that w0 is handled the same way)

Gradient ascent algorithm: iterate until change < ε
For all i, repeat:
 wi ← wi + η Σl Xi^l ( Y^l − P(Y^l = 1 | X^l, W) )

That’s all for M(C)LE. How about MAP?
• One common approach is to define priors on W
 – Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate: W ← arg max_W ln [ P(W) Πl P(Y^l | X^l, W) ]
• let’s assume Gaussian prior: W ~ N(0, σ)

MLE vs MAP
• Maximum conditional likelihood estimate: W ← arg max_W Σl ln P(Y^l | X^l, W)
• Maximum a posteriori estimate with prior W ~ N(0, σI): W ← arg max_W [ ln P(W) + Σl ln P(Y^l | X^l, W) ]

MAP estimates and Regularization
• Maximum a posteriori estimate with prior W ~ N(0, σI): W ← arg max_W [ ln P(W) + Σl ln P(Y^l | X^l, W) ]
 – the ln P(W) term contributes a penalty proportional to −Σi wi², called a “regularization” term
• helps reduce overfitting
• keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
• used very frequently in Logistic Regression

The Bottom Line
• Consider learning f: X → Y, where
 • X is a vector of real-valued features, < X1 … Xn >
 • Y is boolean
• Assume all Xi are conditionally independent given Y
• Model P(Xi | Y = yk) as Gaussian N(μik, σi)
• Model P(Y) as Bernoulli(π)
• Then P(Y|X) is of the logistic form above, and we can directly estimate W
• Furthermore, the same holds if the Xi are boolean; try proving that to yourself
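A compact sketch of M(C)LE and MAP training by gradient ascent, following the update rule above; the learning rate, iteration count, prior variance, and toy data are illustrative assumptions:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(x, w):
    """P(Y=1 | x, w) with w = [w0, w1, ..., wn]."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def train_logistic(data, eta=0.1, n_iters=500, prior_sigma2=None):
    """Batch gradient ascent on the conditional log likelihood.

    data         : list of (x, y) pairs, x a list of n features, y in {0, 1}
    prior_sigma2 : if set, adds the zero-mean Gaussian-prior (MAP / L2) penalty,
                   whose gradient contribution is -w_i / sigma^2
    """
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(n_iters):
        grad = [0.0] * (n + 1)
        for x, y in data:
            err = y - p_y1(x, w)                 # Y^l - P(Y^l = 1 | X^l, W)
            grad[0] += err                       # X_0 = 1 for the intercept
            for i, xi in enumerate(x, start=1):
                grad[i] += err * xi
        if prior_sigma2 is not None:             # MAP: pull weights toward zero
            for i in range(1, n + 1):            # (intercept left unregularized, a common choice)
                grad[i] -= w[i] / prior_sigma2
        w = [wi + eta * g for wi, g in zip(w, grad)]   # ascent step
    return w

# Toy one-feature data (illustrative only).
random.seed(0)
data = [([x], 1 if x > 1.0 else 0) for x in [random.uniform(0, 2) for _ in range(40)]]
w_mle = train_logistic(data)
w_map = train_logistic(data, prior_sigma2=1.0)
print(w_mle, w_map)   # MAP weights are typically smaller in magnitude
```

Setting prior_sigma2 corresponds to the zero-mean Gaussian prior above; a smaller value means a stronger pull toward zero, i.e. more regularization.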

Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes)
•Assume some functional form for P(X|Y), P(Y)
•Estimate parameters of P(X|Y), P(Y) directly from training data
•Use Bayes rule to calculate P(Y | X = xi)
Discriminative classifiers (e.g., Logistic regression)
•Assume some functional form for P(Y|X)
•Estimate parameters of P(Y|X) directly from training data

Use Naïve Bayes or Logistic Regression?
Consider
•Restrictiveness of modeling assumptions
•Rate of convergence (in amount of training data) toward asymptotic hypothesis

Naïve Bayes vs Logistic Regression
Consider Y boolean, Xi continuous, X = < X1 … Xn >
Number of parameters to estimate:
• NB: ?
• LR: ?

Naïve Bayes vs Logistic Regression
Consider Y boolean, Xi continuous, X = < X1 … Xn >
Number of parameters:
• NB: 4n + 1 (for each feature i: μi0, μi1, σi0, σi1, plus the prior π)
• LR: n + 1 (weights w1 … wn plus w0)
Estimation method:
• NB parameter estimates are uncoupled (each has its own closed-form MLE; see below)
• LR parameter estimates are coupled (all weights appear in the same conditional likelihood, so they must be fit jointly, e.g. by gradient ascent)
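To make “uncoupled” concrete, the maximum likelihood estimates of the Naïve Bayes parameters are simple per-class, per-feature averages, each computable independently of the others:

```latex
\hat{\pi} = \frac{1}{L}\sum_{l=1}^{L} \mathbf{1}[Y^l{=}1],
\qquad
\hat{\mu}_{ik} = \frac{\sum_l \mathbf{1}[Y^l{=}y_k]\,X_i^l}{\sum_l \mathbf{1}[Y^l{=}y_k]},
\qquad
\hat{\sigma}_{ik}^2 = \frac{\sum_l \mathbf{1}[Y^l{=}y_k]\,\big(X_i^l-\hat{\mu}_{ik}\big)^2}{\sum_l \mathbf{1}[Y^l{=}y_k]}
```

Each estimate depends only on its own slice of the data, whereas the LR weights all sit inside the same P(Y|X, W) and must be optimized together.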

G. Naïve Bayes vs. Logistic Regression
[Ng & Jordan, 2002]
Recall the two assumptions used to derive the form of LR from GNB:
1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μik, σi), i.e. not N(μik, σik)
Consider three learning methods:
• GNB (assumption 1 only): decision surface can be non-linear
• GNB2 (assumptions 1 and 2): decision surface linear
• LR: decision surface linear, trained without assumption 1
Which method works better if we have infinite training data, and…
• Both (1) and (2) are satisfied: LR = GNB2 = GNB
• (1) is satisfied, but not (2): GNB > GNB2, GNB > LR, LR > GNB2
• Neither (1) nor (2) is satisfied: GNB > GNB2, LR > GNB2, LR > GNB
So, GNB requires n = O(log d) training examples to converge to its asymptotic error, but LR requires n = O(d) (n = number of training examples, d = number of features)

Some experiments from UCI data sets
[Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression
The bottom line:
GNB2 and LR both use linear decision surfaces, GNB need not
Given infinite data, LR is better or equal to GNB2 because training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).
But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error
And GNB is both more biased (assumption 1) and less biased (no assumption 2) than LR, so either might outperform the other

What you should know:
• Logistic regression
 – Functional form follows from Naïve Bayes assumptions
  • For Gaussian Naïve Bayes assuming variance σi,k = σi
  • For discrete-valued Naïve Bayes too
 – But training procedure picks parameters without making conditional independence assumption
 – MLE training: pick W to maximize P(Y | X, W)
 – MAP training: pick W to maximize P(W | X, Y)
  • ‘regularization’
  • helps reduce overfitting
• Gradient ascent/descent
 – General approach when closed-form solutions unavailable
• Generative vs. Discriminative classifiers
 – Bias vs. variance tradeoff

extra slides

What is the minimum possible error?
Best case:
• conditional independence assumption is satisfied
• we know P(Y), P(X|Y) perfectly (e.g., infinite training data)

Questions to think about:
•Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
• How can we easily model the assumption that just 2 of the n attributes are dependent?
•What does the decision surface of a Naïve Bayes classifier look like?
•How would you select a subset of Xi’s?
