[SOLVED] CS algorithm Machine Learning 10-601


Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
February 4, 2015
Today:
Generative / discriminative classifiers
Linear regression
Decomposition of error into bias, variance, unavoidable
Readings:
Mitchell: Naive Bayes and Logistic Regression
(required)
Ng and Jordan paper (optional)
Bishop, Ch 9.1, 9.2 (optional)

Logistic Regression
Consider learning f: X → Y, where
X is a vector of real-valued features, <X1 … Xn>
Y is boolean
assume all Xi are conditionally independent given Y
model P(Xi | Y = yk) as Gaussian N(μ_ik, σ_i)
model P(Y) as Bernoulli(π)
Then P(Y|X) is of this form, and we can directly estimate W
Furthermore, the same holds if the Xi are boolean (try proving that to yourself)
Train by gradient ascent estimation of the w's (no assumptions!); a short sketch follows below
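As a concrete illustration of that last bullet, here is a minimal NumPy sketch of training the weights by gradient ascent on the conditional log-likelihood. The function names, step size, and iteration count are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    # P(Y=1 | X, W) under the logistic form derived above
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_gradient_ascent(X, y, learning_rate=0.1, n_steps=1000):
    """Maximize the conditional log-likelihood sum_l ln P(y^l | x^l, W) by gradient ascent.

    X: (m, n) real-valued features; y: (m,) labels in {0, 1}.
    Returns weights w (n,) and intercept w0.
    """
    m, n = X.shape
    w = np.zeros(n)
    w0 = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + w0)       # current estimate of P(Y=1 | x^l, W) for each example
        error = y - p                 # the (y^l - P_hat) term of the gradient
        w += learning_rate * (X.T @ error) / m   # averaged gradient step
        w0 += learning_rate * error.mean()
    return w, w0
```

Given a feature matrix X and 0/1 labels y, `train_logistic_gradient_ascent(X, y)` returns fitted parameters; prediction is then `sigmoid(X @ w + w0) > 0.5`.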

MLE vs MAP
Maximum conditional likelihood estimate
Maximum a posteriori estimate with prior W ~ N(0, σI)
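Written out, with training examples indexed by l (the indexing notation is an assumption; the estimates themselves follow the Mitchell chapter listed in the readings):

```latex
% Maximum conditional likelihood estimate
W_{MCLE} = \arg\max_W \prod_{l} P\!\left(y^{l} \mid x^{l}, W\right)

% Maximum a posteriori estimate, weighting the same likelihood by a prior, e.g. W \sim N(0, \sigma I)
W_{MAP} = \arg\max_W \; P(W) \prod_{l} P\!\left(y^{l} \mid x^{l}, W\right)
```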

MAP estimates and Regularization
Maximum a posteriori estimate with prior W ~ N(0, σI)
called a regularization term
helps reduce overfitting, especially when training data is sparse
keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or wherever the prior suggests
used very frequently in Logistic Regression; the regularized update is written out just below
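Concretely, the log of a zero-mean Gaussian prior contributes a term proportional to minus the squared weights, so the gradient-ascent update picks up a weight-shrinkage term. In the sketch below, λ stands in for the prior's precision and η for the step size (symbols not shown on the slide):

```latex
w_i \leftarrow w_i + \eta \left( \sum_{l} X_i^{l} \Big( y^{l} - \hat{P}\big(Y^{l}=1 \mid X^{l}, W\big) \Big) \;-\; \lambda\, w_i \right)
```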

Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naive Bayes)
Assume some functional form for P(Y), P(X|Y)
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y=y |X= x)
Discriminative classifiers (e.g., Logistic regression)
Assume some functional form for P(Y|X)
Estimate parameters of P(Y|X) directly from training data
NOTE: even though our derivation of the form of P(Y|X) made GNB-style assumptions, the training procedure for Logistic Regression does not!
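To make the generative recipe concrete, here is a small NumPy sketch of a Gaussian Naive Bayes classifier that estimates P(Y) and each P(Xi | Y) from data and then classifies with Bayes rule. Binary labels, per-class variances, and the small smoothing constant are assumptions of this sketch.

```python
import numpy as np

def train_gnb(X, y):
    """Generative training: estimate P(Y) and each P(X_i | Y) directly from the data.

    Each P(X_i | Y = k) is modeled as a univariate Gaussian N(mu_ik, sigma_ik).
    """
    prior, params = {}, {}
    for k in (0, 1):
        Xk = X[y == k]
        prior[k] = len(Xk) / len(X)                            # P(Y = k)
        params[k] = (Xk.mean(axis=0), Xk.std(axis=0) + 1e-9)   # per-feature mu, sigma
    return prior, params

def predict_gnb(x, prior, params):
    """Classification via Bayes rule: argmax_k P(Y=k) * prod_i P(x_i | Y=k).

    Conditional independence lets the class-conditional likelihood factor over features,
    so we just sum per-feature Gaussian log-densities.
    """
    scores = {}
    for k in (0, 1):
        mu, sigma = params[k]
        log_lik = -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))
        scores[k] = np.log(prior[k]) + log_lik
    return max(scores, key=scores.get)
```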

Use Naive Bayes or Logistic Regression?
Consider
Restrictiveness of modeling assumptions (how well can we learn with infinite data?)
Rate of convergence (in amount of training data) toward asymptotic (infinite data) hypothesis
i.e., the learning curve

Naive Bayes vs Logistic Regression
Consider Y boolean, Xi continuous, X = <X1 … Xn>
Number of parameters:
NB: 4n + 1
LR: n + 1
Estimation method:
NB parameter estimates are uncoupled
LR parameter estimates are coupled
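One way to arrive at those counts, assuming binary Y and n continuous features: GNB estimates a mean and a variance for each feature under each class (2n + 2n = 4n parameters) plus the single prior P(Y = 1), for 4n + 1 in total; LR estimates one weight per feature plus an intercept, for n + 1.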

Gaussian Naive Bayes Big Picture
assume P(Y=1) = 0.5

G.Naive Bayes vs. Logistic Regression
[Ng & Jordan, 2002]
Recall the two assumptions used to derive the form of LR from GNBayes:
1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μ_ik, σ_i), not N(μ_ik, σ_ik)
Consider three learning methods:
GNB (assumption 1 only)
GNB2 (assumptions 1 and 2)
LR
Which method works better if we have infinite training data, and Both (1) and (2) are satisfied
Neither (1) nor (2) is satisfied
(1) is satisfied, but not (2)

G.Naive Bayes vs. Logistic Regression
[Ng & Jordan, 2002]
Recall the two assumptions used to derive the form of LR from GNBayes:
1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μ_ik, σ_i), not N(μ_ik, σ_ik)
Consider three learning methods:
GNB (assumption 1 only): decision surface can be non-linear
GNB2 (assumptions 1 and 2): decision surface linear
LR: decision surface linear, trained differently
Which method works better if we have infinite training data, and
Both (1) and (2) are satisfied: LR = GNB2 = GNB
Neither (1) nor (2) is satisfied: LR > GNB2, GNB > GNB2
(1) is satisfied, but not (2): GNB > LR, LR > GNB2

G.Naive Bayes vs. Logistic Regression
[Ng & Jordan, 2002]
What if we have only finite training data?
They converge at different rates to their asymptotic (infinite data) error
Let ε_{A,n} refer to the expected error of learning algorithm A after n training examples
Let d be the number of features
So GNB requires n = O(log d) examples to converge to its asymptotic error, but LR requires n = O(d)

Some experiments from UCI data sets
[Ng & Jordan, 2002]
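As a rough stand-in for those experiments, the sketch below compares learning curves for a generative and a discriminative classifier in the spirit of Ng & Jordan. The scikit-learn dataset and settings are illustrative assumptions, not the paper's protocol.

```python
# Compare how quickly a generative (GNB) and a discriminative (LR) classifier
# approach their asymptotic accuracy as the training set grows.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)
sizes = np.linspace(0.1, 1.0, 8)

for name, model in [("GNB", GaussianNB()),
                    ("LR", LogisticRegression(max_iter=5000))]:
    train_sizes, _, test_scores = learning_curve(model, X, y, train_sizes=sizes, cv=5)
    # mean cross-validated accuracy at each training-set size
    print(name, [f"{scores.mean():.3f}" for scores in test_scores])
```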

Naive Bayes vs. Logistic Regression
The bottom line:
GNB2 and LR both use linear decision surfaces; GNB's need not be linear
Given infinite data, LR is better than GNB2 because its training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).
But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error
And GNB is both more biased than LR (assumption 1) and less biased (it does not make assumption 2, so its decision surface need not be linear), so either might beat the other

Rate of convergence: logistic regression
[Ng & Jordan, 2002]
Let h_{Dis,m} be logistic regression trained on m examples in n dimensions. Then with high probability:
Implication: if we want ε(h_{Dis,m}) ≤ ε(h_{Dis,∞}) + ε₀ for some constant ε₀, it suffices to pick on the order of n examples
So logistic regression converges to its asymptotic classifier in order n examples (the result follows from Vapnik's structural risk bound, plus the fact that the VC dimension of n-dimensional linear separators is n)

Rate of convergence: naive Bayes parameters
[Ng & Jordan, 2002]

What you should know:
Logistic regression
Functional form follows from Naive Bayes assumptions
For Gaussian Naive Bayes assuming variance σ_{i,k} = σ_i
For discrete-valued Naive Bayes too
But training procedure picks parameters without the conditional independence assumption
MCLE training: pick W to maximize P(Y | X, W)
MAP training: pick W to maximize P(W | X, Y)
regularization: e.g., P(W) ~ N(0, σI)
helps reduce overfitting
Gradient ascent/descent
General approach when closed-form solutions for MLE, MAP are unavailable
Generative vs. Discriminative classifiers
Bias vs. variance tradeoff

Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
February 4, 2015
Today:
Linear regression
Decomposition of error into bias, variance, unavoidable
Readings:
Mitchell: Naive Bayes and Logistic Regression
(see class website)
Ng and Jordan paper (class website)
Bishop, Ch 9.1, 9.2

Regression
So far, we've been interested in learning P(Y|X) where Y has discrete values (called classification)
What if Y is continuous? (called regression)
predict weight from gender, height, age, ...
predict Google stock price today from Google, Yahoo, MSFT prices yesterday
predict each pixel intensity in the robot's current camera image, from the previous image and previous action

Regression
Wish to learn f: X → Y, where Y is real, given training examples {<x^l, y^l>}
Approach:
1. choose some parameterized form for P(Y|X; θ) (θ is the vector of parameters)
2. derive learning algorithm as MCLE or MAP estimate for θ

1. Choose parameterized form for P(Y|X; θ)
[Figure: training data plotted as Y against X]
Assume Y is some deterministic f(X), plus random noise ε, where ε ~ N(0, σ²)
Therefore Y is a random variable that follows the distribution p(y|x) = N(f(x), σ²),
and the expected value of y for any given x is f(x)

Consider Linear Regression
E.g., assume f(x) is a linear function of x, f(x) = w0 + Σ_i w_i x_i
Notation: to make our parameters W = <w0, w1, ...> explicit, let's write p(y | x; W)

Training Linear Regression
How can we learn W from the training data? Learn Maximum Conditional Likelihood Estimate!
where

Training Linear Regression
Learn Maximum Conditional Likelihood Estimate where
so:
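Under the Gaussian-noise model above, maximizing the conditional likelihood reduces to minimizing the sum of squared errors (this is the step the "so:" points to, and it matches point 1 of "What you should know" below):

```latex
W_{MCLE} = \arg\max_W \prod_{l} \frac{1}{\sqrt{2\pi\sigma^{2}}}
           \exp\!\left( -\frac{\big(y^{l} - f(x^{l}; W)\big)^{2}}{2\sigma^{2}} \right)
         = \arg\min_W \sum_{l} \big( y^{l} - f(x^{l}; W) \big)^{2}
```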

Training Linear Regression
Learn Maximum Conditional Likelihood Estimate
Can we derive a gradient descent rule for training? A small sketch follows below.
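A minimal NumPy sketch of such a gradient descent rule, assuming the linear form f(x; W) = w0 + w·x; the step size and iteration count are illustrative assumptions.

```python
import numpy as np

def train_linear_regression_gd(X, y, learning_rate=0.01, n_steps=5000):
    """Minimize the sum of squared errors by gradient descent.

    The gradient of sum_l (y^l - w.x^l - w0)^2 with respect to w is
    -2 * sum_l x^l * (y^l - prediction); the factor of 2 is absorbed into the step size.
    """
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(n_steps):
        residual = y - (X @ w + w0)                  # y^l minus current prediction
        w += learning_rate * (X.T @ residual) / m    # step along the negative gradient
        w0 += learning_rate * residual.mean()
    return w, w0
```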

How about MAP instead of MLE estimate?
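With a zero-mean Gaussian prior on W (as in the logistic regression case), the MAP estimate adds a squared-weight penalty to the squared-error objective; for the linear form this is ridge regression, with the familiar closed-form solution. Here λ collects the noise and prior variances, and X stacks the training inputs as rows:

```latex
W_{MAP} = \arg\min_W \sum_{l} \big( y^{l} - W^{\top} x^{l} \big)^{2} + \lambda \lVert W \rVert^{2}
        = \big( X^{\top} X + \lambda I \big)^{-1} X^{\top} y
```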

Regression What you should know
Under the general assumption that y = f(x; W) + ε with Gaussian noise ε ~ N(0, σ²):
1.MLE corresponds to minimizing sum of squared prediction errors
2.MAP estimate minimizes SSE plus sum of squared weights
3.Again, learning is an optimization problem once we choose our objective function
maximize data likelihood
maximize posterior prob of W
4.Again, we can use gradient descent as a general learning algorithm
as long as our objective fn is differentiable wrt W
though we might converge to local optima instead of the global optimum
5.Almost nothing we said here required that f(x) be linear in x

Bias/Variance Decomposition of Error

Bias and Variance
given some estimator Y for some parameter θ, we define
the bias of estimator Y = E[Y] − θ
the variance of estimator Y = E[(Y − E[Y])²]
e.g., define Y as the MLE estimator for probability of heads, based on n independent coin flips
biased or unbiased?
variance decreases as 1/n (standard deviation as sqrt(1/n))
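Working the coin-flip example out, with Y = (1/n) Σ_j X_j and each flip X_j ~ Bernoulli(θ):

```latex
E[Y] = \theta \;\Rightarrow\; \text{bias} = E[Y] - \theta = 0 \quad \text{(the MLE is unbiased)}
\qquad
\mathrm{Var}(Y) = \frac{\theta(1-\theta)}{n}
```

so the variance shrinks like 1/n and the standard deviation like 1/sqrt(n).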

Bias Variance decomposition of error
Reading: Bishop chapter 9.1, 9.2
Consider a simple regression problem f: X → Y
y = f(x) + ε, where the noise ε ~ N(0, σ²) and f(x) is deterministic
What are the sources of error in our prediction, given our learned estimate h(x) of f(x)?

Sources of error
What if we have a perfect learner and infinite data?
Our learned h(x) satisfies h(x) = f(x)
We still have remaining, unavoidable error: σ²

Sources of error
What if we have only n training examples? What is our expected error,
taken over random training sets of size n, drawn from distribution D = p(x, y)?

Sources of error
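Putting the pieces together, the standard decomposition of the expected squared error at a point x, taken over training sets D and the noise ε, is:

```latex
E_{D,\epsilon}\big[(y - h_D(x))^2\big]
  = \underbrace{\sigma^{2}}_{\text{unavoidable error}}
  + \underbrace{\big(E_D[h_D(x)] - f(x)\big)^{2}}_{\text{bias}^{2}}
  + \underbrace{E_D\Big[\big(h_D(x) - E_D[h_D(x)]\big)^{2}\Big]}_{\text{variance}}
```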
