CS 4610/5335 Logistic Regression
Robert Platt Northeastern University
Material adapted from:
1. Lawson Wong, CS 5100
Classification
Use features (x) to predict targets (y)
Targets y are now either:
Binary: {0, 1}
Multi-class: {1, 2, ..., K}
We will focus on the binary case (Ex5 Q6 covers multi-class).
Classification
Focus: supervised learning (e.g., regression, classification). Use features (x) to predict targets (y).
Input: dataset of n samples {x^(i), y^(i)}, i = 1, ..., n
Each x^(i) is a p-dimensional vector of feature values.
Output: hypothesis h(x) in some hypothesis class H
H is parameterized by a d-dimensional parameter vector θ.
Goal: find the best hypothesis θ* within H.
What does best mean? It optimizes an objective function:
J(θ): error function; L(pred, y): loss function.
A learning algorithm is the procedure for optimizing J(θ).
Recall: Linear Regression
Hypothesis class for linear regression: h_θ(x) = θ^T x = θ_0 + θ_1 x_1 + ... + θ_p x_p
Logistic Regression
Hypothesis class for linear regression: h_θ(x) = θ^T x
Instead of predicting a real value, predict the probability that x belongs to some class:
h_θ(x) = P(y = 1 | x; θ) = g(θ^T x) = 1 / (1 + e^(-θ^T x))
The semicolon distinguishes between random variables that are being conditioned on (x) and parameters (θ), which are not considered random variables.
In the binary case, P(y = 0 | x; θ) = 1 - h_θ(x)
d = dimension of the parameter vector θ; p = dimension of the feature space x.
For logistic regression, d = p + 1 (same as before).
g = sigmoid function (also known as the logistic function).
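As a minimal NumPy sketch of this hypothesis (the names sigmoid and predict_prob are mine, not the course's):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(theta, x):
    """P(y = 1 | x; theta) for a single feature vector x (with x[0] = 1 as the bias feature)."""
    return sigmoid(theta @ x)
```

Note that g squashes any real score θ^T x into (0, 1), which is what lets us read the output as a probability.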
Logistic Regression
Hypothesis class: predict the probability that x belongs to some class:
h_θ(x) = P(y = 1 | x; θ) = 1 / (1 + e^(-θ^T x))
The hypothesis class is the set of linear classifiers.
[Figure: 2-D data in the (x1, x2) plane separated by a linear decision boundary]
Logistic Regression
Alternative interpretation of the hypothesis class: the log-odds ratio is a linear function of the features:
log [ P(y = 1 | x; θ) / P(y = 0 | x; θ) ] = θ^T x
This is a log-linear model.
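As a short check (a standard derivation, using only the sigmoid hypothesis above):

```latex
\[
\frac{P(y=1 \mid x;\theta)}{P(y=0 \mid x;\theta)}
  = \frac{h_\theta(x)}{1 - h_\theta(x)}
  = \frac{1/(1+e^{-\theta^\top x})}{e^{-\theta^\top x}/(1+e^{-\theta^\top x})}
  = e^{\theta^\top x}
\;\;\Longrightarrow\;\;
\log\frac{P(y=1 \mid x;\theta)}{P(y=0 \mid x;\theta)} = \theta^\top x
\]
```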
Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
J(θ) = (1/n) Σ_{i=1..n} ( h_θ(x^(i)) - y^(i) )^2   (squared-error)
Should we use the same J(θ) for logistic regression?
For linear regression, why did we use this particular J(θ)? See Ex5 Q4 for a different error function used to derive ridge regression, a variant of linear regression.
Why the squared-error loss L(h, y)? See Ex5 Q5 for a derivation of the squared-error loss and J(θ) using the maximum-likelihood principle, assuming that the labels have Gaussian noise.
Apply Maximum-Likelihood to Logistic Regression
A learning algorithm is the procedure for optimizing J(θ).
Should we use the same J(θ) for logistic regression? No. We will apply the maximum-likelihood principle.
Will show: maximum likelihood is equivalent to minimizing J(θ) = (1/n) Σ_i L(h_θ(x^(i)), y^(i))
with the cross-entropy loss (logistic loss, log loss):
L(h_θ(x^(i)), y^(i)) = -[ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]
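A minimal NumPy sketch of that per-example loss (the eps clipping is my own guard against log(0), not something from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, x, y, eps=1e-12):
    """Cross-entropy (logistic / log) loss for one example; y is 0 or 1."""
    p = sigmoid(theta @ x)            # predicted P(y = 1 | x; theta)
    p = np.clip(p, eps, 1.0 - eps)    # keep the logs finite
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```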
Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin.
Probabilistic model: the coin comes up H with probability q (and T with probability 1 - q); q is an unknown parameter.
Estimation: we want to infer the value of q.
Suppose we have observed H heads and T tails. q = ?
Intuitively, q = H / (H + T). Why? Can we derive this formally?
Maximum-likelihood principle: estimate the parameter(s) such that the observations are most likely under the defined model class.
(What value of q leads to the highest probability of observing H heads and T tails?)
Maximum-Likelihood Estimation
Maximum-likelihood principle: estimate the parameter(s) such that the observations are most likely under the defined model class.
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ): conventionally written as a function of θ. For the biased coin (θ: P(H) = q):
L(q) = q^H (1 - q)^T
Calculus review: product rule of differentiation.
To find the maximizing value of q, set the derivative to 0 and solve:
d/dq [ q^H (1 - q)^T ] = H q^(H-1) (1 - q)^T - T q^H (1 - q)^(T-1) = 0
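Equivalently, working in log space (a standard shortcut, since log L(q) has the same maximizer):

```latex
\[
\ell(q) = \log L(q) = H \log q + T \log(1-q),
\qquad
\frac{d\ell}{dq} = \frac{H}{q} - \frac{T}{1-q} = 0
\;\Longrightarrow\;
H(1-q) = Tq
\;\Longrightarrow\;
q = \frac{H}{H+T}
\]
```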
Apply Maximum-Likelihood to Logistic Regression
Maximum-likelihood principle: estimate the parameter(s) such that the observations are most likely under the defined model class.
Likelihood function L(θ): conventionally a function of θ. For the biased coin (P(H) = q), maximizing the likelihood leads us to infer q = H / (H + T).
Logistic regression: Example: MNIST (0 vs. 1)
[Figure: sample MNIST images of handwritten digits]
Logistic regression: Example
Features? Bias term only (θ_0).
Learning curve: shows the progress of learning. Steep is good!
Why is the line not at 50%?
In the test set: 0: 980 samples, 1: 1135 samples.
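A quick check of that baseline, using the test-set counts above: always predicting the majority class ("1") already beats 50%.

```python
n_zeros, n_ones = 980, 1135
print(n_ones / (n_zeros + n_ones))   # ~0.537, the accuracy of always predicting "1"
```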
Logistic regression: Example
Features? 0: Bias term; 1: Sum of pixel values.
Even worse!
Logistic regression: Example
Features? 0: Bias term; 1: Mean of pixel values.
Mean = Sum / (28*28) (digit images are 28x28, grayscale).
Theoretically, this should not make a difference!
In practice, it does. It is useful to normalize the data (mean 0, variance 1).
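A sketch of that normalization, assuming the features sit in NumPy arrays (the function name is mine); normalization is usually applied to the raw features before a bias column is appended, and the statistics come from the training set only:

```python
import numpy as np

def standardize(X_train, X_test):
    """Rescale each feature to mean 0 and variance 1, using training-set statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0           # guard against division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```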
Logistic regression: Example
Features? 0: Bias term; 1: Mean of pixel values.
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
Learned weights: θ_0 = 0.08538, θ_1 = -0.7640
The model predicts P(y = 1 | x; θ) > 0.5 when θ_0 + θ_1 x > 0, i.e., when x < 0.08538 / 0.7640 ≈ 0.1118.
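A quick numeric check of that threshold, using the learned weights above:

```python
theta0, theta1 = 0.08538, -0.7640
threshold = -theta0 / theta1      # solve theta0 + theta1 * x = 0
print(threshold)                  # ~0.1118: predict "1" when the mean pixel value is below this
```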
Conclusion: classifying 0 vs. 1 is quite easy...

Logistic regression: Example
Features? 0: Bias term; 1: Mean of pixels; 2 to 29: Row means; 30 to 57: Column means.
97.26% accuracy. Which features are useful?
Do an ablation analysis: remove features and compare performance (one ablated variant reaches 93.85% accuracy).
When the row/column means are present, the pixel mean is not useful, but the bias term is useful.

Logistic regression: Example
Features? 0: Bias term; 1: Mean of pixels; 2 to 29: Row means; 30 to 57: Column means; 58 to 841: Individual pixel values.
99.91% test accuracy! (2 false positives: true 0, predicted 1)
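A sketch of how such a feature matrix could be assembled, assuming the digits arrive as an (n, 28, 28) NumPy array (the function name is mine):

```python
import numpy as np

def build_features(images):
    """images: (n, 28, 28) grayscale array -> (n, 842) feature matrix.

    Columns: 0 = bias, 1 = overall pixel mean, 2-29 = row means,
    30-57 = column means, 58-841 = individual pixel values.
    """
    n = images.shape[0]
    bias = np.ones((n, 1))
    overall_mean = images.reshape(n, -1).mean(axis=1, keepdims=True)
    row_means = images.mean(axis=2)   # (n, 28): average over columns within each row
    col_means = images.mean(axis=1)   # (n, 28): average over rows within each column
    pixels = images.reshape(n, -1)    # (n, 784)
    return np.hstack([bias, overall_mean, row_means, col_means, pixels])
```

An ablation analysis then just drops the corresponding column ranges before retraining.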
This is just memorizing pixel locations... What happens if we perturb the dataset with row/column shifts?
... continued in next lecture

Logistic Regression
Another look at the full algorithm...
Hypothesis class: predict the probability that x belongs to some class:
h_θ(x) = P(y = 1 | x; θ) = 1 / (1 + e^(-θ^T x))
This is a probabilistic model of the data! Use the maximum-likelihood principle to estimate θ.
Likelihood function for supervised learning:
L(θ) = P(y^(1), ..., y^(n) | x^(1), ..., x^(n); θ)
Common assumption: data samples are independent and identically distributed (IID) given θ, so
L(θ) = Π_{i=1..n} P(y^(i) | x^(i); θ)
Easier to handle in log space: the log-likelihood function l(θ). Since log is monotonic, l(θ) and L(θ) have the same maximizer.
Find the θ that maximizes the log-likelihood function:
max_θ l(θ) = max_θ Σ_{i=1..n} log P(y^(i) | x^(i); θ)
Probabilistic model in logistic regression:
P(y = 1 | x; θ) = h_θ(x),  P(y = 0 | x; θ) = 1 - h_θ(x)
Simplify with a trick: P(y | x; θ) = h_θ(x)^y (1 - h_θ(x))^(1-y)
Substituting into the log-likelihood:
max_θ Σ_{i=1..n} [ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]
Recall that in supervised learning, we minimize J(θ) = (1/n) Σ_i L(h_θ(x^(i)), y^(i)). The highlighted terms above are the (negative) loss function!
L(h_θ(x^(i)), y^(i)) = -[ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]
Names: cross-entropy loss, logistic loss, log loss.
Instead of defining the loss / error function arbitrarily, we derived it using the maximum-likelihood principle.
To solve this optimization problem, find the gradient and do (stochastic) gradient ascent (max vs. min):
∂l(θ) / ∂θ_j = Σ_{i=1..n} ( y^(i) - h_θ(x^(i)) ) x_j^(i)
(For the bias term θ_0, this is equivalent if we set x_0^(i) = 1.)
This has a similar form to the linear regression gradient! Both are generalized linear models (GLMs).
Can similarly extend to stochastic / minibatch versions. With linear algebra: iteratively reweighted least squares.
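A minimal NumPy sketch of batch gradient ascent on the log-likelihood (assuming X is (n, d) with a leading column of ones and y is a 0/1 vector; the step size and iteration count are illustrative, not the course's settings):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """Batch gradient ascent: theta <- theta + alpha * X^T (y - h_theta(X)) / n."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        preds = sigmoid(X @ theta)        # P(y = 1 | x^(i); theta) for every sample
        gradient = X.T @ (y - preds)      # gradient of the log-likelihood
        theta += alpha * gradient / n     # average the gradient over the samples
    return theta
```

Replacing the full sum with a single random sample (or a small batch) per update gives the stochastic / minibatch variants mentioned above.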
The remainder of this lecture is not covered on the exam.

Logistic regression: Example: MNIST (0 vs. 1), continued
With individual pixel values as features, the model is just memorizing pixel locations... What happens if we perturb the dataset with row/column shifts?
We could try to match image patches... (e.g., lines, curves).
Hack: try to correct the perturbation with some processing. Compute the center of mass of the perturbed image, and shift it back to the center.
Result: 3 false positives (99.86% accuracy): the previous two, and a new one.

Logistic regression: Example: MNIST
Multi-class: 10-class classification.
Binary logistic regression: h_θ(x) = P(y = 1 | x; θ) = 1 / (1 + e^(-θ^T x))
Multi-class logistic regression: P(y = k | x; θ) = exp(θ_k^T x) / Σ_{j=1..K} exp(θ_j^T x)
There are now K parameter vectors, one vector per class. Number of parameters d = K * (p + 1).
Features: 0: Bias term; 1 to 784: Individual pixel values.
87.99% accuracy. What accuracy does predicting at random achieve?
Visualize the error with a confusion matrix: row = true class, column = predicted class (on-diagonal = correct). If most predictions are correct, only the off-diagonal entries are worth looking at.
Maybe not enough data? The graph only uses 3200 of the 60000 samples available in the training set.
N = 180000 samples (using each sample 3 times).
Sample ordering: go through the training set in order, round-robin through the classes... but there are 6742 samples in class 1 (out of 60000), and all classes except 1 and 7 have fewer than 6000 samples.
Smaller example: class 0: 1760 samples; classes 1-9: 160 samples each. Train round-robin through classes 0-9:
0123456789 0123456789 ... 0000000000 (row 161) ...
Modified order: 0000000001 0000000002 ...
Even in the balanced case, data splicing (training round-robin through the classes) is important.
Order: 320 class 0 samples, 320 class 1 samples, ..., 320 class 9 samples.
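A sketch of round-robin splicing (interleave one sample from each class in turn; names are mine):

```python
from itertools import zip_longest

def round_robin(samples_by_class):
    """samples_by_class: dict mapping class label -> list of samples.

    Yields samples in the order class 0, 1, ..., K-1, 0, 1, ... until all lists run out."""
    lists = [samples_by_class[c] for c in sorted(samples_by_class)]
    for group in zip_longest(*lists):
        for sample in group:
            if sample is not None:
                yield sample
```

With the imbalanced counts above, this eventually degenerates into long runs of the majority class (the all-zeros rows), which is exactly the failure the slides illustrate.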
Bag of tricks
Normalize features (whiten: mean 0, variance 1).
Numerical issues (exp, log): use the log-sum-exp trick.
Balanced data (do not have too much of one class).
Splice data (alternate between classes during training).
Data augmentation (increase variation in the training data).
One-in-ten rule: at least a 10:1 sample-to-parameter ratio.
Use the same random seed (for debugging).
If results are too good to be true, be very skeptical. Did you use the test set for training?
Visualize! (Learning curves, feature weights, errors, etc.)

Role of hyperparameters
Choose hyperparameters with cross-validation.

Logistic regression: Example
N = 320000 samples (using each sample 5-6 times).
With appropriate data balancing and splicing: 91.93% test accuracy (vs. 87.99% for N = 3200).