CS 4610/5335 Logistic Regression
Robert Platt, Northeastern University
Material adapted from:
1. Lawson Wong, CS 5100

Use features (x) to predict targets (y)
Classification
Targets y are now either:
– Binary: {0, 1}
– Multi-class: {1, 2, …, K}
We will focus on binary case (Ex5 Q6 covers multi-class)
2

Classification
Focus: Supervised learning (e.g., regression, classification)
Use features (x) to predict targets (y)
Input: Dataset of n samples: {x(i), y(i)}, i = 1, …, n
Each x(i) is a p-dimensional vector of feature values
Output: Hypothesis hθ(x) in some hypothesis class H
H is parameterized by d-dim. parameter vector θ
Goal: Find the best hypothesis θ* within H
What does “best” mean? Optimizes objective function:
J(θ): Error fn. L(pred, y): Loss fn.
A learning algorithm is the procedure for optimizing J(θ)
3

Recall: Linear Regression
Hypothesis class for linear regression:
4

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
5

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
Semicolon distinguishes between
random variables that are being conditioned on (x) and parameters (θ) (not considered random variables)
6

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
In the binary case, P(y=0 | x; θ) = 1 – hθ(x)
d = dimension of parameter vector θ
p = dimension of feature space x
For logistic regression, d = p + 1 (same as before)
7

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
σ = sigmoid function (also known as logistic function)
8
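
A minimal sketch of this hypothesis class (assuming the usual convention that the feature vector is prepended with a constant 1 so θ0 acts as the bias term; the numbers are made up):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = sigmoid(theta^T x): predicted probability that y = 1."""
    return sigmoid(np.dot(theta, x))

# Example: 2 features plus a constant x0 = 1 so that theta[0] acts as the bias
theta = np.array([-1.0, 2.0, 0.5])   # made-up parameter values
x = np.array([1.0, 0.4, 1.2])        # feature vector with a leading 1
print(h(theta, x))                   # P(y=1 | x; theta); P(y=0 | x; theta) = 1 - h
```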

Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
The hypothesis class is the set of linear classifiers
(Figure: two example datasets in the x1–x2 plane, each split by a linear decision boundary)
9

Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
The hypothesis class is the set of linear classifiers
(Figure: two example datasets in the x1–x2 plane, each split by a linear decision boundary)
10

Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
11

Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
12

Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
Log-linear model
13
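
A quick numerical illustration of the log-linear view (made-up values for θ and x): taking the log-odds of the predicted probability recovers the linear score θᵀx.

```python
import numpy as np

theta = np.array([0.5, -1.2, 2.0])   # made-up parameters (bias first)
x = np.array([1.0, 0.3, 0.8])        # feature vector with x0 = 1 for the bias

z = theta @ x                        # linear score theta^T x
h = 1.0 / (1.0 + np.exp(-z))         # h_theta(x) = P(y=1 | x; theta)

log_odds = np.log(h / (1.0 - h))     # log-odds ratio log[ P(y=1) / P(y=0) ]
print(z, log_odds)                   # equal (up to floating-point error)
```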

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
14

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
15

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
For linear regression, why did we use this particular J(θ)?
See Ex5 Q4 for a different error function to derive ridge regression, a variant of linear regression
16

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
For linear regression, why did we use this particular J(θ)?
See Ex5 Q4 for a different error function to derive ridge regression, a variant of linear regression
Why squared-error loss L(h, y)?
See Ex5 Q5 for a derivation of the squared-error loss and J(θ) using the maximum-likelihood principle, assuming that labels have Gaussian noise
17

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
18

Apply Maximum-Likelihood to Logistic Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression? No. We will apply the maximum-likelihood principle.
19

Apply Maximum-Likelihood to Logistic Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression? No. We will apply the maximum-likelihood principle.
Will show: Max likelihood is equivalent to minimizing J(θ):
with the cross-entropy loss (logistic loss, log loss):
L(hθ(x(i)), y(i)) = −[ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]
20
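
A minimal sketch of this loss for a single example (h is the predicted probability hθ(x(i)), y is the 0/1 label; the clipping is only there to avoid log(0) numerically):

```python
import numpy as np

def cross_entropy_loss(h, y, eps=1e-12):
    """L(h, y) = -[ y*log(h) + (1-y)*log(1-h) ] (cross-entropy / logistic / log loss)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(cross_entropy_loss(0.9, 1))  # small loss: confident and correct
print(cross_entropy_loss(0.9, 0))  # large loss: confident and wrong
```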

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
21

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = ?
22

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H/(H+T)?
Why? Can we derive this formally?
23

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H/(H+T)?
Why? Can we derive this formally?
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
24

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H/(H+T)?
Why? Can we derive this formally?
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
25

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
26

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
27

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
For the biased coin (θ: P(H) = q): L(q) = q^H (1 − q)^T
28

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
29

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
Calculus review: Product rule of differentiation
30

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
To find maximizing value of q, set derivative to 0 and solve:
31
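
As a concrete check of this derivation, a minimal numerical sketch (the counts H and T are made-up values): the log-likelihood H log q + T log(1 − q) is maximized at q = H/(H+T).

```python
import numpy as np

H, T = 7, 3  # example counts of heads and tails (made up)

q = np.linspace(0.001, 0.999, 9999)
log_likelihood = H * np.log(q) + T * np.log(1.0 - q)  # log of q^H (1-q)^T

q_hat = q[np.argmax(log_likelihood)]
print(q_hat)        # ~= 0.7
print(H / (H + T))  # closed-form maximizer: 0.7
```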

Apply Maximum-Likelihood to Logistic Regression
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
For the biased coin (P(H) = q): L(q) = q^H (1 − q)^T
Maximizing likelihood leads us to infer q = H/(H+T)
32

Logistic regression: Example: MNIST
33

Logistic regression: Example: MNIST (0 vs. 1)
34

Features?
Logistic regression: Example
35

Features?
Logistic regression: Example
Bias-term only (θ0)
36

Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
37

Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
Why is line
not at 50%?
38

Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
Why is line
not at 50%?
In test set:
0: 980 samples
1: 1135 samples
39

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Sum of pixel values
40

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Sum of pixel values
Even worse!
41

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28) (digit images are
28*28, grayscale)
42

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28) (digit images are
28*28, grayscale)
Theoretically, should not make a difference!
43

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28)
Theoretically, should not make a difference!
In practice, it does. Useful to normalize data
(0 ≤ mean ≤ 1)
44

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
45

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
46

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
47

Features?
Logistic regression: Example
θ0: Bias-term (learned weight: θ0 = 0.08538)
θ1: Mean of pixel values (learned weight: θ1 = -0.7640)
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
48

Features?
Logistic regression: Example
θ0: Bias-term (learned weight: θ0 = 0.08538)
θ1: Mean of pixel values (learned weight: θ1 = -0.7640)
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
Model predicts P(y=1 | x; θ) > 0.5 when θ0 + θ1x > 0
→ when x < 0.08538 / 0.7640 ≈ 0.1118
Conclusion: Classifying 0 and 1 is quite easy…
49–50

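As a quick check of the numbers above, a minimal sketch (the single feature x is the image's mean pixel value, the weights are the ones quoted on the slide, and the helper names are made up):

```python
import numpy as np

theta0 = 0.08538   # learned bias term (from the slide)
theta1 = -0.7640   # learned weight on the mean pixel value (from the slide)

def predict_prob_one(mean_pixel):
    """P(y=1 | x; theta) for the single feature x = mean pixel value."""
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * mean_pixel)))

# Decision boundary: predict y=1 when theta0 + theta1*x > 0, i.e. x < -theta0/theta1
print(-theta0 / theta1)              # ~= 0.1118
print(predict_prob_one(0.05) > 0.5)  # True:  little ink on average -> predict "1"
print(predict_prob_one(0.20) > 0.5)  # False: more ink on average   -> predict "0"
```
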
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
51–53

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
97.26% accuracy
Which features are useful?
Do ablation analysis:
Remove features and compare performance
54–56

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
93.85% accuracy
Which features are useful?
Do ablation analysis:
Remove features and compare performance
57

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
93.85% accuracy
When row/col sums are present:
– pixel mean not useful
– bias term is useful
58

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
59–60

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
99.91% test accuracy!
(2 false positives: true 0, pred 1)
61–63

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
This is just memorizing pixel locations…
If we perturb dataset with row/col shifts:
… continued in next lecture
64–67

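The exact feature-extraction code is not shown on the slides, so here is a minimal sketch of one plausible construction matching the indices above (28×28 grayscale images scaled to [0, 1]; names are made up):

```python
import numpy as np

def mnist_features(image):
    """Feature vector matching the slide's indices:
    x[0]=1 (bias, theta0), x[1]=mean of pixels (theta1),
    x[2:30]=row means (theta2..theta29), x[30:58]=col means (theta30..theta57),
    x[58:842]=individual pixel values (theta58..theta841)."""
    img = image.reshape(28, 28)
    return np.concatenate((
        [1.0],             # bias
        [img.mean()],      # overall mean
        img.mean(axis=1),  # 28 row means
        img.mean(axis=0),  # 28 column means
        img.ravel(),       # 784 raw pixels
    ))

# Ablation analysis: drop one feature group, retrain, and compare accuracy
feature_groups = {
    "bias": slice(0, 1),
    "pixel_mean": slice(1, 2),
    "row_means": slice(2, 30),
    "col_means": slice(30, 58),
    "pixels": slice(58, 842),
}
```
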
Logistic Regression
Another look at the full algorithm…
68

Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
This is a probabilistic model of the data!
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
69–70

Logistic Regression
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
Common assumption: Data samples are
independent and identically distributed (IID), given θ
Easier to handle in log space: log-likelihood function l(θ):
Since log is monotonic, l(θ) and L(θ) have same maximizer
71–73

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Probabilistic model in logistic regression:
Simplify with a trick:
74–77

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
The highlighted terms form the (negative) loss function!
L(hθ(x(i)), y(i)) = −[ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]
Names: Cross-entropy loss, logistic loss, log loss
Instead of defining the loss / error function arbitrarily,
we derived it using the maximum-likelihood principle
78–82

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
To solve this optimization problem, find the gradient,
and do (stochastic) gradient ascent (max vs. min)
83

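A minimal sketch of that procedure (not the lecture's exact code): batch gradient ascent on the log-likelihood, assuming X is an n×(p+1) design matrix whose first column is all ones and y is a vector of 0/1 labels; the step size and iteration count are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    """Batch gradient ascent; the gradient of l(theta) is X^T (y - h)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * (X.T @ (y - h))
    return theta
```
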
Logistic Regression
See next slide
84

Logistic Regression
85

Logistic Regression
Recall:
Hence:
86

Logistic Regression
See prev slide
87

Logistic Regression
Bias term; equivalent if x0(i) = 1
See prev slide
88

Logistic Regression
Similar form as linear regression gradient!
Both are generalized linear models (GLMs)
89

Logistic Regression
90

Logistic Regression
Can similarly extend to stochastic / minibatch versions
With linear algebra: Iteratively reweighted least squares
91

Logistic Regression
The remainder of this lecture is not covered on the exam.
92

Logistic regression: Example: MNIST
93

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
99.91% test accuracy!
(2 false positives: true 0, pred 1)
94–96

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
This is just memorizing pixel locations…
If we perturb dataset with row/col shifts:
97–99

Logistic regression: Example: MNIST (0 vs. 1)
We could try to match image patches… (e.g., lines, curves)
100

Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image,
and shift it back to the center
101–102

Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image,
and shift it back to the center
3 false positives. (99.86% accuracy)
Previous two, and a new one:
103

Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image,
and shift it back to the center
104–105

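A minimal sketch of the re-centering hack described above (28×28 grayscale images; np.roll wraps pixels around the border, which is a simplification):

```python
import numpy as np

def recenter(image):
    """Shift a 28x28 image so its center of mass moves to the image center."""
    img = image.reshape(28, 28)
    total = img.sum()
    if total == 0:
        return img
    rows, cols = np.indices(img.shape)
    r_cm = (rows * img).sum() / total   # row coordinate of the center of mass
    c_cm = (cols * img).sum() / total   # column coordinate of the center of mass
    shift_r = int(round(13.5 - r_cm))   # 13.5 = center of a 28-pixel axis
    shift_c = int(round(13.5 - c_cm))
    return np.roll(img, (shift_r, shift_c), axis=(0, 1))
```
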
Logistic regression: Example: MNIST
Multi-class: 10-class classification
106

Logistic regression: Example
Binary logistic regression:
107

Logistic regression: Example
Binary logistic regression:
Multi-class logistic regression:
There are now K parameter vectors
– one vector per class
– Number of parameters d = K * (p+1)
108

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
109–110

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
87.99% accuracy
What accuracy does predicting at random achieve?
111

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
112

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
Visualize error: Confusion matrix
Row = true class
Col = predicted class (on diagonal = correct)
– if most correct, only looking at off-diagonal is useful
113–114

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
87.99% accuracy
Maybe not enough data? Graph only uses
3200 of 60000 samples available in training set
115

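A minimal sketch of the multi-class model described above (one common way to organize it: stack the K parameter vectors into a K×(p+1) matrix and take a softmax over the per-class scores; names are made up):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_multiclass(Theta, x):
    """Theta: K x (p+1) matrix, one parameter vector per class.
    x: feature vector with a leading 1 for the bias term."""
    probs = softmax(Theta @ x)   # P(y=k | x; Theta) for k = 0..K-1
    return int(np.argmax(probs)), probs
```
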
Logistic regression: Example
N = 180000 samples – using each sample 3 times
116–118

Logistic regression: Example
N = 180000 samples – using each sample 3 times
Sample ordering: Going through training set in order,
round-robin through the classes
119

Logistic regression: Example
N = 180000 samples – using each sample 3 times
Sample ordering: Going through training set in order,
round-robin through the classes
… but there are 6742 samples in class 1 (out of 60000)
– all classes except 1 and 7 have < 6000 samples
120

Logistic regression: Example
Smaller example:
Class 0: 1760 samples
Classes 1-9: 160 samples
Train round-robin classes 0-9
0123456789 0123456789 … 0000000000 (row 161) …
121

Logistic regression: Example
Smaller example:
Class 0: 1760 samples
Classes 1-9: 160 samples
Train round-robin classes 0-9
Modified order: 0000000001 0000000002 …
122–123

Logistic regression: Example
Even in balanced case, “data splicing”
(training round-robin through classes) is important
Order:
320 class 0 samples
320 class 1 samples
…
320 class 9 samples
124–125

Bag of tricks
Normalize features (whiten – mean 0, variance 1)
Numerical issues (exp, log) – log-sum-exp trick
Balanced data (do not have too much of one class)
Splicing data (alternate between classes during training)
Data augmentation (increase variation in training data)
“One in ten rule” – at least 10:1 sample:parameter ratio
Use the same random seed (for debugging)
If results are too good to be true, be very skeptical
– Did you use the test set for training?
Visualize! (Learning curves, feature weights, errors, etc.)
126

Role of hyperparameters
Choose with cross-validation
127–128

Logistic regression: Example
N = 320000 samples – using each sample 5-6 times
With appropriate data balancing and splicing:
91.93% test accuracy (vs. 87.99% for N = 3200)
129

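One of the numerical tricks listed on the “Bag of tricks” slide is log-sum-exp; a minimal sketch of why it matters (subtracting the max before exponentiating avoids overflow without changing the result):

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log(sum(exp(z)))."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(z))))   # naive version overflows to inf
print(logsumexp(z))                # ~= 1002.41
```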
