Sponsored Search Auction Design Via Machine Learning

Boosting Approach to ML
Maria-Florina Balcan
03/18/2015
Perceptron, Margins, Kernels

Recap from last time: Boosting
Works by creating a series of challenge datasets s.t. even modest performance on these can be used to produce an overall high-accuracy predictor.
Works amazingly well in practice.
Adaboost is one of the top 10 ML algorithms.
General method for improving the accuracy of any given learning algorithm.
Backed up by solid foundations.

Add a slide with Adaboost!

Adaboost (Adaptive Boosting)

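Input: $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$; $x_i \in X$, $y_i \in Y = \{-1, 1\}$
weak learning algo A (e.g., Naive Bayes, decision stumps)

For t = 1, 2, ..., T
Construct $D_t$ on $\{x_1, \dots, x_m\}$
Run A on $D_t$ producing $h_t : X \to \{-1, 1\}$
Output $H_{final}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

$D_1$ uniform on $\{x_1, \dots, x_m\}$ [i.e., $D_1(i) = 1/m$]
Given $D_t$ and $h_t$ set
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t}$ if $y_i = h_t(x_i)$
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{\alpha_t}$ if $y_i \ne h_t(x_i)$
[i.e., $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)}$]
$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0$, where $\epsilon_t$ is the error of $h_t$ over $D_t$ and $Z_t$ normalizes $D_{t+1}$ to sum to 1.
$D_{t+1}$ puts half of weight on examples $x_i$ where $h_t$ is incorrect & half on examples where $h_t$ is correct.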
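The reweighting loop can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: it assumes 1-D inputs, uses threshold decision stumps as the weak learner A, and the names `stump_predict`, `best_stump`, `adaboost`, and `predict` as well as the toy data are hypothetical.

```python
import math

def stump_predict(stump, x):
    """Threshold stump: predicts `sign` when x >= threshold, else -sign."""
    threshold, sign = stump
    return sign if x >= threshold else -sign

def best_stump(xs, ys, weights):
    """Weak learner A (assumed): stump with the smallest weighted error under D_t."""
    best, best_err = None, float("inf")
    for threshold in sorted(set(xs)) + [max(xs) + 1.0]:
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((threshold, sign), x) != y)
            if err < best_err:
                best, best_err = (threshold, sign), err
    return best, best_err

def adaboost(xs, ys, rounds):
    m = len(xs)
    weights = [1.0 / m] * m                      # D_1 is uniform
    ensemble = []                                # list of (alpha_t, h_t)
    for _ in range(rounds):
        stump, eps = best_stump(xs, ys, weights)
        eps = min(max(eps, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # D_{t+1}(i) is proportional to D_t(i) * exp(-alpha_t * y_i * h_t(x_i))
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)                         # Z_t
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    """H_final(x): sign of the alpha-weighted vote."""
    vote = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if vote >= 0 else -1

# Toy data (illustrative): no single stump classifies it perfectly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost(xs, ys, rounds=3)
train_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

On this toy set each stump alone errs on at least two of the six points, while the weighted vote of three stumps fits all six.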


Nice Features of Adaboost
Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naive Bayes, decision stumps)!!!
Very fast (single pass through data each round) & simple to code, no parameters to tune.
Grounded in rich theory.

Analyzing Training Error

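Theorem: Let $\epsilon_t = 1/2 - \gamma_t$ (error of $h_t$ over $D_t$). Then
$\mathrm{err}_S(H_{final}) \le \exp\left(-2 \sum_t \gamma_t^2\right)$

So, if $\forall t$, $\gamma_t \ge \gamma > 0$, then $\mathrm{err}_S(H_{final}) \le \exp\left(-2 \gamma^2 T\right)$

The training error drops exponentially in T!!!
To get $\mathrm{err}_S(H_{final}) \le \epsilon$, need only $T = O\left(\frac{1}{\gamma^2} \log\frac{1}{\epsilon}\right)$ rounds

Adaboost is adaptive
Does not need to know $\gamma$ or T a priori
Can exploit $\gamma_t \gg \gamma$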
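As a quick numeric check of the bound $\exp(-2\gamma^2 T) \le \epsilon$, the sketch below solves for the smallest T. The function names and the values $\gamma = 0.1$, $\epsilon = 0.01$ are illustrative assumptions.

```python
import math

def rounds_needed(gamma, epsilon):
    """Smallest integer T with exp(-2 * gamma**2 * T) <= epsilon."""
    return math.ceil(math.log(1.0 / epsilon) / (2.0 * gamma ** 2))

def training_error_bound(gammas):
    """Upper bound exp(-2 * sum_t gamma_t^2) on the training error."""
    return math.exp(-2.0 * sum(g * g for g in gammas))

# Weak learners only 10% better than random guessing, target training error 1%.
T = rounds_needed(gamma=0.1, epsilon=0.01)
bound_at_T = training_error_bound([0.1] * T)
```

With an edge of only 0.1 over random guessing, a few hundred rounds already push the bound below 1%.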


Generalization Guarantees
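Theorem: $\mathrm{err}_S(H_{final}) \le \exp\left(-2 \sum_t \gamma_t^2\right)$, where $\epsilon_t = 1/2 - \gamma_t$

How about generalization guarantees?

Original analysis [Freund&Schapire97]
H space of weak hypotheses; d = VCdim(H)
$H_{final}$ is a weighted vote, so the hypothesis class is:
G = {all fns of the form $\mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$}

Theorem [Freund&Schapire97]
$\forall g \in G$, $\mathrm{err}(g) \le \mathrm{err}_S(g) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where T = # of rounds, d = VCdim(H), m = # of training examples
Key reason: $\mathrm{VCdim}(G) = \tilde{O}(dT)$ plus typical VC bounds.

[Figure: error vs. complexity (T = # of rounds), showing train error and generalization error.]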

Generalization Guarantees
Experiments showed that the test error of the generated classifier usually does not increase as its size becomes very large.
Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

These results seem to contradict the FS97 bound and Occam's razor (in order to achieve good test error, the classifier should be as simple as possible)!

How can we explain the experiments?
Key Idea:
Training error does not tell the whole story.
We need also to consider the classification confidence!!
R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee present a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods".


Boosting didn't seem to overfit(!) because it turned out to be increasing the margin of the classifier

[Figure: Error Curve (train error, test error, test error of base classifier (weak learner)) and Margin Distr. Graph. Plots from [SFBL98].]

Classification Margin
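H space of weak hypotheses. The convex hull of H:
$co(H) = \left\{ f = \sum_{t=1}^{T} \alpha_t h_t,\ \alpha_t \ge 0,\ \sum_{t=1}^{T} \alpha_t = 1,\ h_t \in H \right\}$
Let $f \in co(H)$, $f = \sum_{t=1}^{T} \alpha_t h_t$, $\alpha_t \ge 0$, $\sum_{t=1}^{T} \alpha_t = 1$.
The majority vote rule $H_f$ given by $f$ (given by $H_f(x) = \mathrm{sign}(f(x))$) predicts wrongly on example $(x, y)$ iff $y f(x) \le 0$.

Definition: margin of $H_f$ (or of $f$) on example $(x, y)$ to be $y f(x)$.
$y f(x) = y \sum_{t=1}^{T} \alpha_t h_t(x) = \sum_{t:\, y = h_t(x)} \alpha_t - \sum_{t:\, y \ne h_t(x)} \alpha_t$

The margin is positive iff $y = H_f(x)$.
See $|y f(x)| = |f(x)|$ as the strength or the confidence of the vote.

[Figure: margin axis from -1 (high confidence, incorrect) through 0 (low confidence) to 1 (high confidence, correct).]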
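The margin of a weighted vote is easy to compute directly. In this small sketch the helper name `vote_margin` and the vote weights are illustrative assumptions; the weights are normalized so that they sum to 1, as in the convex hull definition.

```python
def vote_margin(alphas, predictions, y):
    """Margin y * f(x), where f(x) = sum_t alpha_t * h_t(x) with normalized alphas."""
    total = sum(alphas)
    f = sum(a * p for a, p in zip(alphas, predictions)) / total
    return y * f

alphas = [0.5, 0.3, 0.2]   # vote weights (hypothetical)

# True label is +1 in all three cases; predictions are the h_t(x) values.
margin_unanimous = vote_margin(alphas, [+1, +1, +1], y=+1)  # every voter correct
margin_split = vote_margin(alphas, [+1, -1, +1], y=+1)      # 0.7 correct, 0.3 incorrect
margin_wrong = vote_margin(alphas, [-1, -1, -1], y=+1)      # every voter incorrect
```

The split case shows the identity from the definition: the margin is the weight of the correct voters minus the weight of the incorrect ones (0.7 - 0.3).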


Boosting and Margins
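Theorem: If VCdim(H) = d, then with prob. $\ge 1 - \delta$, $\forall f \in co(H)$, $\forall \theta > 0$,
$\Pr_D[y f(x) \le 0] \le \Pr_S[y f(x) \le \theta] + O\left(\frac{1}{\sqrt{m}} \sqrt{\frac{d \ln^2(m/d)}{\theta^2} + \ln\frac{1}{\delta}}\right)$

Note: bound does not depend on T (the # of rounds of boosting), depends only on the complexity of the weak hyp space and the margin!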


Boosting and Margins
Theorem: If VCdim(H) = d, then with prob. $\ge 1 - \delta$, $\forall f \in co(H)$, $\forall \theta > 0$,
$\Pr_D[y f(x) \le 0] \le \Pr_S[y f(x) \le \theta] + O\left(\frac{1}{\sqrt{m}} \sqrt{\frac{d \ln^2(m/d)}{\theta^2} + \ln\frac{1}{\delta}}\right)$

If all training examples have large margins, then we can approximate the final classifier by a much smaller classifier.
Can use this to prove that better margin implies smaller test error, regardless of the number of weak classifiers.
Can also prove that boosting tends to increase the margin of training examples by concentrating on those of smallest margin.
Although the final classifier is getting larger, margins are likely to be increasing, so the final classifier is actually getting closer to a simpler classifier, driving down test error.


Boosting, Adaboost Summary
Shift in mindset: goal is now just to find classifiers a bit better than random guessing.
Backed up by solid foundations.
Adaboost and its variations work well in practice with many kinds of data (one of the top 10 ML algos).
More about classic applications in Recitation.
Relevant for big data age: quickly focuses on core difficulties, so well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12].

Issues: noise. Weak learners

Interestingly, the usefulness of margin has been recognized in Machine Learning since the late 50's.
Perceptron [Rosenblatt'57] analyzed via geometric (aka $L_2, L_2$) margin.
Original guarantee in the online learning scenario.


The Perceptron Algorithm
Online Learning Model
Margin Analysis
Kernels


The Online Learning Model
Examples arrive sequentially.
We need to make a prediction.
Afterwards observe the outcome.

Online Algorithm
For i = 1, 2, ...:
Phase i:
Example $x_i$
Prediction $h(x_i)$
Observe $c^*(x_i)$

Mistake bound model
Analysis wise, make no distributional assumptions.
Goal: Minimize the number of mistakes.
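The predict-then-observe protocol can be sketched as a simple loop. Everything here is illustrative: `LastOutcomeLearner` is a hypothetical baseline (not an algorithm from the lecture) that predicts whatever outcome it saw last, and the stream is made up.

```python
class LastOutcomeLearner:
    """Hypothetical baseline: predicts the most recently observed outcome."""

    def __init__(self):
        self.last = +1          # arbitrary initial prediction

    def predict(self, x):
        return self.last

    def update(self, x, outcome):
        self.last = outcome

def run_online(learner, stream):
    """Phase i: receive x_i, predict, observe the true outcome, count mistakes."""
    mistakes = 0
    for x, outcome in stream:
        if learner.predict(x) != outcome:
            mistakes += 1
        learner.update(x, outcome)
    return mistakes

# Made-up stream of (example, outcome) pairs; no distributional assumption is needed.
stream = [(0, +1), (1, +1), (2, -1), (3, -1), (4, -1), (5, +1)]
total_mistakes = run_online(LastOutcomeLearner(), stream)
```

Any online learner that exposes `predict` and `update` can be dropped into the same loop, and its mistake count is the quantity the mistake bound model controls.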


The Online Learning Model. Motivation
Email classification (distribution of both spam and regular mail changes over time, but the target function stays fixed: last year's spam still looks like spam).
Ad placement in a new market.
Recommendation systems. Recommending movies, etc.
Predicting whether a user will be interested in a new news article or not.


Linear Separators
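[Figure: points labeled X and O in the plane, separated by a linear separator with normal vector w.]
Instance space $X = \mathbb{R}^d$
Hypothesis class of linear decision surfaces in $\mathbb{R}^d$.
$h(x) = w \cdot x + w_0$; if $h(x) \ge 0$, then label x as +, otherwise label it as -.
Claim: WLOG $w_0 = 0$.
Proof: Can simulate a non-zero threshold with a dummy input feature $x_0$ that is always set to 1.
$w = (w_1, \dots, w_d) \to \tilde{w} = (w_1, \dots, w_d, w_0)$
$x = (x_1, \dots, x_d) \to \tilde{x} = (x_1, \dots, x_d, 1)$
$w \cdot x + w_0 \ge 0$ iff $(w_1, \dots, w_d, w_0) \cdot (x_1, \dots, x_d, 1) \ge 0$
where $w = (w_1, \dots, w_d)$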
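The dummy-feature reduction can be checked numerically. In this sketch the separator, the threshold, and the test points are made-up values, and `augment` and `dot` are hypothetical helper names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def augment(x):
    """Append the constant dummy feature 1 to x."""
    return list(x) + [1.0]

w, w0 = [2.0, -1.0], -0.5      # separator with a non-zero threshold (illustrative)
w_tilde = w + [w0]             # homogeneous separator in one more dimension

points = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.0], [0.5, 0.5]]
with_threshold = [dot(w, x) + w0 >= 0 for x in points]
homogeneous = [dot(w_tilde, augment(x)) >= 0 for x in points]
```

Both lists of labels agree, including on the two points that lie exactly on the decision boundary.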


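Linear Separators: Perceptron Algorithm
Set t=1, start with the all zero vector $w_1$.
Given example $x$, predict positive iff $w_t \cdot x \ge 0$
On a mistake, update as follows:
Mistake on positive, then update $w_{t+1} \leftarrow w_t + x$
Mistake on negative, then update $w_{t+1} \leftarrow w_t - x$
Note:
$w_t$ is weighted sum of incorrectly classified examples
$w_t = a_{i_1} x_{i_1} + \dots + a_{i_k} x_{i_k}$
$w_t \cdot x = a_{i_1} x_{i_1} \cdot x + \dots + a_{i_k} x_{i_k} \cdot x$
Important when we talk about kernels.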
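A minimal sketch of the update rule in Python follows. The six labeled points are an illustrative dataset chosen for this sketch, and the function name `perceptron` is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(examples, passes=1):
    """Run the Perceptron over (x, label) pairs with labels in {+1, -1}.

    Returns the final weight vector and the number of mistakes made.
    """
    w = [0.0] * len(examples[0][0])       # start with the all-zero vector
    mistakes = 0
    for _ in range(passes):
        for x, label in examples:
            predicted = 1 if dot(w, x) >= 0 else -1
            if predicted != label:
                # mistake on positive: w <- w + x; mistake on negative: w <- w - x
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
    return w, mistakes

# Illustrative data: positives have first coordinate 1, negatives -1.
examples = [([-1.0, 2.0], -1), ([1.0, 0.0], +1), ([1.0, 1.0], +1),
            ([-1.0, 0.0], -1), ([-1.0, -2.0], -1), ([1.0, -1.0], +1)]
w, mistakes = perceptron(examples)
```

On this data one pass makes three mistakes and ends at w = (3, 1), which is exactly the signed sum of the three misclassified examples, as the note about weighted sums suggests.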


Perceptron Algorithm: Example
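Example:
[Figure: labeled example points (+ and -) with the resulting weight updates.]

Algorithm:
Set t=1, start with all-zeroes weight vector $w_1$.
Given example $x$, predict positive iff $w_t \cdot x \ge 0$
On a mistake, update as follows:
Mistake on positive, update $w_{t+1} \leftarrow w_t + x$
Mistake on negative, update $w_{t+1} \leftarrow w_t - x$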

Geometric Margin
Definition: The margin of example $x$ w.r.t. a linear sep. $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative if on wrong side)

[Figure: separator w with the margin of a positive example and the margin of a negative example.]

Definition: The margin $\gamma_w$ of a set of examples $S$ wrt a linear separator $w$ is the smallest margin over points $x \in S$.
Definition: The margin $\gamma$ of a set of examples $S$ is the maximum $\gamma_w$ over all linear separators $w$.

[Figure: positive and negative points on either side of a separator w, with the margin $\gamma$ marked on both sides.]
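These definitions translate directly into code. The sketch below assumes separators through the origin; the data and the helper names `example_margin` and `set_margin` are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def example_margin(w, x, label):
    """Signed distance from x to the plane w . x = 0 (negative if on the wrong side)."""
    return label * dot(w, x) / math.sqrt(dot(w, w))

def set_margin(w, examples):
    """Margin of a set of examples w.r.t. w: the smallest margin over its points."""
    return min(example_margin(w, x, label) for x, label in examples)

# Illustrative data: positives have first coordinate 1, negatives -1.
examples = [([-1.0, 2.0], -1), ([1.0, 0.0], +1), ([1.0, 1.0], +1),
            ([-1.0, 0.0], -1), ([-1.0, -2.0], -1), ([1.0, -1.0], +1)]

margin_a = set_margin([3.0, 1.0], examples)   # a consistent separator
margin_b = set_margin([1.0, 0.0], examples)   # a better separator for this data
wrong_side = example_margin([1.0, 0.0], [-1.0, 0.0], +1)
```

Both separators classify the data correctly, but the second has the larger margin. The margin of the data set is the best such value over all separators.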

Perceptron: Mistake Bound
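Theorem: If data has margin $\gamma$ and all points inside a ball of radius $R$, then Perceptron makes $\le (R/\gamma)^2$ mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; algo is invariant to scaling.)

[Figure: positive and negative points inside a ball of radius R, separated by the max-margin separator w* with margin $\gamma$.]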

Perceptron Algorithm: Analysis
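Theorem: If data has margin $\gamma$ and all points inside a ball of radius $R$, then Perceptron makes $\le (R/\gamma)^2$ mistakes.
Update rule:
Mistake on positive: $w_{t+1} \leftarrow w_t + x$
Mistake on negative: $w_{t+1} \leftarrow w_t - x$
Proof:
Idea: analyze $w_t \cdot w^*$ and $\|w_t\|$, where $w^*$ is the max-margin sep, $\|w^*\| = 1$.
Claim 1: $w_{t+1} \cdot w^* \ge w_t \cdot w^* + \gamma$.
(because $\ell(x)\,(x \cdot w^*) \ge \gamma$, where $\ell(x) \in \{-1, +1\}$ is the label of $x$)
Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + R^2$.
(by Pythagorean Theorem: $\|w_{t+1}\|^2 = \|w_t\|^2 + 2\,\ell(x)\,(w_t \cdot x) + \|x\|^2$, where the cross term is $\le 0$ on a mistake and $\|x\|^2 \le R^2$)

After $M$ mistakes:
$w_{M+1} \cdot w^* \ge \gamma M$ (by Claim 1)
$\|w_{M+1}\| \le R \sqrt{M}$ (by Claim 2)
$w_{M+1} \cdot w^* \le \|w_{M+1}\|$ (since $w^*$ is unit length)
So, $\gamma M \le R \sqrt{M}$, so $M \le (R/\gamma)^2$.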
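The bound can be checked empirically on a small separable data set. In this sketch the data, the unit-length separator `w_star`, and the helper `count_mistakes` are illustrative assumptions. The loop cycles through the data until a full pass is mistake-free, so it terminates only when the data is linearly separable.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def count_mistakes(examples):
    """Cycle through the data until a full pass makes no mistake; return total mistakes."""
    w = [0.0] * len(examples[0][0])
    mistakes = 0
    while True:
        clean_pass = True
        for x, label in examples:
            predicted = 1 if dot(w, x) >= 0 else -1
            if predicted != label:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            return mistakes

# Illustrative separable data.
examples = [([-1.0, 2.0], -1), ([1.0, 0.0], +1), ([1.0, 1.0], +1),
            ([-1.0, 0.0], -1), ([-1.0, -2.0], -1), ([1.0, -1.0], +1)]

w_star = [1.0, 0.0]                                   # unit-length separator for this data
gamma = min(label * dot(w_star, x) for x, label in examples)
radius = max(math.sqrt(dot(x, x)) for x, _ in examples)
bound = (radius / gamma) ** 2
mistakes = count_mistakes(examples)
```

Here the margin is 1 and the radius is $\sqrt{5}$, so the theorem allows at most 5 mistakes, and the run makes 3.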

Perceptron Extensions
Can use it to find a consistent separator (by cycling through the data).
One can convert the mistake bound guarantee into a distributional guarantee too (for the case where the $x_i$s come from a fixed distribution).
Can be adapted to the case where there is no perfect separator as long as the so-called hinge loss (i.e., the total distance needed to move the points to classify them correctly with large margin) is small.
Can be kernelized to handle non-linear decision boundaries!

Perceptron Discussion
Simple online algorithm for learning linear separators with a nice guarantee that depends only on the geometric (aka $L_2, L_2$) margin.
Simple, but very useful in applications like branch prediction; it also has interesting extensions to structured prediction.
It can be kernelized to handle non-linear decision boundaries (see next class)!
