
Support vector machines
Chenlei Leng

Outline
Decision boundaries
Maximum margin hyperplanes
Support vector machines
The kernel trick

Decision boundaries
The classification problem
Given a D-dimensional input vector x, assign it to one and only one of K discrete classes; i.e., partition the input space into K decision regions whose boundaries are called decision boundaries.

[Figure: two-class training data plotted in the (x1, x2) plane.]

1-NN
One option: a k-NN classifier with k = 1.
To find the decision boundary, compute the Voronoi tessellation:
library(deldir)
tess <- deldir(X[,1], X[,2])
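A brief sketch of how the tessellation might then be drawn (plot.deldir is provided by the deldir package; the argument choices here are illustrative):
plot(tess, wlines = "tess", wpoints = "real")  # draw the Voronoi tiles together with the data points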

[Figure: 1-NN decision boundary given by the Voronoi tessellation, plotted in the (x1, x2) plane.]

2-NN
Large region of ambiguity where the two nearest neighbours belong to opposite classes:
[Figure: 2-NN classification of the data in the (x1, x2) plane; a large region is ambiguous because the two nearest neighbours disagree.]
Support vector machines (SVMs)
Vapnik (1979) developed a theory of statistical learning.
Idea: use hyperplanes to cut up space.
Using the kernel trick, these can be used to construct nonlinear decision boundaries efficiently.
SVMs are a popular alternative to k-NN or logistic regression for classification problems.

Oriented hyperplanes in $\mathbb{R}^D$
Dot product:
$x \cdot y = \sum_{i=1}^{D} x_i y_i = \|x\| \, \|y\| \cos\theta.$
If $\beta_0 = 0$, the equation $\beta \cdot x = 0$ defines a hyperplane through the origin.
If $\beta_0 \neq 0$, the equation $\beta \cdot x + \beta_0 = 0$ defines a hyperplane:
it is $(D-1)$-dimensional,
at right angles to $\beta$,
passing through the point $-\beta_0 \beta / \|\beta\|^2$.

$\mathbb{R}^D$ is split into a $\{\beta \cdot x + \beta_0 > 0\}$ side and a $\{\beta \cdot x + \beta_0 < 0\}$ side. E.g. in $\mathbb{R}^2$, $x \cdot (1, 2)^T - 3 = 0$ defines the line $x_1 + 2x_2 - 3 = 0$.
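As a small illustrative sketch (the points below are made up), the side of the hyperplane on which a point lies is simply the sign of $\beta \cdot x + \beta_0$:
beta  <- c(1, 2); beta0 <- -3              # the line x1 + 2*x2 - 3 = 0 from the example
pts   <- rbind(c(0, 0), c(3, 1), c(1, 1))  # three illustrative test points
sign(pts %*% beta + beta0)                 # -1, 1, 0: one side, the other side, exactly on the line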

Two classes
We will cut up space by defining a discriminant function
$f : \mathbb{R}^D \to \{1, \ldots, K\}$, which maps each input to a predicted class label.
Definition (Linear discriminant function for K = 2)
A linear discriminant function for two classes is a discriminant function of the form
$y(x) = \beta \cdot x + \beta_0, \qquad f(x) = \begin{cases} 1 & \text{if } y(x) \ge 0, \\ 2 & \text{if } y(x) < 0. \end{cases}$
The sole decision boundary is $y(x) = \beta \cdot x + \beta_0 = 0$, the equation of a $(D-1)$-dimensional hyperplane.

K classes
How to deal with $K > 2$? Use voting?
One vs one: train $\binom{K}{2}$ classifiers.
One vs rest: train $K$ classifiers.
Each can lead to ambiguous regions of input space. One solution for one vs rest:
Definition (Linear discriminant function for K > 2)
Let $f : \mathbb{R}^D \to \{1, \ldots, K\}$ be given by
$y_k(x) = \beta_k \cdot x + \beta_{k0}, \quad k = 1, \ldots, K,$
$f(x) = k \ \text{ if } \ y_k(x) > y_j(x) \text{ for all } j \neq k,$
but each $y_k$ is trained on a different task and we don't know whether they are comparable, i.e. whether they scale similarly.
Henceforth we will simplify things by assuming $K = 2$ and writing the classes as $\{-1, +1\}$, so $f(x) = \operatorname{sign}(y(x))$.
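As a tiny illustrative sketch of the one-vs-rest rule above (Yscores is a made-up name for an N x K matrix of discriminant values y_k(x_n)), the predicted class is the column-wise argmax:
Yscores <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)  # illustrative scores y_k(x_n) for N = 5, K = 3
pred <- max.col(Yscores)                             # f(x_n) = argmax_k y_k(x_n), ties broken at random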

Linear separability

Definition (Linear separability)
A set of labelled data
$\{(x_n, y_n) \in \mathbb{R}^D \times \{-1, +1\} : n = 1, \ldots, N\}$
is said to be linearly separable if each $x_n$ can be correctly classified by a linear decision boundary, i.e. a hyperplane.
[Figure: a linearly separable two-class data set in the (x1, x2) plane.]

Linear separability
Training data is linearly separable if
$\exists\, \beta, \beta_0 : \ \forall n, \ \operatorname{sign}(\beta \cdot x_n + \beta_0) = y_n.$
Equivalently,
$\exists\, \beta, \beta_0 : \ \forall n, \ y_n(\beta \cdot x_n + \beta_0) > 0.$
Equivalently,
$\exists\, \beta, \beta_0 : \ \forall n, \ y_n(\beta \cdot x_n + \beta_0) \ge 1$ and $\|\beta\|$ is minimal.
Why $\|\beta\|$ minimal? It gives the maximum margin hyperplane. The inequality is tight for some $y_n = +1$ and some $y_n = -1$.

Maximal margin hyperplane
When the data are linearly separable, choose $\beta, \beta_0$ with
$\forall n, \ y_n(\beta \cdot x_n + \beta_0) \ge 1$ and $\|\beta\|$ minimal.
Let $d_+$, $d_-$ denote the minimum distances from the $+1$ and $-1$ labelled points to the decision plane $\beta \cdot x + \beta_0 = 0$, respectively.
Simple geometrical arguments give (the distance from $x$ to the plane is $|\beta \cdot x + \beta_0| / \|\beta\|$, and the tight points have $|\beta \cdot x_n + \beta_0| = 1$):
Margin $:= d_+ + d_- = \dfrac{2}{\|\beta\|}.$
N.B. Sometimes this is taken to be twice the margin.
The unique solution for $\beta$ and $\beta_0$ can be found by convex optimisation.
This maximum margin hyperplane is determined by the points lying exactly on the margins: these are called the support vectors.
Support vector machines (SVMs) generalize this idea to non-separable data.

Maximal margin hyperplane
library(e1071)
mmh <- svm(X, y, scale = FALSE, kernel = "linear", cost = 1e6)
alpha <- mmh$rho
beta <- t(mmh$coefs) %*% mmh$SV
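Building on the fit above, a hedged sketch of how the intercept and the margin $2/\|\beta\|$ could be read off (to the best of my knowledge, e1071 stores the negative intercept in rho):
beta0  <- -mmh$rho               # intercept beta_0 of the fitted hyperplane
margin <- 2 / sqrt(sum(beta^2))  # margin = 2 / ||beta||
margin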

[Figure: maximal margin hyperplane and margins fitted to the separable data in the (x1, x2) plane.]

Non-separable data
How to fit a hyperplane when the data are not separable? E.g. our original motivating example:

[Figure: the non-separable motivating data set in the (x1, x2) plane.]

Tolerance level
Introduce extra slack variables $\xi_n$ to allow for some misclassification of the training data.
Choose $\beta, \beta_0, \xi_1, \ldots, \xi_N$ to minimize $\tfrac{1}{2}\|\beta\|^2 + C \sum_n \xi_n$ subject to the following constraints:
1. $\forall n, \ \xi_n \ge 0$,
2. $\forall n, \ y_n(\beta \cdot x_n + \beta_0) \ge 1 - \xi_n$.
Notice that $\#\{\text{errors}\} \le \sum_n \mathbb{I}\{\xi_n \ge 1\} \le \sum_n \xi_n$.
Note: $\xi_n \in (0, 1]$ is punished, even though the label is correct.
The cost $C$ can be chosen by cross-validation.
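A hedged sketch of such a cross-validated choice using e1071's tune.svm (X and y as in the surrounding code examples; the cost grid is illustrative):
library(e1071)
cv <- tune.svm(X, y, kernel = "linear", scale = FALSE, cost = 10^(-2:3))  # 10-fold CV by default
cv$best.parameters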

Hinge loss
Note that the solution only depends on those points satisfying $y_n(\beta \cdot x_n + \beta_0) = 1 - \xi_n$.
Such points are called the support vectors. Some will be exactly on the margin boundary ($\xi_n = 0$), some on the wrong side of the margin boundary ($\xi_n > 0$).
This asymmetric cost function is known as the hinge loss:
$\ell(y, x) = \max\{0, \ 1 - y\,(\beta \cdot x + \beta_0)\}.$
[Figure: the hinge loss $\max(0, 1 - z)$ plotted against $z$.]
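A minimal sketch of how such a plot could be reproduced:
z <- seq(-2, 2, length.out = 200)
plot(z, pmax(0, 1 - z), type = "l", xlab = "z", ylab = "max(0, 1 - z)")  # hinge loss as a function of z = y * (beta.x + beta0)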

Non-separable data
ins <- svm(X, y, scale = FALSE, kernel = "linear", cost = 0.1)

[Figure: linear SVM fit to the non-separable data with cost = 0.1, shown in the (x1, x2) plane.]

Non-separable data
Higher cost parameter, tighter fit:
ins <- svm(X, y, scale = FALSE, kernel = "linear", cost = 10)

[Figure: linear SVM fit to the non-separable data with cost = 10, shown in the (x1, x2) plane.]

Summary
SVMs use separating hyperplanes to cut up the D-dimensional input space.
Explicit representation of the decision boundary.
The solution optimises the margin between the data points and the hyperplane.
Separable data: perfect classification of the training set.
Non-separable data: requires error terms $\xi_n$, $n = 1, \ldots, N$.
Hinge loss only punishes points on the wrong side of the margin boundary.
Next time: What about nonlinear decision boundaries?

Recap: Linear SVMs, separable case
Maximal margin hyperplane $\beta \cdot x + \beta_0 = 0$:
$\forall n, \ y_n(\beta \cdot x_n + \beta_0) \ge 1$ and $\|\beta\|$ is minimal.
[Figure: separable case, showing the planes $\beta \cdot x + \beta_0 = -1, 0, +1$ and the margin $2/\|\beta\|$.]

Recap: Linear SVMs, non-separable case
Introduce extra variables $\xi_n \ge 0$ and minimize $\tfrac{1}{2}\|\beta\|^2 + C \sum_n \xi_n$
such that $\forall n, \ y_n(\beta \cdot x_n + \beta_0) \ge 1 - \xi_n$.
[Figure: non-separable case, showing the planes $\beta \cdot x + \beta_0 = -1, 0, +1$, the margin $2/\|\beta\|$, and points with slack $\xi_n = 0$, $\xi_n < 1$, and $\xi_n > 1$.]

Nonlinear decision boundary
The separating hyperplane is always flat, with dimension D-1. But we can increase the available dimension by augmenting our data:
Xaug <- cbind(X, X[,2]^2)

[Figure: the augmented data plotted in three dimensions against x1, x2, and x2^2.]

Hyperplane in 3D
ins <- svm(Xaug, y, scale = FALSE, kernel = "linear", cost = 10)

[Figure: the separating hyperplane fitted in the augmented (x1, x2, x2^2) space.]

Project back down to 2D

[Figure: the induced nonlinear decision boundary in the original (x1, x2) plane.]
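As a short gloss (writing $\beta = (\beta_1, \beta_2, \beta_3)$ and $\beta_0$ for the coefficients of the hyperplane fitted in the augmented space), the flat boundary
$\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_2^2 + \beta_0 = 0$
is quadratic in $x_2$ when read in the original $(x_1, x_2)$ coordinates, which is why the projected boundary appears curved.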

Another quadratic example
True classification function: $x^2 + y^2 > 1$.

[Figure: the circular example data in the (x1, x2) plane.]

Linear SVM
mmh <- svm(X, y, scale = FALSE, kernel = "linear", cost = 10)

[Figure: linear SVM decision boundary for the circular example in the (x1, x2) plane.]

Augment the feature space
Xaug <- cbind(X, X[,1]^2 + X[,2]^2)
ins <- svm(Xaug, y, scale = FALSE, kernel = "linear", cost = 1e6)

[Figure: the augmented data plotted against x1, x2, and x1^2 + x2^2.]

Project back down
[Figure: the induced decision boundary in the original (x1, x2) plane.]

Linear SVMs: solution
Theorem
The solution takes the form
$\beta = \sum_{n=1}^{N} \alpha_n x_n.$
Proof.
Suppose $\beta$ is of the given form but that the solution is of the form $\beta + v$ with $v \cdot x_n = 0$ for every $n$; i.e. $v$ is orthogonal to all of the $x_n$. Then:
$y_n((\beta + v) \cdot x_n + \beta_0) = y_n(\beta \cdot x_n + \beta_0) \ge 1 - \xi_n,$
and, by Pythagoras, $\|\beta + v\|^2 = \|\beta\|^2 + \|v\|^2 \ge \|\beta\|^2$. In other words, $\beta$ satisfies the constraints and gives a value of the objective function that is no larger (strictly smaller when $v \neq 0$), so $\beta + v$ cannot have been optimal, a contradiction.

Linear SVMs: solution
$\beta = \sum_{n=1}^{N} \alpha_n x_n.$
The solution coefficients $\alpha_1, \ldots, \alpha_N$ are found by numerical convex optimization techniques (a generalization of Lagrange multipliers allowing the constraints to be inequalities rather than equalities).
In fact we have $\alpha_n \neq 0$ only for the support vectors.
For nonlinear SVMs we have, by the same argument,
$\beta = \sum_{n=1}^{N} \alpha_n \phi(x_n),$
where $\phi : \mathbb{R}^D \to \mathcal{H}$ is a feature mapping.
In our earlier example we had $\phi : \mathbb{R}^2 \to \mathbb{R}^3$ with
$\phi(x) = (x_1, \ x_2, \ x_1^2 + x_2^2)^T.$

The kernel trick
Given the previous theorem, re-express
Minimize $\tfrac{1}{2}\|\beta\|^2 + C \sum_n \xi_n$ subject to:
$\forall n, \ \xi_n \ge 0$, and $y_n(\beta \cdot x_n + \beta_0) \ge 1 - \xi_n$,
as
Find $\alpha_1, \ldots, \alpha_N$ and $\xi_1, \ldots, \xi_N \ge 0$ (together with $\beta_0$) such that
$\frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m \alpha_n \, x_m \cdot x_n + C \sum_{n=1}^{N} \xi_n$ is minimal subject to
$\forall n, \ y_n\Big( \sum_{m=1}^{N} \alpha_m \, x_n \cdot x_m + \beta_0 \Big) \ge 1 - \xi_n.$
Amazing observation! The algorithm depends on the $(x_n)_{1 \le n \le N}$ only through the dot products $(x_n \cdot x_m)_{1 \le m, n \le N}$.
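A one-line sketch of this observation (assuming, as in the code examples, that X is the N x D matrix of inputs): everything the dual problem needs from the data is the N x N matrix of pairwise dot products.
G <- X %*% t(X)  # Gram matrix: G[m, n] = x_m . x_n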

Kernel trick: nonlinear case
Let $K : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ denote the kernel function $K(x, x') = \phi(x) \cdot \phi(x')$.
As long as we can compute $K$, we never need to evaluate $\phi$!
$K$s are much simpler than $\phi$s.
The dimension of the feature space $\mathcal{H}$ can be arbitrarily large, even infinite!

Kernels
Linear kernel
$K(x, x') = x \cdot x'.$
Polynomial kernel
$K(x, x') = (c + x \cdot x')^p, \quad c \ge 0, \ p = 2, 3, \ldots.$
Radial basis function (RBF) kernel
$K(x, x') = \exp(-\gamma \|x - x'\|^2), \quad \gamma > 0.$
How to choose $c$ and $\gamma$? Use validation.
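As a hedged sketch (function names and default parameter values here are my own), the three kernels can be written down directly:
lin_kernel  <- function(x, z) sum(x * z)                               # K(x, x') = x . x'
poly_kernel <- function(x, z, c = 1, p = 2) (c + sum(x * z))^p         # polynomial kernel
rbf_kernel  <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))  # RBF kernel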

Polynomial kernel
$K(x, x') = (c + x \cdot x')^p$
dat.fr <- data.frame(cbind(y, X))
dat.fr$y <- factor(ifelse(y == TRUE, 1, -1), levels = c(-1, 1))
ins <- svm(y ~ ., dat.fr, kernel = "polynomial", degree = 2,
           gamma = 1, coef0 = 0, cost = 1)
[Figure: SVM classification plot for the degree-2 polynomial kernel; support vectors are marked with crosses.]

Polynomial kernel
For a simple quadratic ($c = 0$, $\gamma = 1$, $p = 2$): $K(x, x') = (x \cdot x')^2$,
with corresponding feature map (for $D = 2$):
$\phi(x) = \big( x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2 \big)^T.$
Check that $K(x, x') = \phi(x) \cdot \phi(x')$ for this example.
What do decision boundaries $\beta \cdot \phi(x) + \beta_0 = 0$ look like now, when viewed in the original 2D space?
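For completeness, the check amounts to one line:
$\phi(x) \cdot \phi(x') = x_1^2 x_1'^2 + 2\, x_1 x_2\, x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = (x \cdot x')^2 = K(x, x').$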

Conic sections
Curves 1×122x1x23×20:
if24130:ellipseorcircle;
if24130:parallellines;
if24130:hyperbola,ortwointersectinglines;
all centred at the origin.

Quadratic kernel
When $c = 1$, $K(x, x') = (1 + x \cdot x')^2$:
$\phi(x) = \big( 1, \ \sqrt{2}\, x_1, \ \sqrt{2}\, x_2, \ x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2 \big)^T.$
What do hyperplanes $\beta \cdot \phi(x) + \beta_0 = 0$ look like now?

Conic sections
General Cartesian form
1×12 2x1x2 3×2 4×1 5×2 0.
if24130:ellipseorcircle;
if24130:parallellines;
if24130:hyperbola,ortwointersectinglines.
Can shift the curve in the x1 and x2 directions.

RBF kernel
The radial basis function (RBF), also known as the squared exponential or Gaussian kernel:
$K(x, x') = \exp(-\gamma \|x - x'\|^2).$
This ranges from 1 when $x = x'$ to 0 in the limit $\|x - x'\| \to \infty$; the corresponding $\phi(x)$ is infinite-dimensional.

Another example

[Figure: another two-class example in the (x1, x2) plane.]

Linear kernel
s <- svm(y ~ ., d, kernel = "linear")
paste("acc:", mean(predict(s, d) == d$y), "SV:", nrow(s$SV))
plot(s, d, grid = 200, asp = 1)
[1] "acc: 0.87 SV: 35"
[Figure: SVM classification plot, linear kernel; support vectors are marked with crosses.]

Quadratic kernel
s <- svm(y ~ ., d, kernel = "polynomial", degree = 2, coef0 = 1)
paste("acc:", mean(predict(s, d) == d$y), "SV:", nrow(s$SV))
plot(s, d, grid = 200, asp = 1)
[1] "acc: 0.87 SV: 35"
[Figure: SVM classification plot, quadratic kernel; support vectors are marked with crosses.]

Cubic kernel
s <- svm(y ~ ., d, kernel = "polynomial", degree = 3, coef0 = 1)
paste("acc:", mean(predict(s, d) == d$y), "SV:", nrow(s$SV))
plot(s, d, grid = 200, asp = 1)
[1] "acc: 0.98 SV: 24"
[Figure: SVM classification plot, cubic kernel; support vectors are marked with crosses.]

Polynomial kernel, degree 7
s <- svm(y ~ ., d, kernel = "polynomial", degree = 7, coef0 = 1)
paste("acc:", mean(predict(s, d) == d$y), "SV:", nrow(s$SV))
plot(s, d, grid = 200, asp = 1)
[1] "acc: 1 SV: 14"
[Figure: SVM classification plot, degree-7 polynomial kernel; support vectors are marked with crosses.]

RBF kernel
s <- svm(y ~ ., d, kernel = "radial", gamma = 1)
paste("acc:", mean(predict(s, d) == d$y), "SV:", nrow(s$SV))
plot(s, d, grid = 200, asp = 1)
[1] "acc: 0.99 SV: 30"
[Figure: SVM classification plot, RBF kernel; support vectors are marked with crosses.]

Summary
We use kernel methods in SVMs to obtain a nonlinear decision boundary:
linear kernel $K(x, x') = x \cdot x'$
polynomial kernel $K(x, x') = (c + x \cdot x')^p$
RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$
Each of these corresponds to an inner product $\phi(x) \cdot \phi(x')$ in some feature space, but in practice we never need to evaluate $\phi(x)$ to compute it.
We can use cross-validation to select the tuning parameters.

Supervised classification
SVM is a deterministic classifier.
Either a linear or a nonlinear decision boundary (kernel method).
Fit parameters by convex optimisation (quadratic programming).
Connections with other methods (an incomplete list):
Logistic regression
linear, probabilistic; convex optimisation
k-nearest neighbours
piecewise constant, either deterministic or probabilistic
Artificial neural networks
nonlinear, probabilistic; non-convex optimisation
Gaussian processes
nonlinear, probabilistic; optimisation or MCMC
