CS5487 Problem Set 10
Kernels
Antoni Chan
Department of Computer Science
City University of Hong Kong

Kernel functions

Problem 10.1 Constructing kernels from kernels
Suppose k_1(x, z) and k_2(x, z) are valid kernels. Prove the following kernels are also valid kernels:
(a) kernel scaling: k(x, z) = c k_1(x, z), where c > 0.
(b) sum: k(x, z) = k_1(x, z) + k_2(x, z).
(c) product: k(x, z) = k_1(x, z) k_2(x, z).
(d) input scaling: k(x, z) = f(x) k_1(x, z) f(z), for any function f.
(e) polynomial: k(x, z) = k_1(x, z)^q, where q is a positive integer.
(f) exponential: k(x, z) = \exp(k_1(x, z)).
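These closure properties can also be sanity-checked numerically: build Gram matrices for two known-valid base kernels, combine them as in (a)-(f), and confirm that the smallest eigenvalue of each combined Gram matrix is non-negative (up to round-off). The following Python/NumPy sketch is an illustration only, not a proof; the Gaussian k1, the linear k2, and the function f are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                     # 8 sample points in R^3

def k1(x, z):                                   # Gaussian kernel (valid)
    return np.exp(-0.5 * np.sum((x - z) ** 2))

def k2(x, z):                                   # linear kernel (valid)
    return x @ z

def gram(k, X):
    return np.array([[k(x, z) for z in X] for x in X])

G1, G2 = gram(k1, X), gram(k2, X)
f = np.linalg.norm(X, axis=1)                   # an arbitrary function f(x), for (d)
combos = {
    "scaling": 3.0 * G1,                        # (a) c k1, with c = 3 > 0
    "sum":     G1 + G2,                         # (b)
    "product": G1 * G2,                         # (c) elementwise product of Gram matrices
    "input":   f[:, None] * G1 * f[None, :],    # (d) f(x) k1(x,z) f(z)
    "poly":    G1 ** 3,                         # (e) q = 3
    "exp":     np.exp(G1),                      # (f)
}
for name, G in combos.items():
    print(name, np.linalg.eigvalsh(G).min())    # all should be >= 0 up to round-off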
Problem 10.2 Kernels on real vectors
Let x, z \in R^n. Show the following are valid kernels:
(a) Gaussian or RBF: k(x, z) = \exp(-\alpha \|x - z\|^2), for \alpha > 0.
(b) exponential: k(x, z) = \exp(-\alpha \|x - z\|), for \alpha > 0.
(c) linear: k(x, z) = x^T A z, where A is a positive definite matrix.
(d) polynomial: k(x, z) = (x^T z + 1)^q, where q is a positive integer.
(e) cosine: k(x, z) = \frac{x^T z}{\|x\| \|z\|}.
(f) periodic: k(x, z) = \exp(-\alpha \sin^2(x - z)), for x, z \in R.
For (f), consider the warping u(x) = [\sin x, \cos x]^T and the fact that \|u(x) - u(z)\|^2 = 4 \sin^2((x - z)/2).
What are the feature transformations corresponding to each of these kernels?
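As a quick numerical aside on the hint for (f), the warping identity can be verified directly. The sketch below (a check only, with arbitrary test values) confirms that \|u(x) - u(z)\|^2 = 4 \sin^2((x - z)/2), so the periodic kernel can be viewed as a Gaussian-type kernel applied to warped inputs.

import numpy as np

rng = np.random.default_rng(1)
x, z = rng.uniform(-10, 10, size=2)             # arbitrary scalar inputs
u = lambda t: np.array([np.sin(t), np.cos(t)])  # the warping from the hint
lhs = np.sum((u(x) - u(z)) ** 2)                # ||u(x) - u(z)||^2
rhs = 4 * np.sin((x - z) / 2) ** 2              # 4 sin^2((x - z)/2)
print(np.isclose(lhs, rhs))                     # True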
Problem 10.3 Constructing kernels from kernels (part 2) - feature selection
Suppose x, z \in R^d, \phi(x) is a function from x to R^M, and k_3(\cdot, \cdot) is a valid kernel in R^M. Prove the following is a valid kernel:
(a) k(x, z) = k_3(\phi(x), \phi(z)).
Let x_a and x_b be variables with x = (x_a, x_b), and k_a and k_b are valid kernels over their respective spaces. Show that the following are valid kernels:
(b) k(x, z) = k_a(x_a, z_a) + k_b(x_b, z_b).
(c) k(x, z) = k_a(x_a, z_a) k_b(x_b, z_b).
Problem 10.4 Constructing kernels from kernels (part 3) - normalization
Some kernels are poorly scaled, in the sense that the dynamic range of the values is much less than the absolute values. One solution to this problem is to normalize the kernel,
\tilde{k}(x, z) = \frac{k(x, z)}{\sqrt{k(x, x) k(z, z)}}.   (10.1)
This will set the self-similarity value to \tilde{k}(x, x) = 1.
(a) Show that the normalized kernel \tilde{k} is a valid kernel.
(b) Let \phi(x) be the feature transformation associated with the kernel k. Show that \tilde{k} is equivalent to calculating the cosine between \phi(x) and \phi(z) in the high-dimensional feature space.
(c) What is the range of possible values of the normalized kernel \tilde{k}(x, z)?
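The cosine interpretation in (b) can be illustrated numerically when an explicit feature map is available. The sketch below (an illustration only) assumes the polynomial kernel k(x, z) = (x^T z + 1)^2, whose feature map can be written out by expanding the square, and checks that the normalized kernel equals the cosine between the feature vectors.

import numpy as np

rng = np.random.default_rng(2)
x, z = rng.normal(size=(2, 4))                 # two points in R^4

def k(a, b):                                   # base kernel (x^T z + 1)^2
    return (a @ b + 1.0) ** 2

def phi(a):                                    # explicit feature map of (x^T z + 1)^2
    d = len(a)
    quad = [a[i] * a[j] * (np.sqrt(2) if i < j else 1.0)
            for i in range(d) for j in range(i, d)]
    return np.concatenate([quad, np.sqrt(2) * a, [1.0]])

k_norm = k(x, z) / np.sqrt(k(x, x) * k(z, z))                                 # eq. (10.1)
cosine = phi(x) @ phi(z) / (np.linalg.norm(phi(x)) * np.linalg.norm(phi(z)))
print(np.isclose(k_norm, cosine))              # True: normalized kernel = cosine in feature space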
Problem 10.5 Constructing kernels from kernels (part 4) - kernel distance
If k(x, z) is a valid kernel, then it can be interpreted as a dot product. Hence, we can also construct a distance (norm) based on this dot product,
\|x - z\|^2 = x^T x - 2 x^T z + z^T z  \Rightarrow  d^2(x, z) = k(x, x) - 2 k(x, z) + k(z, z).   (10.2)
This squared-distance can be used in place of the standard L2 norm.
(a) Show that the exponential kernel distance is a valid kernel:
\tilde{k}(x, z) = \exp(-(k(x, x) - 2 k(x, z) + k(z, z))).   (10.3)
This is similar to a Gaussian kernel, but with the L2-norm replaced with the kernel distance.
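A small numerical illustration of (10.2)-(10.3): with any valid base kernel (a polynomial kernel is assumed below purely for concreteness), the Gram matrix of the exponential kernel distance should again be positive semidefinite.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(7, 3))
k = lambda a, b: (a @ b + 1.0) ** 2            # assumed base kernel

def d2(a, b):                                  # squared kernel distance, eq. (10.2)
    return k(a, a) - 2 * k(a, b) + k(b, b)

G = np.array([[np.exp(-d2(a, b)) for b in X] for a in X])   # eq. (10.3)
print(np.linalg.eigvalsh(G).min())             # >= 0 up to round-off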
Problem 10.6 Kernels for histograms
Let x and z be d-dimensional histograms, i.e., x, z \in R^d, where each bin x_i, z_i \geq 0. Show that the following kernels between histograms are valid kernels:
(a) correlation: k(x, z) = \sum_{i=1}^d x_i z_i.
(b) Bhattacharyya: k(x, z) = \sum_{i=1}^d \sqrt{x_i} \sqrt{z_i}.
(c) histogram intersection: k(x, z) = \sum_{i=1}^d \min(x_i, z_i).
(d) \chi^2-RBF: k(x, z) = \exp(-\chi^2(x, z)), where \chi^2(x, z) = \sum_{i=1}^d \frac{(x_i - z_i)^2}{\frac{1}{2}(x_i + z_i)}.
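For intuition, the four histogram kernels can be evaluated on a few random normalized histograms and their Gram matrices checked for positive semidefiniteness. The sketch below is a numerical check only; the bin count, the number of histograms, and the small constant guarding empty bins in the \chi^2 term are arbitrary choices.

import numpy as np

rng = np.random.default_rng(4)
H = rng.random(size=(6, 5))
H /= H.sum(axis=1, keepdims=True)              # 6 normalized histograms with 5 bins

def correlation(x, z):   return np.sum(x * z)                       # (a)
def bhattacharyya(x, z): return np.sum(np.sqrt(x) * np.sqrt(z))     # (b)
def intersection(x, z):  return np.sum(np.minimum(x, z))            # (c)
def chi2_rbf(x, z):                                                 # (d)
    chi2 = np.sum((x - z) ** 2 / (0.5 * (x + z) + 1e-12))           # eps avoids 0/0 on empty bins
    return np.exp(-chi2)

for kern in (correlation, bhattacharyya, intersection, chi2_rbf):
    G = np.array([[kern(x, z) for z in H] for x in H])
    print(kern.__name__, np.linalg.eigvalsh(G).min())               # all >= 0 up to round-off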
Problem 10.7 Kernels on sets
Let X = \{x_1, \ldots, x_M\} and Z = \{z_1, \ldots, z_P\} be sets, where x, z are the set items (these could be real vectors or symbols). Define a kernel between items as k(x, z).
(a) Show that the kernel between sets X and Z,
k(X, Z) = \sum_{x \in X} \sum_{z \in Z} k(x, z),   (10.4)
is a valid kernel. Hint: consider the feature transformation \Phi(X) = \sum_{i=1}^M \phi(x_i), where \phi(x_i) is the feature map induced by k(x, z).
(b) Show that the kernel
k(X, Z) = \frac{1}{|X||Z|} \sum_{x \in X} \sum_{z \in Z} k(x, z)^q,   (10.5)
is a valid kernel, where q is a non-negative integer. This is called an exponent match kernel.
(c) Show that the kernel, which counts the number of common elements, is a valid kernel,
k(X, Z) = |X \cap Z|.   (10.6)
(d) Show that the following kernel is also valid,
k(X, Z) = 2^{|X \cap Z|}.   (10.7)
Another kernel on sets is the pyramid match kernel. There are also other kernels that are not positive definite, which have the form
k(X, Z) = e^{-d(X, Z)^2},   (10.8)
using the sum-of-minimum distances,
d(X, Z) = \frac{1}{|X| + |Z|} \Big( \sum_i \min_j d(x_i, z_j) + \sum_i \min_j d(z_i, x_j) \Big),   (10.9)
or the max-of-minimum distances (Hausdorff distance),
d(X, Z) = \max\Big( \max_i \min_j d(x_i, z_j), \max_i \min_j d(z_i, x_j) \Big).   (10.10)
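The set kernels in (10.4), (10.6), and (10.7) are easy to try out on small symbol sets. The sketch below is only an illustration; the delta item kernel and the particular sets are assumptions, and with that item kernel the summation kernel (10.4) reduces to the intersection count of (10.6).

import numpy as np

sets = [{"a", "b"}, {"b", "c", "d"}, {"a", "d"}, {"c"}]
delta = lambda x, z: 1.0 if x == z else 0.0      # assumed item kernel on symbols

def k_sum(X, Z):                                 # eq. (10.4): sum over all item pairs
    return sum(delta(x, z) for x in X for z in Z)

def k_count(X, Z):                               # eq. (10.6): number of common elements
    return float(len(X & Z))

def k_exp2(X, Z):                                # eq. (10.7): 2^{|X ∩ Z|}
    return 2.0 ** len(X & Z)

for kern in (k_sum, k_count, k_exp2):
    G = np.array([[kern(A, B) for B in sets] for A in sets])
    print(kern.__name__, np.linalg.eigvalsh(G).min())   # >= 0 up to round-off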
Problem 10.8 Kernels on distributions
Consider that the two inputs are distributions, p(x) and q(x). The correlation is a valid kernel between distributions,
k(p, q) = \int p(x) q(x) dx.   (10.11)
(a) Show that the kernel
k(p, q) = \int p(x)^\rho q(x)^\rho dx   (10.12)
is a valid kernel, for any \rho > 0. This is called the probability product kernel. A special case is the Bhattacharyya kernel (\rho = 0.5), and the correlation kernel (\rho = 1).
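For a concrete feel for (10.11)-(10.12), the integral can be evaluated by numerical quadrature for simple one-dimensional densities. The sketch below assumes SciPy is available and uses two arbitrary Gaussians; it is an illustration, not part of the problem.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def prob_product(p, q, rho):
    # eq. (10.12): integral of p(x)^rho q(x)^rho dx, via numerical quadrature
    val, _ = quad(lambda x: p(x) ** rho * q(x) ** rho, -np.inf, np.inf)
    return val

p = norm(loc=0.0, scale=1.0).pdf               # assumed distributions
q = norm(loc=1.5, scale=0.7).pdf
print(prob_product(p, q, rho=1.0))             # correlation kernel, eq. (10.11)
print(prob_product(p, q, rho=0.5))             # Bhattacharyya kernel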
Problem 10.9 Kernels on sample spaces
Let X be a sample space of random variable x, and A and B events, where
p(A) = \int_{x \in A} p(x) dx.   (10.13)
Define the kernel
k : X \times X \to [-1, 1],   (10.14)
k(A, B) = p(A \cap B) - p(A) p(B).   (10.15)
(a) Show that k(A, B) is a valid kernel. Hint: consider the feature transformation \phi(A) = 1_A - p(A), where 1_A is the function that takes the value 1 on the set A, and 0 otherwise.
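On a small discrete sample space, the event kernel (10.15) can be evaluated for every pair of events and its Gram matrix checked numerically. The sketch below is an illustration with an arbitrary four-outcome space and probability vector.

import numpy as np
from itertools import combinations

omega = list(range(4))                            # a small discrete sample space
p = np.array([0.1, 0.2, 0.3, 0.4])                # outcome probabilities

events = [frozenset(c) for r in range(len(omega) + 1)
          for c in combinations(omega, r)]        # all 16 events (subsets)
prob = lambda A: p[list(A)].sum()

def k(A, B):                                      # eq. (10.15)
    return prob(A & B) - prob(A) * prob(B)

G = np.array([[k(A, B) for B in events] for A in events])
print(np.linalg.eigvalsh(G).min())                # ~0 up to round-off: the Gram matrix is PSD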
Kernel trick
Problem 10.10 Kernel SVM bias term
The bias term for the linear SVM can be calculated as
b = \frac{1}{|SV|} \sum_{i \in SV} (y_i - w^T x_i),   (10.16)
where SV = \{i \mid \alpha_i > 0\} is the set of support vectors.
(a) Derive an expression for b for the kernel SVM.
Problem 10.11 Kernel logistic regression
Consider the two-class regularized logistic regression from Problem 8.4, which is interpretable as MAP estimation. The prior distribution on w is zero-mean Gaussian with known precision matrix \Gamma (i.e., inverse of the covariance matrix),
p(w) = N(w | 0, \Gamma^{-1}).   (10.17)
Given the training set D = \{X, y\}, the MAP estimate is
w^* = \arg\max_w \log p(y | X, w) + \log p(w),   (10.18)
which can be calculated using the Newton-Raphson method, with the iterations
w_{new} = (X R X^T + \Gamma)^{-1} X R z,   (10.19)
where R and z are calculated from the previous w_{old},
\pi_i = \sigma(x_i^T w_{old}),  \pi = [\pi_1, \ldots, \pi_n]^T,   (10.20)
R = \mathrm{diag}(\pi_1(1 - \pi_1), \ldots, \pi_n(1 - \pi_n)),   (10.21)
z = X^T w_{old} + R^{-1}(y - \pi),   (10.22)
and \sigma(a) = \frac{1}{1 + e^{-a}} is the logistic sigmoid function. In this problem, we will kernelize the MAP version of logistic regression to obtain kernel logistic regression.
(a) Define w^T x_* as the linear function of a novel input x_*. Use the form in (10.19) to show that the linear function w^T x_* can be kernelized when z is known,
w^T x_* = k_*^T (K + R^{-1})^{-1} z,   (10.23)
where k_* = [k(x_*, x_1), \ldots, k(x_*, x_n)]^T is the test kernel, and K = [k(x_i, x_j)]_{ij} is the training kernel matrix. Hence, the probability of class 1 for a new point x_* is \sigma(k_*^T (K + R^{-1})^{-1} z). Hint: use a matrix inverse identity (Problem 1.15), and note that x_i^T \Gamma^{-1} x_j is an inner product.
(b) Using (10.23), show that the z in (10.22) can be calculated iteratively from z_{old},
\eta_{old} = K (K + R^{-1})^{-1} z_{old},   (10.24)
z = \eta_{old} + R^{-1}(y - \pi),   (10.25)
where \pi_i = \sigma([\eta_{old}]_i). Hence, the Newton-Raphson iterations can also be kernelized for training.
(c) What happened to the prior covariance \Gamma^{-1}? What is the relationship between the prior and the kernel function, e.g., when \Gamma = I?
(d) Is the scale of the kernel important? In other words, is using k(x_i, x_j) equivalent to using \tilde{k}(x_i, x_j) = \gamma k(x_i, x_j), for some \gamma > 0? Why or why not?
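A compact way to see the kernelized iterations at work is to code (10.23)-(10.25) directly. The sketch below is an illustration only: it plugs in an RBF kernel, toy two-class data, a fixed number of iterations, and a small jitter inside R^{-1} for numerical stability; none of these choices come from the problem itself.

import numpy as np

rng = np.random.default_rng(5)
n = 40
X = rng.normal(size=(n, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)                  # toy labels in {0, 1}

def rbf(A, B):                                             # assumed kernel function
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

K = rbf(X, X)                                              # training kernel matrix
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

eta = np.zeros(n)                                          # current linear function values at the training points
for _ in range(15):                                        # kernelized Newton-Raphson
    pi = sigmoid(eta)
    Rinv = np.diag(1.0 / (pi * (1.0 - pi) + 1e-9))         # R^{-1}, with a small jitter
    z = eta + Rinv @ (y - pi)                              # eq. (10.25)
    eta = K @ np.linalg.solve(K + Rinv, z)                 # eq. (10.24)

x_star = np.array([0.5, 0.5])
k_star = rbf(x_star[None, :], X).ravel()                   # test kernel vector k_*
print(sigmoid(k_star @ np.linalg.solve(K + Rinv, z)))      # class-1 probability, eq. (10.23)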
Problem 10.12 Gaussian process regression (non-linear Bayesian regression)
In this problem we will apply the kernel trick to Bayesian linear regression (Problem 3.10) to obtain a non-linear Bayesian regression algorithm, known as Gaussian process regression. The prior distribution on the linear weights is Gaussian, p(\theta) = N(0, \Gamma). The posterior distribution of the parameters is Gaussian,
p(\theta | D) = N(\theta | \hat{\mu}, \hat{\Sigma}),   (10.26)
\hat{\mu} = (\Gamma^{-1} + X \Sigma^{-1} X^T)^{-1} X \Sigma^{-1} y,   (10.27)
\hat{\Sigma} = (\Gamma^{-1} + X \Sigma^{-1} X^T)^{-1},   (10.28)
where \hat{\mu} is the posterior mean and \hat{\Sigma} is the posterior covariance. Given a novel input x_*, the predictive distribution of the corresponding output f_* = f(x_*) is also Gaussian,
p(f_* | x_*, D) = N(f_* | \mu_*, \sigma_*^2),   (10.29)
\mu_* = x_*^T \hat{\mu},   (10.30)
\sigma_*^2 = x_*^T \hat{\Sigma} x_*,   (10.31)
where the posterior mean \hat{\mu} and covariance \hat{\Sigma} are given in (10.27) and (10.28). We will assume that the observation noise is i.i.d., i.e., \Sigma = \sigma^2 I.
(a) Show that the predictive mean \mu_* can be written in the form,
\mu_* = \psi^T (X^T \Gamma X + \sigma^2 I)^{-1} y,   (10.32)
where \psi = X^T \Gamma x_*. Hint: use a matrix inverse identity from Problem 1.15.
(b) Show that the predictive variance \sigma_*^2 can be written in the form,
\sigma_*^2 = x_*^T \Gamma x_* - \psi^T (X^T \Gamma X + \sigma^2 I)^{-1} \psi.   (10.33)
Hint: use a matrix inverse identity from Problem 1.15.
(c) The kernel trick can be applied to (a) and (b) by defining k(x_i, x_j) = x_i^T \Gamma x_j, yielding
\mu_* = k_*^T (K + \sigma^2 I)^{-1} y,   (10.34)
\sigma_*^2 = k_{**} - k_*^T (K + \sigma^2 I)^{-1} k_*,   (10.35)
where K = [k(x_i, x_j)]_{ij} is the kernel matrix, k_* = [k(x_*, x_i)]_i is the test kernel vector, and k_{**} = k(x_*, x_*). What happened to the prior covariance \Gamma? What is the relationship between the prior covariance \Gamma and the kernel function k(x_i, x_j)?
(d) Define z = (K + \sigma^2 I)^{-1} y. The predictive mean \mu_*, which is a function of x_*, can then be written as
\mu_*(x_*) = \sum_{i=1}^n z_i k(x_*, x_i),   (10.36)
where z_i is the i-th element of z. Hence, the regressed function is a linear combination of kernel functions k(x_*, x_i), where x_i is fixed. What will be the general shape of the regressed function for the following kernels?
linear: k(x_i, x_j) = x_i^T x_j.
polynomial: k(x_i, x_j) = (x_i^T x_j + 1)^2.
RBF: k(x_i, x_j) = \alpha_1 e^{-\alpha_2 \|x_i - x_j\|^2}.
periodic: k(x_i, x_j) = \alpha_1 \exp(-\alpha_2 \sin^2((x_i - x_j)/2)).
\alpha_1 and \alpha_2 are the kernel parameters.
(e) \sigma_*^2 is the variance of the predictive distribution of the function value f_*, which corresponds to the uncertainty of the prediction (high variance means high uncertainty). Under what conditions will the variance be large? When will the variance be small?
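Equations (10.34)-(10.36) translate almost line-for-line into code. The sketch below is an illustration on synthetic 1-D data; the RBF kernel, its parameters, the noise level, and the data are all assumed for the example.

import numpy as np

rng = np.random.default_rng(6)
n = 25
x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)        # noisy 1-D training data
sigma2 = 0.01                                       # observation noise variance

def rbf(a, b, a1=1.0, a2=2.0):                      # k(x, x') = a1 exp(-a2 (x - x')^2)
    return a1 * np.exp(-a2 * (a[:, None] - b[None, :]) ** 2)

K = rbf(x, x)                                       # training kernel matrix
xs = np.linspace(-3, 3, 200)                        # test inputs
Ks = rbf(xs, x)                                     # rows are the test kernel vectors k_*

z = np.linalg.solve(K + sigma2 * np.eye(n), y)      # z = (K + sigma^2 I)^{-1} y
mu = Ks @ z                                         # predictive means, eqs. (10.34)/(10.36)
V = np.linalg.solve(K + sigma2 * np.eye(n), Ks.T)
var = rbf(xs, xs).diagonal() - np.einsum('ij,ji->i', Ks, V)   # predictive variances, eq. (10.35)
print(mu[:3], var[:3])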
Problem 10.13 Kernel perceptron
For a training set D = \{(x_1, y_1), \ldots, (x_n, y_n)\}, the Perceptron algorithm can be implemented as follows.

Algorithm 1 Perceptron algorithm
  set w = 0, b = 0, R = \max_i \|x_i\|
  repeat
    for i = 1, \ldots, n do
      if y_i (w^T x_i + b) \leq 0 then
        set w = w + \eta y_i x_i
        set b = b + \eta y_i R^2
      end if
    end for
  until there are no classification errors

The final classifier is y = \mathrm{sign}(w^T x + b).
(a) Is the learning rate \eta relevant? Why?
(b) Show that w learned by the perceptron algorithm must take the form w = \sum_{i=1}^n \alpha_i y_i x_i, where \alpha_i \geq 0, \forall i.
(c) Using (b), show that an equivalent Perceptron algorithm (the dual Perceptron) is:
Algorithm 2 Perceptron algorithm (dual)
  set \alpha = 0, b = 0, R = \max_i \|x_i\|
  repeat
    for i = 1, \ldots, n do
      if y_i (\sum_{j=1}^n \alpha_j y_j x_j^T x_i + b) \leq 0 then
        set \alpha_i = \alpha_i + 1
        set b = b + y_i R^2
      end if
    end for
  until there are no classification errors
(d) Can you give an interpretation to the parameters \alpha_i? Which among the samples x_i are the hardest to classify?
(e) Apply the kernel trick to the dual perceptron algorithm to obtain the kernel perceptron algorithm. What is the kernelized decision function?
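For reference, Algorithm 2 can be run directly on a small separable toy problem. The sketch below is an illustration only; the data, the linear kernel function, and the cap on the number of epochs are assumptions, not part of the algorithm statement.

import numpy as np

rng = np.random.default_rng(7)
n = 30
X = rng.normal(size=(n, 2))
w_true = np.array([1.0, 1.0])
y = np.where(X @ w_true > 0, 1.0, -1.0)
X += 0.5 * y[:, None] * w_true / np.linalg.norm(w_true)    # push classes apart for a clear margin

kern = lambda a, b: a @ b                                   # linear kernel (inner product)
K = np.array([[kern(xi, xj) for xj in X] for xi in X])
alpha, b = np.zeros(n), 0.0
R = np.max(np.linalg.norm(X, axis=1))

for _ in range(100):                                        # Algorithm 2; converges quickly when separable
    errors = 0
    for i in range(n):
        if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
            alpha[i] += 1.0
            b += y[i] * R ** 2
            errors += 1
    if errors == 0:
        break

f = lambda x: np.sign(np.sum(alpha * y * np.array([kern(xj, x) for xj in X])) + b)
print(f(np.array([1.0, 1.0])), f(np.array([-1.0, -1.0])))   # +1.0 -1.0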
Problem 10.14 Kernel k-means
In this problem we will kernelize the k-means algorithm.
Consider the original k-means algorithm. Given a set of points X = \{x_1, \ldots, x_n\}, where x_i \in R^d, the goal is to assign a cluster label to each point, y_i \in \{1, \ldots, K\}, where K is the number of clusters. The k-means algorithm calculates the cluster centers \mu_j using the current assignment to each cluster (K is assumed to be known). In each iteration, k-means performs the following two steps:
Cluster Assignment: z_{ij} = \begin{cases} 1, & j = \arg\min_{k \in \{1, \ldots, K\}} \|x_i - \mu_k\|^2 \\ 0, & \text{otherwise} \end{cases}   (10.37)
Estimate Center: \mu_j = \frac{\sum_{i=1}^n z_{ij} x_i}{\sum_{i=1}^n z_{ij}}.   (10.38)
The cluster label for point x_i is the label of the closest cluster center, y_i = \arg\max_j z_{ij}.
The disadvantage of k-means is that it only works when the clusters can be separated by a hyperplane. This is a consequence of using the Euclidean distance to measure the distance to the cluster center. To apply the kernel trick, we need to rewrite k-means so that it only uses inner products between the data points x_i.
(a) Show that the squared distance from x to the cluster center \mu_k can be written only using inner products as
d(x, \mu_k) = \|x - \mu_k\|^2 = x^T x - \frac{2}{N_k} \sum_{l=1}^n z_{lk} x^T x_l + \frac{1}{N_k^2} \sum_{l=1}^n \sum_{m=1}^n z_{lk} z_{mk} x_l^T x_m,   (10.39)
where N_k = \sum_{l=1}^n z_{lk} is the number of points in cluster k.
(b) Apply the kernel trick to (10.39) to obtain the kernelized squared distance,
d(x, \mu_k) = \|x - \mu_k\|^2 = k(x, x) - \frac{2}{N_k} \sum_{l=1}^n z_{lk} k(x, x_l) + \frac{1}{N_k^2} \sum_{l=1}^n \sum_{m=1}^n z_{lk} z_{mk} k(x_l, x_m).   (10.40)
(c) What is the interpretation of the distance in (10.40) when k(x, x') is the Gaussian kernel? When will the distance be small?
(d) Show that the kernel k-means algorithm is:
Calculate Distances: d(x_i, \mu_k) = k(x_i, x_i) - \frac{2}{N_k} \sum_{l=1}^n z_{lk} k(x_i, x_l) + \frac{1}{N_k^2} \sum_{l=1}^n \sum_{m=1}^n z_{lk} z_{mk} k(x_l, x_m), \forall i   (10.41)
Cluster Assignment: z_{ij} = \begin{cases} 1, & j = \arg\min_{k \in \{1, \ldots, K\}} d(x_i, \mu_k) \\ 0, & \text{otherwise} \end{cases}   (10.42)
A weighted version of kernel k-means has interesting connections to many spectral clustering and graph clustering methods (see I.S. Dhillon, Y. Guan, B. Kulis, "Kernel k-means: spectral clustering and normalized cuts," Proc. ACM SIGKDD, 2004).
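The two kernelized steps (10.41)-(10.42) can be vectorized compactly. The sketch below is only an illustration: the Gaussian kernel bandwidth, the toy two-blob data, the random initialization, the iteration count, and the guard against empty clusters are all assumptions made for the example.

import numpy as np

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(30, 2)),
               rng.normal(loc=[2.0, 0.0], scale=0.5, size=(30, 2))])
n = len(X)

K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian kernel matrix
n_clusters = 2
z = np.eye(n_clusters)[rng.integers(0, n_clusters, size=n)]         # random initial assignments (n x K)

for _ in range(20):
    Nk = np.maximum(z.sum(axis=0), 1.0)              # cluster sizes N_k (guard against empty clusters)
    d = (np.diag(K)[:, None]                         # k(x_i, x_i)
         - 2.0 * (K @ z) / Nk                        # - 2/N_k sum_l z_lk k(x_i, x_l)
         + (z * (K @ z)).sum(axis=0) / Nk ** 2)      # + 1/N_k^2 sum_lm z_lk z_mk k(x_l, x_m); eq. (10.41)
    z = np.eye(n_clusters)[np.argmin(d, axis=1)]     # eq. (10.42)

print(np.argmax(z, axis=1))                          # final cluster labels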
Problem 10.15 Kernel discriminant analysis
In this problem we will derive a kernel version of Fisher's linear discriminant (FLD, see Problem 7.6). Let X_1 = [x_1, \ldots, x_{n_1}] and X_2 = [x_1, \ldots, x_{n_2}] be the matrices of feature vectors from class 1 and class 2, and n_1 and n_2 are the number of feature vectors for class 1 and class 2, respectively. The class mean and scatter matrix are given by
\mu_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i,   S_j = \sum_{i=1}^{n_j} (x_i - \mu_j)(x_i - \mu_j)^T.   (10.43)
FLD finds the optimal projection that maximizes the ratio of the between-class scatter and the within-class scatter,
w^* = \arg\max_w J(w),   J(w) = \frac{w^T S_B w}{w^T S_W w},   (10.44)
where S_B and S_W are the between- and within-class scatter matrices,
S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T,   S_W = S_1 + S_2.   (10.45)
The optimal projection is w^* = S_W^{-1}(\mu_1 - \mu_2).
Let X = [X_1, X_2] be all the data. To do the kernelization, we will first assume that the optimal w can be written as a linear combination of the data points,
w = \sum_{i=1}^n \alpha_i x_i = X \alpha,   (10.46)
where \alpha = [\alpha_1, \ldots, \alpha_n]^T are the weights.
(a) Why is the assumption in (10.46) valid?
(b) Show that the class mean can be written as
\mu_j = \frac{1}{n_j} X_j 1,   (10.47)
and the scatter matrix as
S_j = X_j (I - \frac{1}{n_j} 1 1^T) X_j^T,   (10.48)
where 1 is a vector of ones.
(c) Using (b), show the mean and scatter matrices, when multiplied by w, can be written in terms of \alpha,
w^T \mu_j = \alpha^T \tilde{\mu}_j,   w^T S_j w = \alpha^T \tilde{S}_j \alpha,   (10.49)
where
\tilde{\mu}_j = \frac{1}{n_j} X^T X_j 1,   \tilde{S}_j = X^T X_j (I - \frac{1}{n_j} 1 1^T) X_j^T X.   (10.50)
(d) Apply the kernel trick to \tilde{\mu}_j and \tilde{S}_j to obtain the expressions,
\tilde{\mu}_j = \frac{1}{n_j} K_j 1,   (10.51)
\tilde{S}_j = K_j (I - \frac{1}{n_j} 1 1^T) K_j^T,   (10.52)
where K_j is the kernel matrix between X and X_j, i.e., [K_j]_{i,i'} = k(x_i, x_{i'}), x_i \in X, x_{i'} \in X_j.
(e) Show that the kernelized version of (10.44) is
\alpha^* = \arg\max_\alpha J(\alpha),   J(\alpha) = \frac{\alpha^T \tilde{S}_B \alpha}{\alpha^T \tilde{S}_W \alpha},   (10.53)
where
\tilde{S}_B = (\tilde{\mu}_1 - \tilde{\mu}_2)(\tilde{\mu}_1 - \tilde{\mu}_2)^T,   \tilde{S}_W = \tilde{S}_1 + \tilde{S}_2.   (10.54)
(f) Show that the optimal \alpha is given by
\alpha^* = \tilde{S}_W^{-1}(\tilde{\mu}_1 - \tilde{\mu}_2).   (10.55)
Hint: similar to the original problem, only the direction of \alpha matters, and not the magnitude.
(g) Finally, show that the kernelized projection of x_* is
z = \sum_{i=1}^n \alpha_i k(x_*, x_i).   (10.56)
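The kernelized quantities (10.51)-(10.56) can be assembled directly from kernel matrices. The sketch below is only an illustration: it assumes an RBF kernel, toy Gaussian classes, rows (rather than columns) as data points, and adds a small ridge to \tilde{S}_W, which is a common regularization since that matrix is typically rank-deficient.

import numpy as np

rng = np.random.default_rng(9)
n1, n2 = 20, 25
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(n1, 2))   # class 1 (rows are points here)
X2 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n2, 2))   # class 2
X = np.vstack([X1, X2])
n = n1 + n2

def kern(A, B):                                            # assumed RBF kernel
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

K1, K2 = kern(X, X1), kern(X, X2)                          # n x n_1 and n x n_2 kernel matrices

def m_tilde(Kj, nj):                                       # eq. (10.51)
    return Kj @ np.ones(nj) / nj

def S_tilde(Kj, nj):                                       # eq. (10.52)
    C = np.eye(nj) - np.ones((nj, nj)) / nj
    return Kj @ C @ Kj.T

m1, m2 = m_tilde(K1, n1), m_tilde(K2, n2)
SW = S_tilde(K1, n1) + S_tilde(K2, n2) + 1e-6 * np.eye(n)  # eq. (10.54), plus a small ridge
alpha = np.linalg.solve(SW, m1 - m2)                       # eq. (10.55)

x_star = np.array([1.0, 1.0])
z = alpha @ kern(X, x_star[None, :]).ravel()               # eq. (10.56): projection of a new point
print(z)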