C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. © 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml
Chapter 4
Covariance Functions
We have seen that a covariance function is the crucial ingredient in a Gaussian process predictor, as it encodes our assumptions about the function which we wish to learn. From a slightly different viewpoint it is clear that in supervised learning the notion of similarity between data points is crucial; it is a basic assumption that points with inputs x which are close are likely to have similar target values y, and thus training points that are near to a test point should be informative about the prediction at that point. Under the Gaussian process view it is the covariance function that defines nearness or similarity.
An arbitrary function of input pairs $\mathbf{x}$ and $\mathbf{x}'$ will not, in general, be a valid covariance function.¹ The purpose of this chapter is to give examples of some commonly-used covariance functions and to examine their properties. Section 4.1 defines a number of basic terms relating to covariance functions. Section 4.2 gives examples of stationary, dot-product, and other non-stationary covariance functions, and also gives some ways to make new ones from old. Section 4.3 introduces the important topic of eigenfunction analysis of covariance functions, and states Mercer's theorem which allows us to express the covariance function (under certain conditions) in terms of its eigenfunctions and eigenvalues. The covariance functions given in section 4.2 are valid when the input domain $\mathcal{X}$ is a subset of $\mathbb{R}^D$. In section 4.4 we describe ways to define covariance functions when the input domain is over structured objects such as strings and trees.
4.1 Preliminaries
A stationary covariance function is a function of $\mathbf{x} - \mathbf{x}'$. Thus it is invariant to translations in the input space.² For example the squared exponential covariance function given in equation (2.16) is stationary. If further the covariance function is a function only of $|\mathbf{x} - \mathbf{x}'|$ then it is called isotropic; it is thus invariant to all rigid motions. For example the squared exponential covariance function given in equation (2.16) is isotropic. As $k$ is now only a function of $r = |\mathbf{x} - \mathbf{x}'|$ these are also known as radial basis functions (RBFs).

¹To be a valid covariance function it must be positive semidefinite, see eq. (4.2).
²In stochastic process theory a process which has constant mean and whose covariance function is invariant to translations is called weakly stationary. A process is strictly stationary if all of its finite dimensional distributions are invariant to translations (Papoulis, 1991, sec. 10.1).
If a covariance function depends only on $\mathbf{x}$ and $\mathbf{x}'$ through $\mathbf{x} \cdot \mathbf{x}'$ we call it a dot product covariance function. A simple example is the covariance function $k(\mathbf{x}, \mathbf{x}') = \sigma_0^2 + \mathbf{x} \cdot \mathbf{x}'$ which can be obtained from linear regression by putting $N(0, 1)$ priors on the coefficients of $x_d$ ($d = 1, \ldots, D$) and a prior of $N(0, \sigma_0^2)$ on the bias (or constant function) 1, see eq. (2.15). Another important example is the inhomogeneous polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (\sigma_0^2 + \mathbf{x} \cdot \mathbf{x}')^p$ where $p$ is a positive integer. Dot product covariance functions are invariant to a rotation of the coordinates about the origin, but not translations.
A general name for a function $k$ of two arguments mapping a pair of inputs $\mathbf{x} \in \mathcal{X}$, $\mathbf{x}' \in \mathcal{X}$ into $\mathbb{R}$ is a kernel. This term arises in the theory of integral operators, where the operator $T_k$ is defined as

\[ (T_k f)(\mathbf{x}) = \int_{\mathcal{X}} k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')\, d\mu(\mathbf{x}'), \tag{4.1} \]

where $\mu$ denotes a measure; see section A.7 for further explanation of this point.³ A real kernel is said to be symmetric if $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$; clearly covariance functions must be symmetric from the definition.

Given a set of input points $\{\mathbf{x}_i \,|\, i = 1, \ldots, n\}$ we can compute the Gram matrix $K$ whose entries are $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. If $k$ is a covariance function we call the matrix $K$ the covariance matrix.

A real $n \times n$ matrix $K$ which satisfies $Q(\mathbf{v}) = \mathbf{v}^\top K \mathbf{v} \geq 0$ for all vectors $\mathbf{v} \in \mathbb{R}^n$ is called positive semidefinite (PSD). If $Q(\mathbf{v}) = 0$ only when $\mathbf{v} = \mathbf{0}$ the matrix is positive definite. $Q(\mathbf{v})$ is called a quadratic form. A symmetric matrix is PSD if and only if all of its eigenvalues are non-negative. A Gram matrix corresponding to a general kernel function need not be PSD, but the Gram matrix corresponding to a covariance function is PSD.

A kernel is said to be positive semidefinite if

\[ \int k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x})\, f(\mathbf{x}')\, d\mu(\mathbf{x})\, d\mu(\mathbf{x}') \geq 0, \tag{4.2} \]

for all $f \in L_2(\mathcal{X}, \mu)$. Equivalently a kernel function which gives rise to PSD Gram matrices for any choice of $n \in \mathbb{N}$ and $\mathcal{D}$ is positive semidefinite. To see this let $f$ be the weighted sum of delta functions at each $\mathbf{x}_i$. Since such functions are limits of functions in $L_2(\mathcal{X}, \mu)$, eq. (4.2) implies that the Gram matrix corresponding to any $\mathcal{D}$ is PSD.
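The PSD property can be checked numerically. The following Python sketch (our own illustration, not from the text) builds Gram matrices for two kernels: the squared exponential, which is a valid covariance function, and an arbitrary symmetric kernel which is not.

```python
import numpy as np

# Squared exponential kernel on scalar inputs, a valid covariance function
# (it is introduced formally later, in eq. 4.9).
def k_se(x, y, ell=1.0):
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

def gram_matrix(k, xs):
    """Gram matrix K with K[i, j] = k(x_i, x_j)."""
    return np.array([[k(xi, xj) for xj in xs] for xi in xs])

xs = np.linspace(-3, 3, 20)
K = gram_matrix(k_se, xs)

# A covariance function must yield a PSD Gram matrix: all eigenvalues >= 0
# (up to numerical round-off for a symmetric matrix).
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10

# A generic symmetric kernel need not be PSD: k(x, x') = -|x - x'| has
# trace zero, so any nonzero off-diagonal forces a negative eigenvalue.
K_bad = gram_matrix(lambda x, y: -abs(x - y), xs)
assert np.linalg.eigvalsh(K_bad).min() < -1e-10
```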
For a one-dimensional Gaussian process one way to understand the characteristic lengthscale of the process (if this exists) is in terms of the number of upcrossings of a level $u$. Adler (1981, Theorem 4.1.1) states that the expected
³Informally speaking, readers will usually be able to substitute $d\mathbf{x}$ or $p(\mathbf{x})d\mathbf{x}$ for $d\mu(\mathbf{x})$.

number of upcrossings $E[N_u]$ of the level $u$ on the unit interval by a zero-mean, stationary, almost surely continuous Gaussian process is given by

\[ E[N_u] = \frac{1}{2\pi} \sqrt{\frac{-k''(0)}{k(0)}}\, \exp\!\left(-\frac{u^2}{2k(0)}\right). \tag{4.3} \]

If $k''(0)$ does not exist (so that the process is not mean square differentiable) then if such a process has a zero at $x_0$ then it will almost surely have an infinite number of zeros in the arbitrarily small interval $(x_0, x_0 + \delta)$ (Blake and Lindsey, 1973, p. 303).
4.1.1 Mean Square Continuity and Differentiability
We now describe mean square continuity and differentiability of stochastic processes, following Adler (1981, sec. 2.2). Let $\mathbf{x}_1, \mathbf{x}_2, \ldots$ be a sequence of points and $\mathbf{x}_*$ be a fixed point in $\mathbb{R}^D$ such that $|\mathbf{x}_k - \mathbf{x}_*| \to 0$ as $k \to \infty$. Then a process $f(\mathbf{x})$ is continuous in mean square at $\mathbf{x}_*$ if $E[|f(\mathbf{x}_k) - f(\mathbf{x}_*)|^2] \to 0$ as $k \to \infty$. If this holds for all $\mathbf{x}_* \in A$ where $A$ is a subset of $\mathbb{R}^D$ then $f(\mathbf{x})$ is said to be continuous in mean square (MS) over $A$. A random field is continuous in mean square at $\mathbf{x}_*$ if and only if its covariance function $k(\mathbf{x}, \mathbf{x}')$ is continuous at the point $\mathbf{x} = \mathbf{x}' = \mathbf{x}_*$. For stationary covariance functions this reduces to checking continuity at $k(\mathbf{0})$. Note that MS continuity does not necessarily imply sample function continuity; for a discussion of sample function continuity and differentiability see Adler (1981, ch. 3).
The mean square derivative of $f(\mathbf{x})$ in the $i$th direction is defined as

\[ \frac{\partial f(\mathbf{x})}{\partial x_i} = \operatorname*{l.i.m.}_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x})}{h}, \tag{4.4} \]

when the limit exists, where l.i.m. denotes the limit in mean square and $\mathbf{e}_i$ is the unit vector in the $i$th direction. The covariance function of $\partial f(\mathbf{x})/\partial x_i$ is given by $\partial^2 k(\mathbf{x}, \mathbf{x}')/\partial x_i \partial x'_i$. These definitions can be extended to higher order derivatives. For stationary processes, if the $2k$th-order partial derivative $\partial^{2k} k(\mathbf{x})/\partial^2 x_{i_1} \cdots \partial^2 x_{i_k}$ exists and is finite at $\mathbf{x} = \mathbf{0}$ then the $k$th order partial derivative $\partial^k f(\mathbf{x})/\partial x_{i_1} \cdots \partial x_{i_k}$ exists for all $\mathbf{x} \in \mathbb{R}^D$ as a mean square limit. Notice that it is the properties of the kernel $k$ around $\mathbf{0}$ that determine the smoothness properties (MS differentiability) of a stationary process.
4.2 Examples of Covariance Functions
In this section we consider covariance functions where the input domain X is a subset of the vector space RD. More general input spaces are considered in section 4.4. We start in section 4.2.1 with stationary covariance functions, then consider dotproduct covariance functions in section 4.2.2 and other varieties of nonstationary covariance functions in section 4.2.3. We give an overview of some commonly used covariance functions in Table 4.1 and in section 4.2.4


we describe general methods for constructing new kernels from old. There exist several other good overviews of covariance functions, see e.g. Abrahamsen 1997.
4.2.1 Stationary Covariance Functions
In this section and section 4.3 it will be convenient to allow kernels to be a map from $\mathbf{x} \in \mathcal{X}$, $\mathbf{x}' \in \mathcal{X}$ into $\mathbb{C}$ rather than $\mathbb{R}$. If a zero-mean process $f$ is complex-valued, then the covariance function is defined as $k(\mathbf{x}, \mathbf{x}') = E[f(\mathbf{x}) \overline{f(\mathbf{x}')}]$, where the bar denotes complex conjugation.
A stationary covariance function is a function of $\tau = \mathbf{x} - \mathbf{x}'$. Sometimes in this case we will write $k$ as a function of a single argument, i.e. $k(\tau)$.
The covariance function of a stationary process can be represented as the Fourier transform of a positive finite measure.
Theorem 4.1 (Bochner's theorem) A complex-valued function $k$ on $\mathbb{R}^D$ is the covariance function of a weakly stationary mean square continuous complex-valued random process on $\mathbb{R}^D$ if and only if it can be represented as

\[ k(\tau) = \int_{\mathbb{R}^D} e^{2\pi i\, \mathbf{s} \cdot \tau}\, d\mu(\mathbf{s}), \tag{4.5} \]

where $\mu$ is a positive finite measure.

The statement of Bochner's theorem is quoted from Stein (1999, p. 24); a proof can be found in Gihman and Skorohod (1974, p. 208). If $\mu$ has a density $S(\mathbf{s})$ then $S$ is known as the spectral density or power spectrum corresponding to $k$.
The construction given by eq. (4.5) puts non-negative power into each frequency $\mathbf{s}$; this is analogous to the requirement that the prior covariance matrix $\Sigma_p$ on the weights in equation (2.4) be non-negative definite.
In the case that the spectral density $S(\mathbf{s})$ exists, the covariance function and the spectral density are Fourier duals of each other as shown in eq. (4.6);⁴ this is known as the Wiener-Khintchine theorem, see, e.g. Chatfield (1989):

\[ k(\tau) = \int S(\mathbf{s})\, e^{2\pi i\, \mathbf{s} \cdot \tau}\, d\mathbf{s}, \qquad S(\mathbf{s}) = \int k(\tau)\, e^{-2\pi i\, \mathbf{s} \cdot \tau}\, d\tau. \tag{4.6} \]

Notice that the variance of the process is $k(\mathbf{0}) = \int S(\mathbf{s})\, d\mathbf{s}$ so the power spectrum must be integrable to define a valid Gaussian process.

To gain some intuition for the definition of the power spectrum given in eq. (4.6) it is important to realize that the complex exponentials $e^{2\pi i\, \mathbf{s} \cdot \mathbf{x}}$ are eigenfunctions of a stationary kernel with respect to Lebesgue measure (see section 4.3 for further details). Thus $S(\mathbf{s})$ is, loosely speaking, the amount of power allocated on average to the eigenfunction $e^{2\pi i\, \mathbf{s} \cdot \mathbf{x}}$ with frequency $\mathbf{s}$. $S(\mathbf{s})$ must eventually decay sufficiently fast as $|\mathbf{s}| \to \infty$ so that it is integrable; the
4See Appendix A.8 for details of Fourier transforms.

rate of this decay of the power spectrum gives important information about the smoothness of the associated stochastic process. For example it can determine the mean-square differentiability of the process (see section 4.3 for further details).
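The Fourier-dual pair of eq. (4.6) can be checked numerically for a concrete kernel. The following Python sketch uses the squared exponential in $D = 1$, whose analytic spectral density $S(s) = \sqrt{2\pi}\,\ell\, e^{-2\pi^2 \ell^2 s^2}$ follows from the standard Gaussian integral; the grid sizes and tolerances are arbitrary choices.

```python
import numpy as np

# Squared exponential kernel k(tau) = exp(-tau^2 / (2 ell^2)) sampled
# on a fine grid wide enough that the tails are negligible.
ell = 0.7
tau = np.linspace(-20.0, 20.0, 40001)
dtau = tau[1] - tau[0]
k = np.exp(-tau**2 / (2.0 * ell**2))

def S(s):
    """Riemann-sum approximation to int k(tau) exp(-2*pi*i*s*tau) dtau.

    The imaginary part vanishes by symmetry, so only the cosine survives."""
    return np.sum(k * np.cos(2.0 * np.pi * s * tau)) * dtau

for s in (0.0, 0.25, 0.5):
    analytic = np.sqrt(2.0 * np.pi) * ell * np.exp(-2.0 * np.pi**2 * ell**2 * s**2)
    assert abs(S(s) - analytic) < 1e-6

# The process variance equals the total power: k(0) = int S(s) ds = 1.
s_grid = np.linspace(-4.0, 4.0, 2001)
total_power = np.sum([S(s) for s in s_grid]) * (s_grid[1] - s_grid[0])
assert abs(total_power - 1.0) < 1e-4
```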
If the covariance function is isotropic (so that it is a function of $r$, where $r = |\tau|$) then it can be shown that $S(\mathbf{s})$ is a function of $s = |\mathbf{s}|$ only (Adler, 1981, Theorem 2.5.2). In this case the integrals in eq. (4.6) can be simplified by changing to spherical polar coordinates and integrating out the angular variables (see e.g. Bracewell, 1986, ch. 12) to obtain

\[ k(r) = \frac{2\pi}{r^{D/2-1}} \int_0^\infty S(s)\, J_{D/2-1}(2\pi r s)\, s^{D/2}\, ds, \tag{4.7} \]

\[ S(s) = \frac{2\pi}{s^{D/2-1}} \int_0^\infty k(r)\, J_{D/2-1}(2\pi r s)\, r^{D/2}\, dr, \tag{4.8} \]

where $J_{D/2-1}$ is a Bessel function of order $D/2 - 1$. Note that the dependence on the dimensionality $D$ in equation (4.7) means that the same isotropic functional form of the spectral density can give rise to different isotropic covariance functions in different dimensions. Similarly, if we start with a particular isotropic covariance function $k(r)$ the form of spectral density will in general depend on $D$ (see, e.g. the Matérn class spectral density given in eq. (4.15)) and in fact $k(r)$ may not be valid for all $D$. A necessary condition for the spectral density to exist is that $\int r^{D-1} |k(r)|\, dr < \infty$; see Stein (1999, sec. 2.10) for more details.
We now give some examples of commonly-used isotropic covariance functions. The covariance functions are given in a normalized form where $k(0) = 1$; we can multiply $k$ by a (positive) constant $\sigma_f^2$ to get any desired process variance.
Squared Exponential Covariance Function
The squared exponential SE covariance function has already been introduced in chapter 2, eq. 2.16 and has the form
\[ k_{SE}(r) = \exp\!\left(-\frac{r^2}{2\ell^2}\right), \tag{4.9} \]

with parameter $\ell$ defining the characteristic lengthscale. Using eq. (4.3) we see that the mean number of level-zero upcrossings for a SE process in 1-d is $(2\pi\ell)^{-1}$, which confirms the role of $\ell$ as a lengthscale. This covariance function is infinitely differentiable, which means that the GP with this covariance function has mean square derivatives of all orders, and is thus very smooth. The spectral density of the SE covariance function is $S(s) = (2\pi\ell^2)^{D/2} \exp(-2\pi^2\ell^2 s^2)$. Stein (1999) argues that such strong smoothness assumptions are unrealistic for modelling many physical processes, and recommends the Matérn class (see below). However, the squared exponential is probably the most widely-used kernel within the kernel machines field.
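As an illustration, a minimal Python sketch of the normalized SE covariance and of drawing sample functions from the corresponding GP prior; the grid, seed and jitter term are incidental choices of ours.

```python
import numpy as np

# Normalized squared exponential covariance (eq. 4.9); multiply by
# sigma_f^2 to obtain any desired process variance.
def k_se(r, ell=1.0):
    return np.exp(-r ** 2 / (2.0 * ell ** 2))

# k(0) = 1 and the covariance decays monotonically with distance.
assert k_se(0.0) == 1.0
assert k_se(1.0) > k_se(2.0) > k_se(3.0)

# A longer lengthscale makes distant points more strongly correlated.
assert k_se(1.0, ell=2.0) > k_se(1.0, ell=0.5)

# Sample functions from the GP prior: draws from N(0, K) on a grid.
xs = np.linspace(-5.0, 5.0, 200)
K = k_se(xs[:, None] - xs[None, :])
rng = np.random.default_rng(0)
# Small jitter on the diagonal guards against numerical non-PSD-ness.
sample = rng.multivariate_normal(np.zeros(len(xs)), K + 1e-8 * np.eye(len(xs)))
```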

The SE kernel is infinitely divisible in that $k(r)^t$ is a valid kernel for all $t > 0$; the effect of raising $k$ to the power of $t$ is simply to rescale $\ell$.
We now digress briefly, to show that the squared exponential covariance function can also be obtained by expanding the input $x$ into a feature space defined by Gaussian-shaped basis functions centered densely in $x$-space. For simplicity of exposition we consider scalar inputs with basis functions

\[ \phi_c(x) = \exp\!\left(-\frac{(x - c)^2}{2\ell^2}\right), \tag{4.10} \]

where $c$ denotes the centre of the basis function. From sections 2.1 and 2.2 we recall that with a Gaussian prior on the weights $\mathbf{w} \sim N(\mathbf{0}, \sigma_p^2 I)$, this gives rise to a GP with covariance function

\[ k(x_p, x_q) = \sigma_p^2 \sum_{c=1}^{N} \phi_c(x_p)\, \phi_c(x_q). \tag{4.11} \]

Now, allowing an infinite number of basis functions centered everywhere on an interval (and scaling down the variance of the prior on the weights with the number of basis functions) we obtain the limit

\[ \lim_{N \to \infty} \frac{\sigma_p^2}{N} \sum_{c=1}^{N} \phi_c(x_p)\, \phi_c(x_q) = \sigma_p^2 \int_{c_{\min}}^{c_{\max}} \phi_c(x_p)\, \phi_c(x_q)\, dc. \tag{4.12} \]

Plugging in the Gaussian-shaped basis functions eq. (4.10) and letting the integration limits go to infinity we obtain

\[ k(x_p, x_q) = \sigma_p^2 \int_{-\infty}^{\infty} \exp\!\left(-\frac{(x_p - c)^2}{2\ell^2}\right) \exp\!\left(-\frac{(x_q - c)^2}{2\ell^2}\right) dc = \sqrt{\pi}\,\ell\,\sigma_p^2\, \exp\!\left(-\frac{(x_p - x_q)^2}{2(\sqrt{2}\,\ell)^2}\right), \tag{4.13} \]

which we recognize as a squared exponential covariance function with a $\sqrt{2}$ times longer lengthscale. The derivation is adapted from MacKay (1998). It is straightforward to generalize this construction to multivariate $\mathbf{x}$. See also eq. (4.30) for a similar construction where the centres of the basis functions are sampled from a Gaussian distribution; the constructions are equivalent when the variance of this Gaussian tends to infinity.

The Matérn Class of Covariance Functions

The Matérn class of covariance functions is given by

\[ k_{\text{Matern}}(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right), \tag{4.14} \]

with positive parameters $\nu$ and $\ell$, where $K_\nu$ is a modified Bessel function (Abramowitz and Stegun, 1965, sec. 9.6). This covariance function has a spectral density

\[ S(s) = \frac{2^D \pi^{D/2}\, \Gamma(\nu + D/2)\, (2\nu)^\nu}{\Gamma(\nu)\, \ell^{2\nu}} \left(\frac{2\nu}{\ell^2} + 4\pi^2 s^2\right)^{-(\nu + D/2)} \tag{4.15} \]

Figure 4.1: Panel (a): covariance functions, and (b): random functions drawn from Gaussian processes with Matérn covariance functions, eq. (4.14), for different values of $\nu$, with $\ell = 1$. The sample functions on the right were obtained using a discretization of the $x$-axis of 2000 equally-spaced points.
in $D$ dimensions. Note that the scaling is chosen so that for $\nu \to \infty$ we obtain the SE covariance function $e^{-r^2/2\ell^2}$, see eq. (A.25). Stein (1999) named this the Matérn class after the work of Matérn (1960). For the Matérn class the process $f(\mathbf{x})$ is $k$-times MS differentiable if and only if $\nu > k$. The Matérn covariance functions become especially simple when $\nu$ is half-integer: $\nu = p + 1/2$, where $p$ is a non-negative integer. In this case the covariance function is a product of an exponential and a polynomial of order $p$; the general expression can be derived from Abramowitz and Stegun (1965, eq. 10.2.15), giving
\[ k_{\nu=p+1/2}(r) = \exp\!\left(-\frac{\sqrt{2\nu}\, r}{\ell}\right) \frac{\Gamma(p+1)}{\Gamma(2p+1)} \sum_{i=0}^{p} \frac{(p+i)!}{i!\,(p-i)!} \left(\frac{\sqrt{8\nu}\, r}{\ell}\right)^{p-i}. \tag{4.16} \]

It is possible that the most interesting cases for machine learning are $\nu = 3/2$ and $\nu = 5/2$, for which

\[ k_{\nu=3/2}(r) = \left(1 + \frac{\sqrt{3}\, r}{\ell}\right) \exp\!\left(-\frac{\sqrt{3}\, r}{\ell}\right), \qquad k_{\nu=5/2}(r) = \left(1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\, r}{\ell}\right), \tag{4.17} \]

since for $\nu = 1/2$ the process becomes very rough (see below), and for $\nu \geq 7/2$, in the absence of explicit prior knowledge about the existence of higher order derivatives, it is probably very hard from finite noisy training examples to distinguish between values of $\nu \geq 7/2$ (or even to distinguish between finite values of $\nu$ and $\nu \to \infty$, the smooth squared exponential, in this case). For example a value of $\nu = 5/2$ was used in Cornford et al. (2002).
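The half-integer closed forms can be compared directly. A small Python sketch (the function names are ours) of the $\nu = 1/2$, $3/2$, $5/2$ cases of eq. (4.17) and the $\nu \to \infty$ (SE) limit:

```python
import numpy as np

# Half-integer Matern covariances: products of an exponential and a
# polynomial of order p = nu - 1/2, all with ell = 1 by default.
def matern12(r, ell=1.0):          # nu = 1/2: the exponential covariance
    return np.exp(-r / ell)

def matern32(r, ell=1.0):          # nu = 3/2
    a = np.sqrt(3.0) * r / ell
    return (1.0 + a) * np.exp(-a)

def matern52(r, ell=1.0):          # nu = 5/2
    a = np.sqrt(5.0) * r / ell
    return (1.0 + a + 5.0 * r**2 / (3.0 * ell**2)) * np.exp(-a)

def k_se(r, ell=1.0):              # nu -> infinity limit
    return np.exp(-r**2 / (2.0 * ell**2))

# All are normalized so that k(0) = 1 ...
for k in (matern12, matern32, matern52, k_se):
    assert np.isclose(k(0.0), 1.0)

# ... and at a moderate distance a larger nu gives a smoother, more
# slowly decaying covariance: k_{1/2} < k_{3/2} < k_{5/2} < k_SE at r = 1.
vals = [matern12(1.0), matern32(1.0), matern52(1.0), k_se(1.0)]
assert vals == sorted(vals)
```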
Ornstein-Uhlenbeck Process and Exponential Covariance Function
The special case obtained by setting $\nu = 1/2$ in the Matérn class gives the exponential covariance function $k(r) = \exp(-r/\ell)$. The corresponding process

Figure 4.2: Panel (a): covariance functions, and (b): random functions drawn from Gaussian processes with the γ-exponential covariance function eq. (4.18), for different values of $\gamma$, with $\ell = 1$. The sample functions are only differentiable when $\gamma = 2$ (the SE case). The sample functions on the right were obtained using a discretization of the $x$-axis of 2000 equally-spaced points.
is MS continuous but not MS differentiable. In $D = 1$ this is the covariance function of the Ornstein-Uhlenbeck (OU) process. The OU process (Uhlenbeck and Ornstein, 1930) was introduced as a mathematical model of the velocity of a particle undergoing Brownian motion. More generally in $D = 1$ setting $\nu + 1/2 = p$ for integer $p$ gives rise to a particular form of a continuous-time AR($p$) Gaussian process; for further details see section B.2.1. The form of the Matérn covariance function and samples drawn from it for $\nu = 1/2$, $\nu = 2$ and $\nu \to \infty$ are illustrated in Figure 4.1.
The γ-exponential Covariance Function
The γ-exponential family of covariance functions, which includes both the exponential and squared exponential, is given by

\[ k(r) = \exp\!\big(-(r/\ell)^\gamma\big) \quad \text{for } 0 < \gamma \leq 2. \tag{4.18} \]

Although this function has a similar number of parameters to the Matérn class, it is, as Stein (1999) notes, in a sense less flexible. This is because the corresponding process is not MS differentiable except when $\gamma = 2$, when it is infinitely MS differentiable. The covariance function and random samples from the process are shown in Figure 4.2. A proof of the positive definiteness of this covariance function can be found in Schoenberg (1938).
Rational Quadratic Covariance Function
The rational quadratic (RQ) covariance function

\[ k_{RQ}(r) = \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha} \tag{4.19} \]

Figure 4.3: Panel (a): covariance functions, and (b): random functions drawn from Gaussian processes with rational quadratic covariance functions, eq. (4.20), for different values of $\alpha$ with $\ell = 1$. The sample functions on the right were obtained using a discretization of the $x$-axis of 2000 equally-spaced points.
with $\alpha, \ell > 0$ can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic lengthscales (sums of covariance functions are also a valid covariance, see section 4.2.4). Parameterizing now in terms of inverse squared lengthscales, $\tau = \ell^{-2}$, and putting a gamma distribution on $p(\tau \,|\, \alpha, \beta) \propto \tau^{\alpha-1} \exp(-\alpha\tau/\beta)$,⁵ we can add up the contributions through the following integral

\[ k_{RQ}(r) = \int_0^\infty p(\tau \,|\, \alpha, \beta)\, k_{SE}(r \,|\, \tau)\, d\tau \propto \int_0^\infty \tau^{\alpha-1} \exp\!\left(-\frac{\alpha\tau}{\beta}\right) \exp\!\left(-\frac{\tau r^2}{2}\right) d\tau \propto \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}, \tag{4.20} \]

where we have set $\beta^{-1} = \ell^2$. The rational quadratic is also discussed by Matérn (1960, p. 17) using a slightly different parameterization; in our notation the limit of the RQ covariance for $\alpha \to \infty$ (see eq. (A.25)) is the SE covariance function with characteristic lengthscale $\ell$, eq. (4.9). Figure 4.3 illustrates the behaviour for different values of $\alpha$; note that the process is infinitely MS differentiable for every $\alpha$ in contrast to the Matérn covariance function in Figure 4.1.

The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution $p(\ell)$ of lengthscales $\ell$, $k(r) = \int \exp(-r^2/2\ell^2)\, p(\ell)\, d\ell$. This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension $D$ (see Stein, 1999, sec. 2.10).
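The scale-mixture identity of eq. (4.20) can be checked by Monte Carlo, along with the $\alpha \to \infty$ limit to the SE kernel. A Python sketch, with arbitrary sample sizes and tolerances:

```python
import numpy as np

# Rational quadratic (eq. 4.19) and squared exponential (eq. 4.9) kernels.
def k_rq(r, alpha, ell=1.0):
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

def k_se(r, ell=1.0):
    return np.exp(-r**2 / (2.0 * ell**2))

alpha, ell, r0 = 2.0, 1.0, 1.0
rng = np.random.default_rng(0)
# The gamma density p(tau | alpha, beta) with shape alpha and mean
# beta = ell^{-2} is Gamma(shape=alpha, scale=beta/alpha) in NumPy's
# parameterization; averaging SE kernels over tau reproduces the RQ.
taus = rng.gamma(shape=alpha, scale=1.0 / (alpha * ell**2), size=200_000)
mc_estimate = np.mean(np.exp(-taus * r0**2 / 2.0))
assert abs(mc_estimate - k_rq(r0, alpha, ell)) < 0.01

# As alpha grows, the RQ kernel approaches the SE kernel with the same ell.
r = np.linspace(0.0, 3.0, 31)
gap_small = np.max(np.abs(k_rq(r, 2.0) - k_se(r)))
gap_large = np.max(np.abs(k_rq(r, 1e6) - k_se(r)))
assert gap_large < 1e-5
assert gap_small > 1e-3
```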
Piecewise Polynomial Covariance Functions with Compact Support
A family of piecewise polynomial functions with compact support provide another interesting class of covariance functions. Compact support means that
⁵Note that there are several common ways to parameterize the Gamma distribution; our choice is convenient here: $\alpha$ is the shape and $\beta$ is the mean.


Figure 4.4: Panel (a): covariance functions, and (b): random functions drawn from Gaussian processes with piecewise polynomial covariance functions with compact support from eq. (4.21), for $(D, q) = (1, 1)$, $(3, 1)$ and $(1, 2)$.
the covariance between points becomes exactly zero when their distance exceeds a certain threshold. This means that the covariance matrix will become sparse by construction, leading to the possibility of computational advantages.⁶ The challenge in designing these functions is how to guarantee positive definiteness. Multiple algorithms for deriving such covariance functions are discussed by Wendland (2005, ch. 9). These functions are usually not positive definite for all input dimensions, but their validity is restricted up to some maximum dimension $D$. Below we give examples of covariance functions $k_{ppD,q}(r)$ which are positive definite in $\mathbb{R}^D$:
\[
\begin{aligned}
k_{ppD,0}(r) &= (1-r)_+^{\,j}, \quad \text{where } j = \lfloor D/2 \rfloor + q + 1,\\
k_{ppD,1}(r) &= (1-r)_+^{\,j+1} \big((j+1)r + 1\big),\\
k_{ppD,2}(r) &= (1-r)_+^{\,j+2} \big((j^2 + 4j + 3)r^2 + (3j + 6)r + 3\big)/3,\\
k_{ppD,3}(r) &= (1-r)_+^{\,j+3} \big((j^3 + 9j^2 + 23j + 15)r^3 + (6j^2 + 36j + 45)r^2 + (15j + 45)r + 15\big)/15.
\end{aligned} \tag{4.21}
\]
The properties of three of these covariance functions are illustrated in Figure 4.4. These covariance functions are $2q$-times continuously differentiable, and thus the corresponding processes are $q$-times mean-square differentiable, see section 4.1.1. It is interesting to ask to what extent one could use the compactly-supported covariance functions described above in place of the other covariance functions mentioned in this section, while obtaining inferences that are similar. One advantage of the compact support is that it gives rise to sparsity of the Gram matrix which could be exploited, for example, when using iterative solutions to the GPR problem, see section 8.3.6.
⁶If the product of the inverse covariance matrix with a vector (needed e.g. for prediction) is computed using a conjugate gradient algorithm, then products of the covariance matrix with vectors are the basic computational unit, and these can obviously be carried out much faster if the matrix is sparse.
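A short Python sketch of the $q = 0$ member of eq. (4.21) illustrates both the sparsity and the positive semidefiniteness of the resulting Gram matrix (the input grid is an arbitrary choice of ours):

```python
import numpy as np

# q = 0 piecewise polynomial kernel k_{ppD,0}(r) = (1 - r)_+^j (eq. 4.21),
# with j = floor(D/2) + q + 1; support vanishes for r >= 1.
def k_pp0(r, D=1, q=0):
    j = D // 2 + q + 1
    return np.where(r < 1.0, (1.0 - r) ** j, 0.0)

xs = np.linspace(0.0, 10.0, 101)
R = np.abs(xs[:, None] - xs[None, :])
K = k_pp0(R)

# Pairs of points more than distance 1 apart have exactly zero covariance,
# so K is sparse (banded here, since the inputs lie on a line) ...
assert np.all(K[R >= 1.0] == 0.0)
assert np.count_nonzero(K) < K.size // 2

# ... yet K is still positive semidefinite (up to round-off).
assert np.linalg.eigvalsh(K).min() > -1e-10
```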

Further Properties of Stationary Covariance Functions
The covariance functions given above decay monotonically with $r$ and are always positive. However, this is not a necessary condition for a covariance function. For example Yaglom (1987) shows that $k(r) = c(\alpha r)^{-\nu} J_\nu(\alpha r)$ is a valid covariance function for $\nu \geq (D-2)/2$ and $\alpha > 0$; this function has the form of a damped oscillation.
Anisotropic versions of these isotropic covariance functions can be created by setting $r^2(\mathbf{x}, \mathbf{x}') = (\mathbf{x} - \mathbf{x}')^\top M (\mathbf{x} - \mathbf{x}')$ for some positive semidefinite $M$. If $M$ is diagonal this implements the use of different lengthscales on different dimensions; for further discussion of automatic relevance determination see section 5.1. General $M$'s have been considered by Matérn (1960, p. 19), Poggio and Girosi (1990) and also in Vivarelli and Williams (1999); in the latter work a low-rank $M$ was used to implement a linear dimensionality reduction step from the input space to a lower-dimensional feature space. More generally, one could assume the form
\[ M = \Lambda\Lambda^\top + \Psi, \tag{4.22} \]

where $\Lambda$ is a $D \times k$ matrix whose columns define $k$ directions of high relevance, and $\Psi$ is a diagonal matrix with positive entries, capturing the usual axis-aligned relevances, see also Figure 5.1 on page 107. Thus $M$ has a factor analysis form. For appropriate choices of $k$ this may represent a good trade-off between flexibility and required number of parameters.
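A Python sketch of an anisotropic SE kernel using this factor analysis form; the particular $\Lambda$ and $\Psi$ values are illustrative only.

```python
import numpy as np

# Anisotropic SE kernel with distance r^2 = (x - x')^T M (x - x'),
# where M = Lambda Lambda^T + Psi has the factor analysis form of eq. (4.22).
def k_aniso(x, xp, Lam, psi_diag):
    M = Lam @ Lam.T + np.diag(psi_diag)
    d = x - xp
    return np.exp(-0.5 * d @ M @ d)

D = 3
Lam = np.array([[2.0], [0.0], [0.0]])   # one direction of high relevance
psi = np.array([0.01, 0.01, 0.01])      # weak axis-aligned relevances

x = np.zeros(D)
# Moving along the relevant direction changes the covariance much faster
# than moving along an (almost) irrelevant one.
along = k_aniso(x, x + np.array([1.0, 0.0, 0.0]), Lam, psi)
across = k_aniso(x, x + np.array([0.0, 1.0, 0.0]), Lam, psi)
assert along < across
```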
Stationary kernels can also be defined on a periodic domain, and can be readily constructed from stationary kernels on $\mathbb{R}$. Given a stationary kernel $k(x)$, the kernel $k_T(x) = \sum_{m \in \mathbb{Z}} k(x - ml)$ is periodic with period $l$, as shown in section B.2.2 and Schölkopf and Smola (2002, eq. 4.42).
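A Python sketch of this periodization, truncating the infinite sum over $m$ (the truncation length and lengthscale are arbitrary choices of ours):

```python
import numpy as np

# Periodizing a stationary kernel: k_T(x) = sum_m k(x - m*l) is periodic
# with period l; truncating the sum is an approximation, but with a short
# lengthscale the omitted terms are vanishingly small.
def k_se(x, ell=0.3):
    return np.exp(-x**2 / (2.0 * ell**2))

def k_periodic(x, period=1.0, terms=50):
    m = np.arange(-terms, terms + 1)
    return np.sum(k_se(x - m * period))

# Periodicity: shifting the argument by one period leaves the value unchanged.
for x in (0.0, 0.17, 0.42):
    assert np.isclose(k_periodic(x), k_periodic(x + 1.0), atol=1e-12)
```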
4.2.2 Dot Product Covariance Functions
As we have already mentioned above the kernel $k(\mathbf{x}, \mathbf{x}') = \sigma_0^2 + \mathbf{x} \cdot \mathbf{x}'$ can be obtained from linear regression. If $\sigma_0^2 = 0$ we call this the homogeneous linear kernel, otherwise it is inhomogeneous. Of course this can be generalized to $k(\mathbf{x}, \mathbf{x}') = \sigma_0^2 + \mathbf{x}^\top \Sigma_p \mathbf{x}'$ by using a general covariance matrix $\Sigma_p$ on the components of $\mathbf{x}$, as described in eq. (2.4).⁷ It is also the case that $k(\mathbf{x}, \mathbf{x}') = (\sigma_0^2 + \mathbf{x}^\top \Sigma_p \mathbf{x}')^p$ is a valid covariance function for positive integer $p$, because of the general result that a positive-integer power of a given covariance function is also a valid covariance function, as described in section 4.2.4. However, it is also interesting to show an explicit feature space construction for the polynomial covariance function. We consider the homogeneous polynomial case as the inhomogeneous case can simply be obtained by considering $\mathbf{x}$ to be extended
7Indeed the bias term could also be included in the general expression.

by concatenating a constant. We write

\[ k(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}')^p = \Big(\sum_{d=1}^{D} x_d x'_d\Big)^p = \Big(\sum_{d_1=1}^{D} x_{d_1} x'_{d_1}\Big) \cdots \Big(\sum_{d_p=1}^{D} x_{d_p} x'_{d_p}\Big) = \sum_{d_1=1}^{D} \cdots \sum_{d_p=1}^{D} (x_{d_1} \cdots x_{d_p})(x'_{d_1} \cdots x'_{d_p}) \triangleq \phi(\mathbf{x}) \cdot \phi(\mathbf{x}'). \tag{4.23} \]

Notice that this sum apparently contains $D^p$ terms but in fact it is less than this as the order of the indices in the monomial $x_{d_1} \cdots x_{d_p}$ is unimportant, e.g. for $p = 2$, $x_1 x_2$ and $x_2 x_1$ are the same monomial. We can remove the redundancy by defining a vector $\mathbf{m}$ whose entry $m_d$ specifies the number of times index $d$ appears in the monomial, under the constraint that $\sum_{d=1}^{D} m_d = p$. Thus $\phi_m(\mathbf{x})$, the feature corresponding to vector $\mathbf{m}$, is proportional to the monomial $x_1^{m_1} \cdots x_D^{m_D}$. The degeneracy of $\phi_m(\mathbf{x})$ is $\frac{p!}{m_1! \cdots m_D!}$ (where as usual we define $0! = 1$), giving the feature map

\[ \phi_m(\mathbf{x}) = \sqrt{\frac{p!}{m_1! \cdots m_D!}}\; x_1^{m_1} \cdots x_D^{m_D}. \tag{4.24} \]

For example, for $p = 2$ in $D = 2$, we have $\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$. Dot product kernels are sometimes used in a normalized form given by eq. (4.35).
For regression problems the polynomial kernel is a rather strange choice as the prior variance grows rapidly with $|\mathbf{x}|$ for $|\mathbf{x}| > 1$. However, such kernels have proved effective in high-dimensional classification problems (e.g. take $\mathbf{x}$ to be a vectorized binary image) where the input data are binary or greyscale normalized to $[-1, 1]$ on each dimension (Schölkopf and Smola, 2002, sec. 7.8).
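The feature-space identity can be confirmed for the small case worked above ($p = 2$, $D = 2$); a Python sketch:

```python
import numpy as np

# Explicit feature map of eq. (4.24) for p = 2, D = 2:
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2) reproduces the homogeneous
# polynomial kernel (x . x')^2 as a plain dot product in feature space.
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(1)
for _ in range(5):
    x, xp = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2)
```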
4.2.3 Other Nonstationary Covariance Functions
Above we have seen examples of nonstationary dot product kernels. However, there are also other interesting kernels which are not of this form. In this section we first describe the covariance function belonging to a particular type of neural network; this construction is due to Neal 1996.
Consider a network which takes an input $\mathbf{x}$, has one hidden layer with $N_H$ units and then linearly combines the outputs of the hidden units with a bias $b$ to obtain $f(\mathbf{x})$. The mapping can be written

\[ f(\mathbf{x}) = b + \sum_{j=1}^{N_H} v_j\, h(\mathbf{x}; \mathbf{u}_j), \tag{4.25} \]

where the $v_j$'s are the hidden-to-output weights and $h(\mathbf{x}; \mathbf{u})$ is the hidden unit transfer function (which we shall assume is bounded) which depends on the input-to-hidden weights $\mathbf{u}$. For example, we could choose $h(\mathbf{x}; \mathbf{u}) = \tanh(\mathbf{x} \cdot \mathbf{u})$. This architecture is important because it has been shown by Hornik (1993) that networks with one hidden layer are universal approximators as the number of

hidden units tends to infinity, for a wide class of transfer functions (but excluding polynomials). Let $b$ and the $v$'s have independent zero-mean distributions of variance $\sigma_b^2$ and $\sigma_v^2$, respectively, and let the weights $\mathbf{u}_j$ for each hidden unit be independently and identically distributed. Denoting all weights by $\mathbf{w}$, we obtain (following Neal, 1996)

\[
\begin{aligned}
E_{\mathbf{w}}[f(\mathbf{x})] &= 0, \qquad (4.26)\\
E_{\mathbf{w}}[f(\mathbf{x}) f(\mathbf{x}')] &= \sigma_b^2 + \sum_j \sigma_v^2\, E_{\mathbf{u}}[h(\mathbf{x}; \mathbf{u}_j)\, h(\mathbf{x}'; \mathbf{u}_j)] \qquad (4.27)\\
&= \sigma_b^2 + N_H\, \sigma_v^2\, E_{\mathbf{u}}[h(\mathbf{x}; \mathbf{u})\, h(\mathbf{x}'; \mathbf{u})], \qquad (4.28)
\end{aligned}
\]

where eq. (4.28) follows because all of the hidden units are identically distributed. The final term in equation (4.28) becomes $\omega^2 E_{\mathbf{u}}[h(\mathbf{x}; \mathbf{u})\, h(\mathbf{x}'; \mathbf{u})]$ by letting $\sigma_v^2$ scale as $\omega^2/N_H$.

The sum in eq. (4.27) is over $N_H$ identically and independently distributed random variables. As the transfer function is bounded, all moments of the distribution will be bounded and hence the central limit theorem can be applied, showing that the stochastic process will converge to a Gaussian process in the limit as $N_H \to \infty$.
By evaluating Eu hx; uhx ; u we can obtain the covariance function of the neural network. For example if we choose the error function hzerfz
neural network covariance function
2z et2 dt as the transfer function, let hx;uerfu0 D
0 j1
ujxj and
4.29
where x 1, x1 , . . . , xdis an augmented input vector. This is a true neural network covariance function. The sigmoid kernel kx, xtanhabxx has sometimes been proposed, but in fact this kernel is never positive defi nite and is thus not a valid covariance function, see, e.g. Sch olkopf and Smola 2002, p. 113. Figure 4.5 shows a plot of the neural network covariance function and samples from the prior. We have set diag02, 2. Samples from a GP with this covariance function can be viewed as superpositions of the functions erf u0ux, where 02 controls the variance of u0 and thus the amount of offset of these functions from the origin, and 2 controls u and thus the scaling on the xaxis. In Figure 4.5b we observe that the sample functions with largervary more quickly. Notice that the samples display the nonstationarity of the covariance function in that for large values of x or x they should tend to a constant value, consistent with the construction as a superposition of sigmoid functions.
Another interesting construction is to set h(x; u) = exp(−|x − u|²/2g²), where g sets the scale of this Gaussian basis function. With u ~ N(0, σ_u² I) we obtain

k_G(x, x') = ∫ exp(−|x − u|²/2g²) exp(−|x' − u|²/2g²) (2πσ_u²)^{−d/2} exp(−uᵀu/2σ_u²) du
           = (σ_e/σ_u)^d exp(−xᵀx/2σ_m²) exp(−|x − x'|²/2σ_s²) exp(−x'ᵀx'/2σ_m²),    (4.30)
C. E. RasmussenC. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. c 2006 Massachusetts Institute of Technology. www.GaussianProcess.orggpml
Figure 4.5: Panel (a): a plot of the covariance function k_NN(x, x') for σ₀ = 10, σ = 10. Panel (b): samples drawn from the neural network covariance function with σ₀ = 2 and σ as shown in the legend. The samples were obtained using a discretization of the x-axis of 500 equally-spaced points.
where 1/σ_e² = 2/g² + 1/σ_u², σ_s² = 2g² + g⁴/σ_u² and σ_m² = 2σ_u² + g². This is in general a non-stationary covariance function, but if σ_u² → ∞ (while scaling ω² appropriately) we recover the squared exponential k_G(x, x') ∝ exp(−|x − x'|²/4g²). For a finite value of σ_u², k_G(x, x') comprises a squared exponential covariance function modulated by the Gaussian decay envelope functions exp(−xᵀx/2σ_m²) and exp(−x'ᵀx'/2σ_m²), cf. the vertical rescaling construction described in section 4.2.4.
One way to introduce non-stationarity is to introduce an arbitrary non-linear mapping (or warping) u(x) of the input x and then use a stationary covariance function in u-space. Note that x and u need not have the same dimensionality as each other. This approach was used by Sampson and Guttorp [1992] to model patterns of solar radiation in southwestern British Columbia using Gaussian processes.
Another interesting example of this warping construction is given in MacKay [1998], where the one-dimensional input variable x is mapped to the two-dimensional u(x) = (cos(x), sin(x)) to give rise to a periodic random function of x. If we use the squared exponential kernel in u-space, then we obtain
k(x, x') = exp( −2 sin²((x − x')/2) / ℓ² ),    (4.31)

as (cos(x) − cos(x'))² + (sin(x) − sin(x'))² = 4 sin²((x − x')/2).
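To make the warping concrete, here is a small sketch (Python/NumPy assumed) that builds the periodic kernel both directly from eq. (4.31) and explicitly as an SE kernel applied to u(x) = (cos(x), sin(x)); the two agree by the trigonometric identity above:

```python
import numpy as np

def k_periodic(x, xp, ell):
    """Periodic covariance of eq. (4.31)."""
    return np.exp(-2.0 * np.sin((x - xp) / 2.0) ** 2 / ell ** 2)

def k_se_warped(x, xp, ell):
    """Same kernel built from the warping u(x) = (cos x, sin x) plus an SE kernel in u-space."""
    u = np.array([np.cos(x), np.sin(x)])
    up = np.array([np.cos(xp), np.sin(xp)])
    return np.exp(-np.sum((u - up) ** 2) / (2.0 * ell ** 2))
```

Since |u(x) − u(x')|² = 4 sin²((x − x')/2), both routines return identical values, and the resulting covariance is periodic in x with period 2π.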
Figure 4.6: Panel (a) shows the chosen length-scale function ℓ(x). Panel (b) shows three samples from the GP prior using Gibbs' covariance function, eq. (4.32). This figure is based on Fig. 3.9 in Gibbs [1997].
We have described above how to make an anisotropic covariance function by scaling different dimensions differently. However, we are not free to make these length-scales ℓ_d be functions of x, as this will not in general produce a valid covariance function. Gibbs [1997] derived the covariance function
k(x, x') = Π_{d=1}^{D} ( 2 ℓ_d(x) ℓ_d(x') / (ℓ_d²(x) + ℓ_d²(x')) )^{1/2} exp( − Σ_{d=1}^{D} (x_d − x_d')² / (ℓ_d²(x) + ℓ_d²(x')) ),    (4.32)
where each ℓ_d(x) is an arbitrary positive function of x. Note that k(x, x) = 1 for all x. This covariance function is obtained by considering a grid of N Gaussian basis functions with centres c_j and a corresponding length-scale on input dimension d which varies as a positive function ℓ_d(c_j). Taking the limit as N → ∞, the sum turns into an integral and after some algebra eq. (4.32) is obtained.
An example of a variable length-scale function and samples from the prior corresponding to eq. (4.32) are shown in Figure 4.6. Notice that as the length-scale gets shorter the sample functions vary more rapidly, as one would expect. The large length-scale regions on either side of the short length-scale region can be quite strongly correlated. If one tries the converse experiment by creating a length-scale function ℓ(x) which has a longer length-scale region between two shorter ones, then the behaviour may not be quite what is expected; on initially transitioning into the long length-scale region the covariance drops off quite sharply due to the prefactor in eq. (4.32), before stabilizing to a slower variation. See Gibbs [1997, sec. 3.10.3] for further details. Exercises 4.5.4 and 4.5.5 invite you to investigate this further.
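In one dimension eq. (4.32) is straightforward to implement. The following sketch (Python/NumPy assumed; the length-scale function is an arbitrary illustrative choice) builds a Gram matrix and lets one check the properties noted above numerically:

```python
import numpy as np

def k_gibbs(x, xp, ell):
    """Gibbs covariance, eq. (4.32), in 1-d; ell is any positive function of x."""
    lx, lxp = ell(x), ell(xp)
    prefactor = np.sqrt(2.0 * lx * lxp / (lx ** 2 + lxp ** 2))
    return prefactor * np.exp(-(x - xp) ** 2 / (lx ** 2 + lxp ** 2))

ell = lambda x: 0.3 + 0.2 * np.sin(x)      # an arbitrary positive length-scale function
xs = np.linspace(0.0, 4.0, 80)
K = np.array([[k_gibbs(a, b, ell) for b in xs] for a in xs])
```

In particular k(x, x) = 1 for every x, and the Gram matrix remains positive semi-definite however ℓ(x) varies, which is exactly what a naive x-dependent length-scale would fail to guarantee.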
Paciorek and Schervish [2004] have generalized Gibbs' construction to obtain non-stationary versions of arbitrary isotropic covariance functions. Let k_S be a
covariance function      expression                                                   S    ND
constant                 σ₀²                                                          ✓
linear                   Σ_{d=1}^{D} σ_d² x_d x_d'
polynomial               (xᵀx' + σ₀²)^p
squared exponential      exp(−r²/2ℓ²)                                                 ✓    ✓
Matérn                   (2^{1−ν}/Γ(ν)) (√(2ν) r/ℓ)^ν K_ν(√(2ν) r/ℓ)                  ✓    ✓
exponential              exp(−r/ℓ)                                                    ✓    ✓
γ-exponential            exp(−(r/ℓ)^γ)                                                ✓    ✓
rational quadratic       (1 + r²/2αℓ²)^{−α}                                           ✓    ✓
neural network           sin⁻¹( 2x̃ᵀΣx̃' / √((1+2x̃ᵀΣx̃)(1+2x̃'ᵀΣx̃')) )                      ✓

Table 4.1: Summary of several commonly-used covariance functions. The covariances are written either as a function of x and x', or as a function of r = |x − x'|. The two columns marked S and ND indicate whether the covariance functions are stationary and nondegenerate respectively. Degenerate covariance functions have finite rank; see section 4.3 for more discussion of this issue.
stationary, isotropic covariance function that is valid in every Euclidean space R^D for D = 1, 2, .... Let Σ(x) be a D×D matrix-valued function which is positive definite for all x, and let Σ_i ≡ Σ(x_i). (The set of Gibbs' ℓ_d(x) functions define a diagonal Σ(x).) Then define the quadratic form

Q_ij = (x_i − x_j)ᵀ ((Σ_i + Σ_j)/2)^{−1} (x_i − x_j).    (4.33)

Paciorek and Schervish [2004] show that

k_NS(x_i, x_j) = 2^{D/2} |Σ_i|^{1/4} |Σ_j|^{1/4} |Σ_i + Σ_j|^{−1/2} k_S(√Q_ij)    (4.34)

is a valid non-stationary covariance function.
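In one dimension with Σ(x) = ℓ²(x), eqs. (4.33)-(4.34) reduce to a simple scalar formula. A sketch (Python/NumPy assumed; length-scale function chosen for illustration), which recovers Gibbs' kernel, eq. (4.32), when k_S is the unit-length-scale SE kernel exp(−r²/2):

```python
import numpy as np

def k_ns(xi, xj, ell, k_s):
    """Paciorek-Schervish non-stationary kernel, eqs. (4.33)-(4.34), in 1-d,
    with Sigma(x) = ell(x)^2 and k_s a stationary isotropic kernel of r >= 0."""
    li2, lj2 = ell(xi) ** 2, ell(xj) ** 2
    Q = (xi - xj) ** 2 / ((li2 + lj2) / 2.0)                       # eq. (4.33)
    return np.sqrt(2.0) * (li2 * lj2) ** 0.25 * (li2 + lj2) ** -0.5 * k_s(np.sqrt(Q))
```

With a constant ℓ the construction is stationary and equals k_S(|x − x'|/ℓ), so the usual isotropic kernel is recovered as a special case.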
In chapter 2 we described the linear regression model in feature space, f(x) = φ(x)ᵀw. O'Hagan [1978] suggested making w a function of x to allow for different values of w to be appropriate in different regions. Thus he put a Gaussian process prior on w of the form cov(w(x), w(x')) = W₀ k_w(x, x') for some positive definite matrix W₀, giving rise to a prior on f(x) with covariance k_f(x, x') = φ(x)ᵀ W₀ φ(x') k_w(x, x').
Finally we note that the Wiener process with covariance function k(x, x') = min(x, x') is a fundamental non-stationary process. See section B.2.1 and texts such as Grimmett and Stirzaker [1992, ch. 13] for further details.
4.2.4 Making New Kernels from Old
In the previous sections we have developed many covariance functions some of which are summarized in Table 4.1. In this section we show how to combine or modify existing covariance functions to make new ones.
The sum of two kernels is a kernel. Proof: consider the random process f(x) = f₁(x) + f₂(x), where f₁(x) and f₂(x) are independent. Then k(x, x') = k₁(x, x') + k₂(x, x'). This construction can be used e.g. to add together kernels with different characteristic length-scales.
The product of two kernels is a kernel. Proof: consider the random process f(x) = f₁(x) f₂(x), where f₁(x) and f₂(x) are independent. Then k(x, x') = k₁(x, x') k₂(x, x').⁸ A simple extension of this argument shows that k^p(x, x') is a valid covariance function for p ∈ N.
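These two closure properties are easy to check numerically. The sketch below (Python/NumPy assumed, with illustrative base kernels) verifies that Gram matrices built from a sum and a product of valid kernels remain positive semi-definite:

```python
import numpy as np

def k_se(x, xp, ell):
    """Squared exponential base kernel with length-scale ell."""
    return np.exp(-(x - xp) ** 2 / (2.0 * ell ** 2))

def k_sum(x, xp):       # sum of SE kernels with different characteristic length-scales
    return k_se(x, xp, 0.1) + k_se(x, xp, 2.0)

def k_prod(x, xp):      # product of an SE kernel and a linear kernel
    return k_se(x, xp, 1.0) * (x * xp)

xs = np.linspace(-3.0, 3.0, 60)
for k in (k_sum, k_prod):
    K = np.array([[k(a, b) for b in xs] for a in xs])
    assert np.linalg.eigvalsh(K).min() > -1e-8   # Gram matrix stays (numerically) PSD
```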
Let a(x) be a given deterministic function and consider g(x) = a(x) f(x), where f(x) is a random process. Then cov(g(x), g(x')) = a(x) k(x, x') a(x'). Such a construction can be used to normalize kernels by choosing a(x) = k^{−1/2}(x, x) (assuming k(x, x) > 0 for all x), so that

k̃(x, x') = k(x, x') / √( k(x, x) k(x', x') ).    (4.35)

This ensures that k̃(x, x) = 1 for all x.
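As a sketch (Python/NumPy assumed), the normalization in eq. (4.35) can be written as a higher-order function applied to any kernel whose diagonal is not constant, e.g. a polynomial kernel:

```python
import numpy as np

def normalize(k):
    """Vertical rescaling with a(x) = k(x, x)^{-1/2}, giving eq. (4.35)."""
    return lambda x, xp: k(x, xp) / np.sqrt(k(x, x) * k(xp, xp))

k_poly = lambda x, xp: (x * xp + 1.0) ** 3   # polynomial kernel; its diagonal grows with |x|
k_norm = normalize(k_poly)
```

After normalization the prior variance is unity everywhere, while the correlation structure of the original kernel is retained.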
We can also obtain a new process by convolution or blurring. Consider an arbitrary fixed kernel h(x, z) and the map g(x) = ∫ h(x, z) f(z) dz. Then clearly cov(g(x), g(x')) = ∫∫ h(x, z) k(z, z') h(x', z') dz dz'.
If k₁(x₁, x₁') and k₂(x₂, x₂') are covariance functions over different spaces X₁ and X₂, then the direct sum k(x, x') = k₁(x₁, x₁') + k₂(x₂, x₂') and the tensor product k(x, x') = k₁(x₁, x₁') k₂(x₂, x₂') are also covariance functions (defined on the product space X₁ × X₂), by virtue of the sum and product constructions.
The direct sum construction can be further generalized. Consider a function f(x), where x is D-dimensional. An additive model [Hastie and Tibshirani, 1990] has the form f(x) = c + Σ_{i=1}^{D} f_i(x_i), i.e. a linear combination of functions of one variable. If the individual f_i's are taken to be independent stochastic processes, then the covariance function of f will have the form of a direct sum. If we now admit interactions of two variables, so that f(x) = c + Σ_{i=1}^{D} f_i(x_i) + Σ_{i, j<i} f_{ij}(x_i, x_j), and the various f_i's and f_{ij}'s are independent stochastic processes, then the covariance function will have the form k(x, x') = Σ_{i=1}^{D} k_i(x_i, x_i') + Σ_{i=2}^{D} Σ_{j=1}^{i−1} k_{ij}(x_i, x_j; x_i', x_j'). Indeed this process can be extended further to provide a functional ANOVA⁹ decomposition, ranging from a simple additive model up to full interaction of all D input variables. The sum can also be truncated at some stage. Wahba [1990, ch. 10] and Stitson et al. [1999] suggest using tensor products for kernels with interactions, so that in the example above k_{ij}(x_i, x_j; x_i', x_j') would have the form k_i(x_i; x_i') k_j(x_j; x_j'). Note that if D is large then the large number of pairwise or higher-order terms may be problematic; Plate [1999] has investigated using a combination of additive GP models plus a general covariance function that permits full interactions.
⁸ If f₁ and f₂ are Gaussian processes then the product f will not in general be a Gaussian process, but there exists a GP with this covariance function.
⁹ ANOVA stands for analysis of variance, a statistical technique that analyzes the interactions between various attributes.
4.3 Eigenfunction Analysis of Kernels
We first define eigenvalues and eigenfunctions and discuss Mercer's theorem, which allows us to express the kernel (under certain conditions) in terms of these quantities. Section 4.3.1 gives the analytical solution of the eigenproblem for the SE kernel under a Gaussian measure. Section 4.3.2 discusses how to compute approximate eigenfunctions numerically for cases where the exact solution is not known.
It turns out that Gaussian process regression can be viewed as Bayesian linear regression with a possibly infinite number of basis functions, as discussed in chapter 2. One possible basis set is the eigenfunctions of the covariance function. A function φ(·) that obeys the integral equation

∫ k(x, x') φ(x) dμ(x) = λ φ(x'),    (4.36)

is called an eigenfunction of kernel k with eigenvalue λ with respect to measure¹⁰ μ. The two measures of particular interest to us will be (i) Lebesgue measure over a compact subset C of R^D, or (ii) when there is a density p(x) so that dμ(x) can be written p(x) dx.
In general there are an infinite number of eigenfunctions, which we label φ₁(x), φ₂(x), .... We assume the ordering is chosen such that λ₁ ≥ λ₂ ≥ .... The eigenfunctions are orthogonal with respect to μ and can be chosen to be normalized so that ∫ φ_i(x) φ_j(x) dμ(x) = δ_ij, where δ_ij is the Kronecker delta.
Mercer's theorem (see, e.g., König [1986]) allows us to express the kernel k in terms of the eigenvalues and eigenfunctions.
Theorem 4.2 (Mercer's theorem). Let (X, μ) be a finite measure space and k ∈ L_∞(X², μ²) be a kernel such that T_k : L₂(X, μ) → L₂(X, μ) is positive definite (see eq. (4.2)). Let φ_i ∈ L₂(X, μ) be the normalized eigenfunctions of T_k associated with the eigenvalues λ_i > 0. Then:

1. the eigenvalues {λ_i}_{i=1}^{∞} are absolutely summable;

2. k(x, x') = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i*(x'),    (4.37)

holds μ² almost everywhere, where the series converges absolutely and uniformly μ² almost everywhere.

This decomposition is just the infinite-dimensional analogue of the diagonalization of a Hermitian matrix. Note that the sum may terminate at some value N ∈ N (i.e. the eigenvalues beyond N are zero), or the sum may be infinite. We have the following definition [Press et al., 1992, p. 794]:

¹⁰ For further explanation of measure see Appendix A.7.
Definition 4.1 A degenerate kernel has only a finite number of non-zero eigenvalues. A degenerate kernel is also said to have finite rank. If a kernel is not degenerate it is said to be nondegenerate. As an example, an N-dimensional linear regression model in feature space (see eq. (2.10)) gives rise to a degenerate kernel with at most N non-zero eigenvalues. Of course if the measure only puts weight on a finite number of points n in x-space, then the eigendecomposition is simply that of an n×n matrix, even if the kernel is nondegenerate.
The statement of Mercer's theorem above referred to a finite measure μ. If we replace this with Lebesgue measure and consider a stationary covariance function, then directly from Bochner's theorem eq. (4.5) we obtain

k(x − x') = ∫_{R^D} e^{2πi s·(x−x')} dμ(s) = ∫_{R^D} e^{2πi s·x} ( e^{2πi s·x'} )* dμ(s).    (4.38)
The complex exponentials e^{2πi s·x} are the eigenfunctions of a stationary kernel w.r.t. Lebesgue measure. Note the similarity to eq. (4.37), except that the summation has been replaced by an integral.
The rate of decay of the eigenvalues gives important information about the smoothness of the kernel. For example Ritter et al. [1995] showed that in 1-d with μ uniform on [0, 1], processes which are r-times mean-square differentiable have λ_i ∝ i^{−(2r+2)} asymptotically. This makes sense as rougher processes have more power at high frequencies, and so their eigenvalue spectrum decays more slowly. The same phenomenon can be read off from the power spectrum of the Matérn class as given in eq. (4.15).
Hawkins [1989] gives the exact eigenvalue spectrum for the OU process on [0, 1]. Widom [1963; 1964] gives an asymptotic analysis of the eigenvalues of stationary kernels taking into account the effect of the density dμ(x) = p(x) dx; Bach and Jordan [2002, Table 3] use these results to show the effect of varying p(x) for the SE kernel. An exact eigenanalysis of the SE kernel under the Gaussian density is given in the next section.
4.3.1 An Analytic Example
For the case that p(x) is a Gaussian and for the squared-exponential kernel k(x, x') = exp(−(x − x')²/2ℓ²), there are analytic results for the eigenvalues and eigenfunctions, as given by Zhu et al. [1998, sec. 4]. Putting p(x) = N(x | 0, σ²) we find that the eigenvalues λ_k and eigenfunctions φ_k (for convenience let k = 0, 1, ...) are given by

λ_k = √(2a/A) B^k,    (4.39)
φ_k(x) = exp(−(c − a) x²) H_k(√(2c) x),    (4.40)
Figure 4.7: The first 3 eigenfunctions of the squared exponential kernel w.r.t. a Gaussian density. The value of k = 0, 1, 2 is equal to the number of zero-crossings of the function. The dashed line is proportional to the density p(x).
where H_k(x) = (−1)^k exp(x²) d^k/dx^k exp(−x²) is the k-th order Hermite polynomial (see Gradshteyn and Ryzhik [1980, sec. 8.95]), a⁻¹ = 4σ², b⁻¹ = 2ℓ² and

c = √(a² + 2ab),    A = a + b + c,    B = b/A.    (4.41)
Hints on the proof of this result are given in exercise 4.5.9. A plot of the first three eigenfunctions for a = 1 and b = 3 is shown in Figure 4.7.
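These expressions are easy to check numerically. The sketch below (Python/NumPy assumed; NumPy's Hermite utilities supply H_k) evaluates λ_k and φ_k from eqs. (4.39)-(4.41) and verifies the integral equation (4.36) by simple quadrature:

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def lam_phi(k, a, b):
    """Eigenvalue and (unnormalized) eigenfunction from eqs. (4.39)-(4.41)."""
    c = np.sqrt(a ** 2 + 2.0 * a * b)
    A = a + b + c
    B = b / A
    lam = np.sqrt(2.0 * a / A) * B ** k
    phi = lambda x: np.exp(-(c - a) * x ** 2) * hermval(np.sqrt(2.0 * c) * x, [0.0] * k + [1.0])
    return lam, phi

a, b = 1.0, 3.0                    # the values used for Figure 4.7
lam1, phi1 = lam_phi(1, a, b)

# quadrature check of eq. (4.36): int k(x, x') phi(x) p(x) dx = lam * phi(x'),
# with k(x, x') = exp(-b (x - x')^2) and p(x) = sqrt(2a/pi) exp(-2a x^2)
xs = np.linspace(-6.0, 6.0, 4001)
dx = xs[1] - xs[0]
p = np.sqrt(2.0 * a / np.pi) * np.exp(-2.0 * a * xs ** 2)
xp = 0.7
lhs = np.sum(np.exp(-b * (xs - xp) ** 2) * phi1(xs) * p) * dx
```

The quadrature value agrees with λ₁ φ₁(x') to high accuracy, and successive eigenvalues decay geometrically by the factor B.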
The result for the eigenvalues and eigenfunctions is readily generalized to the multivariate case when the kernel and Gaussian density are products of the univariate expressions, as the eigenfunctions and eigenvalues will simply be products too. For the case that a and b are equal on all D dimensions, the degeneracy of the eigenvalue (2a/A)^{D/2} B^k is C(k+D−1, D−1), which is O(k^{D−1}). As Σ_{j=0}^{k} C(j+D−1, D−1) = C(k+D, D), we see that the C(k+D, D)-th eigenvalue has a value given by (2a/A)^{D/2} B^k, and this can be used to determine the rate of decay of the spectrum.
4.3.2 Numerical Approximation of Eigenfunctions
The standard numerical method for approximating the eigenfunctions and eigenvalues of eq. (4.36) is to use a numerical routine to approximate the integral (see, e.g., Baker [1977, ch. 3]). For example, letting dμ(x) = p(x) dx in eq. (4.36), one could use the approximation

λ_i φ_i(x') = ∫ k(x, x') p(x) φ_i(x) dx ≈ (1/n) Σ_{l=1}^{n} k(x_l, x') φ_i(x_l),    (4.42)
where the x_l are sampled from p(x). Plugging in x' = x_l for l = 1, ..., n into eq. (4.42) we obtain the matrix eigenproblem

K u_i = λ_i^{mat} u_i,    (4.43)
where K is the n×n Gram matrix with entries K_ij = k(x_i, x_j), λ_i^{mat} is the i-th matrix eigenvalue and u_i is the corresponding eigenvector (normalized so that u_iᵀu_i = 1). We have φ_i(x_j) ∼ √n (u_i)_j, where the √n factor arises from the differing normalizations of the eigenvector and eigenfunction. Thus (1/n) λ_i^{mat} is an obvious estimator for λ_i for i = 1, ..., n. For fixed n one would expect that the larger eigenvalues would be better estimated than the smaller ones. The theory of the numerical solution of eigenvalue problems shows that for a fixed i, (1/n) λ_i^{mat} will converge to λ_i in the limit that n → ∞ [Baker, 1977, Theorem 3.4].
It is also possible to study the convergence further; for example it is quite easy, using the properties of principal components analysis (PCA) in feature space, to show that for any l, 1 ≤ l ≤ n, E_n[(1/n) Σ_{i=1}^{l} λ_i^{mat}] ≥ Σ_{i=1}^{l} λ_i and E_n[(1/n) Σ_{i=l+1}^{n} λ_i^{mat}] ≤ Σ_{i=l+1}^{N} λ_i, where E_n denotes expectation with respect to samples of size n drawn from p(x). For further details see Shawe-Taylor and Williams [2003].
The Nyström method for approximating the i-th eigenfunction (see Baker [1977] and Press et al. [1992, section 18.1]) is given by

φ_i(x') ≈ (√n / λ_i^{mat}) k(x')ᵀ u_i,    (4.44)

where k(x') = (k(x₁, x'), ..., k(x_n, x'))ᵀ, which is obtained from eq. (4.42) by dividing both sides by λ_i. Equation (4.44) extends the approximation φ_i(x_j) ∼ √n (u_i)_j from the sample points x₁, ..., x_n to all x.
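A minimal sketch of the procedure (Python/NumPy assumed): eigendecompose the Gram matrix on samples from p(x), estimate λ_i ≈ (1/n) λ_i^{mat}, and extend the top eigenvectors to new inputs via eq. (4.44). The SE kernel and Gaussian density below match the analytic example of section 4.3.1 with a = 1, b = 3:

```python
import numpy as np

def nystrom(k, xs, x_new, m):
    """Estimate the top-m eigenvalues (eq. 4.43) and extend the eigenfunctions
    to x_new with the Nystrom formula (eq. 4.44)."""
    n = len(xs)
    K = k(xs[:, None], xs[None, :])
    lam_mat, U = np.linalg.eigh(K)                    # ascending order
    lam_mat, U = lam_mat[::-1][:m], U[:, ::-1][:, :m] # keep the top m eigenpairs
    lam_est = lam_mat / n                             # (1/n) lam^mat estimates lam_i
    phi_new = np.sqrt(n) * (k(x_new[:, None], xs[None, :]) @ U) / lam_mat[None, :]
    return lam_est, phi_new

k_se = lambda x, xp: np.exp(-3.0 * (x - xp) ** 2)     # b = 3
rng = np.random.default_rng(0)
xs = 0.5 * rng.standard_normal(1000)                  # p(x) = N(0, 1/4), i.e. a = 1
lam_est, phi = nystrom(k_se, xs, xs, 3)
```

At the sample points the Nyström extension reproduces √n (u_i)_j exactly, so the approximate eigenfunctions are orthonormal under the empirical measure, and the leading eigenvalue estimate is close to the analytic λ₀ = √(2a/A) ≈ 0.549.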
There is an interesting relationship between the kernel PCA method of Schölkopf et al. [1998] and the eigenfunction expansion discussed above. The eigenfunction expansion has (at least potentially) an infinite number of non-zero eigenvalues. In contrast, the kernel PCA algorithm operates on the n×n matrix K and yields n eigenvalues and eigenvectors. Eq. (4.42) clarifies the relationship between the two. However, note that eq. (4.44) is identical (up to scaling factors) to Schölkopf et al. [1998, eq. 4.1], which describes the projection of a new point x onto the i-th eigenvector in the kernel PCA feature space.
4.4 Kernels for Nonvectorial Inputs
So far in this chapter we have assumed that the input x is a vector, measuring the values of a number of attributes or features. However, for some learning problems the inputs are not vectors, but structured objects such as strings, trees or general graphs. For example, we may have a biological problem where we want to classify proteins, represented as strings of amino acid symbols.¹¹
¹¹ Proteins are initially made up of 20 different amino acids, of which a few may later be modified, bringing the total number up to 26 or 30.
Or our input may be parse trees derived from a linguistic analysis. Or we may wish to represent chemical compounds as labelled graphs, with vertices denoting atoms and edges denoting bonds.
To follow the discriminative approach we need to extract some features from the input objects and build a predictor using these features. (For a classification problem, the alternative generative approach would construct class-conditional models over the objects themselves.) Below we describe two approaches to this feature extraction problem and the efficient computation of kernels from them: in section 4.4.1 we cover string kernels, and in section 4.4.2 we describe Fisher kernels. There exist other proposals for constructing kernels for strings; for example Watkins [2000] describes the use of pair hidden Markov models (HMMs which generate output symbols for two strings conditional on the hidden state) for this purpose.
4.4.1 String Kernels
We start by defining some notation for strings. Let A be a finite alphabet of characters. The concatenation of strings x and y is written xy, and |x| denotes the length of string x. The string s is a substring of x if we can write x = usv for some (possibly empty) u, s and v.
Let φ_s(x) denote the number of times that substring s appears in string x. Then we define the kernel between two strings x and x' as

k(x, x') = Σ_{s ∈ A*} w_s φ_s(x) φ_s(x'),    (4.45)
where w_s is a non-negative weight for substring s. For example, we could set w_s = λ^{|s|}, where 0 < λ < 1, so that shorter substrings get more weight than longer ones.
A number of interesting special cases are contained in the definition (4.45):
Setting w_s = 0 for |s| > 1 gives the bag-of-characters kernel. This takes the feature vector for a string x to be the number of times that each character in A appears in x.
In text analysis we may wish to consider the frequencies of word occurrence. If we require s to be bordered by whitespace then a bag-of-words representation is obtained. Although this is a very simple model of text (which ignores word order) it can be surprisingly effective for document classification and retrieval tasks; see e.g. Hand et al. [2001, sec. 14.3]. The weights can be set differently for different words, e.g. using the term frequency inverse document frequency (TF-IDF) weighting scheme developed in the information retrieval area [Salton and Buckley, 1988].
If we only consider substrings of length k, then we obtain the k-spectrum kernel [Leslie et al., 2003].
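A sketch of the k-spectrum special case (Python, standard library only): the feature map φ_s(x) is just a count of length-k substrings, and eq. (4.45) with w_s = 1 for |s| = k becomes a sparse dot product of the two count vectors:

```python
from collections import Counter

def spectrum_features(x, k):
    """phi_s(x): counts of every length-k substring of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def k_spectrum(x, xp, k):
    """k-spectrum kernel: eq. (4.45) with w_s = 1 for |s| = k and 0 otherwise."""
    fx, fxp = spectrum_features(x, k), spectrum_features(xp, k)
    return sum(count * fxp[s] for s, count in fx.items())
```

With k = 1 this reduces to the bag-of-characters kernel. (This quadratic-time sketch ignores the efficient suffix-tree computations discussed below.)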
Importantly, there are efficient methods using suffix trees that can compute a string kernel k(x, x') in time linear in |x| + |x'|, with some restrictions on the weights w_s [Leslie et al., 2003, Vishwanathan and Smola, 2003].
Work on string kernels was started by Watkins [1999] and Haussler [1999]. There are many further developments of the methods we have described above; for example Lodhi et al. [2001] go beyond substrings to consider subsequences of x which are not necessarily contiguous, and Leslie et al. [2003] describe mismatch string kernels which allow substrings s and s' of x and x' respectively to match if there are at most m mismatches between them. We expect further developments in this area, tailoring (or engineering) the string kernels to have properties that make sense in a particular domain.
The idea of string kernels, where we consider matches of substrings, can easily be extended to trees, e.g. by looking at matches of subtrees Collins and Duffy, 2002.
Leslie et al. [2003] have applied string kernels to the classification of protein domains into SCOP¹² superfamilies. The results obtained were significantly better than methods based on either PSI-BLAST¹³ searches or a generative hidden Markov model classifier. Similar results were obtained by Jaakkola et al. [2000] using a Fisher kernel (described in the next section). Saunders et al. [2003] have also described the use of string kernels on the problem of classifying natural language newswire stories from the Reuters-21578¹⁴ database into ten classes.
4.4.2 Fisher Kernels
As explained above, our problem is that the input x is a structured object of arbitrary size (e.g. a string), and we wish to extract features from it. The Fisher kernel (introduced by Jaakkola et al. [2000]) does this by taking a generative model p(x|θ), where θ is a vector of parameters, and computing the feature vector φ_θ(x) = ∇_θ log p(x|θ). φ_θ(x) is sometimes called the score vector.
Take, for example, a Markov model for strings. Let x_k be the k-th symbol in string x. Then a Markov model gives p(x|θ) = p(x₁|π) Π_{i=1}^{|x|−1} p(x_{i+1}|x_i, A), where θ = (π, A). Here π_j gives the probability that x₁ will be the j-th symbol in the alphabet A, and A is a |A| × |A| stochastic matrix, with a_jk giving the probability that p(x_{i+1} = k | x_i = j). Given such a model it is straightforward to compute the score vector for a given x.
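As a sketch (Python/NumPy assumed; parameter values are illustrative), the score with respect to the transition matrix of such a first-order Markov model can be approximated by finite differences of log p(x|θ); for a transition probability a_jk the closed-form gradient of the log likelihood is n_jk / a_jk, where n_jk counts j → k transitions in x:

```python
import numpy as np

def log_p(x, pi, A, alphabet):
    """log p(x | theta) for a first-order Markov model with theta = (pi, A)."""
    idx = {ch: i for i, ch in enumerate(alphabet)}
    lp = np.log(pi[idx[x[0]]])
    for a, b in zip(x[:-1], x[1:]):
        lp += np.log(A[idx[a], idx[b]])
    return lp

def score_vector(x, pi, A, alphabet, eps=1e-6):
    """Finite-difference approximation to d log p(x|theta) / dA."""
    g = np.zeros_like(A)
    base = log_p(x, pi, A, alphabet)
    for j in range(A.shape[0]):
        for k in range(A.shape[1]):
            Ap = A.copy()
            Ap[j, k] += eps
            g[j, k] = (log_p(x, pi, Ap, alphabet) - base) / eps
    return g
```

(In a full treatment one would differentiate with respect to an unconstrained parameterization, as in exercise 4.5.12, since the rows of A must sum to one.)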
It is also possible to consider other generative models p(x|θ). For example we might try a k-th order Markov model where x_i is predicted by the preceding k symbols. See Leslie et al. [2003] and Saunders et al. [2003] for an interesting discussion of the similarities of the features used in the k-spectrum kernel and the score vector derived from an order k−1 Markov model; see also exercise
¹² Structural classification of proteins database, http://scop.mrc-lmb.cam.ac.uk/scop.
¹³ Position-Specific Iterative Basic Local Alignment Search Tool; see http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html.
¹⁴ http://www.daviddlewis.com/resources/testcollections/reuters21578.
4.5.12. Another interesting choice is to use a hidden Markov model (HMM) as the generative model, as discussed by Jaakkola et al. [2000]. See also exercise 4.5.11 for a linear kernel derived from an isotropic Gaussian model for x ∈ R^D.
We define a kernel k(x, x') based on the score vectors for x and x'. One simple choice is to set

k(x, x') = φ_θ(x)ᵀ M⁻¹ φ_θ(x'),    (4.46)

where M is a strictly positive definite matrix. Alternatively we might use the squared exponential kernel k(x, x') = exp(−α |φ_θ(x) − φ_θ(x')|²) for some α > 0.
The structure of p(x|θ) as θ varies has been studied extensively in information geometry (see, e.g., Amari [1985]). It can be shown that the manifold of log p(x|θ) is Riemannian with a metric tensor which is the inverse of the Fisher information matrix F, where

F = E_x[ φ_θ(x) φ_θ(x)ᵀ ].    (4.47)
Setting M = F in eq. (4.46) gives the Fisher kernel. If F is difficult to compute then one might resort to setting M = I. The advantage of using the Fisher information matrix is that it makes arc length on the manifold invariant to reparameterizations of θ.
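As a sketch (Python/NumPy assumed), the simplest worked case is that of exercise 4.5.11: for x ~ N(μ, σ²I) with the score taken w.r.t. μ at μ = 0, the score is φ(x) = x/σ² and F = σ⁻²I, so eq. (4.46) with M = F reduces to a scaled linear kernel:

```python
import numpy as np

def fisher_kernel_gaussian(x, xp, sigma2):
    """Fisher kernel, eq. (4.46) with M = F, for x ~ N(mu, sigma^2 I) at mu = 0."""
    phi, phip = x / sigma2, xp / sigma2   # score vectors grad_mu log p(x|mu) at mu = 0
    F_inv = sigma2 * np.eye(len(x))       # F = sigma^{-2} I, so F^{-1} = sigma^2 I
    return phi @ F_inv @ phip

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
```

Multiplying out, φ(x)ᵀ F⁻¹ φ(x') = (1/σ²) xᵀx', the linear kernel stated in the exercise.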
The Fisher kernel uses a class-independent model p(x|θ). Tsuda et al. [2002] have developed the tangent of posterior odds (TOP) kernel based on ∂/∂θ ( log p(y = +1|x, θ) − log p(y = −1|x, θ) ), which makes use of the class-conditional distributions for the C₊ and C₋ classes.
4.5 Exercises
1. The OU process with covariance function k(x − x') = exp(−|x − x'|/ℓ) is the unique stationary first-order Markovian Gaussian process (see Appendix B for further details). Consider training inputs x₁ < x₂ < ... < x_{n−1} < x_n on R with corresponding function values f = (f(x₁), ..., f(x_n))ᵀ. Let x_l denote the nearest training input to the left of a test point x*, and similarly let x_u denote the nearest training input to the right of x*. Then the Markovian property means that p(f(x*)|f) = p(f(x*)|f(x_l), f(x_u)). Demonstrate this by choosing some x-points on the line and computing the predictive distribution p(f(x*)|f) using eq. (2.19), and observing that non-zero contributions only arise from x_l and x_u. Note that this only occurs in the noise-free case; if one allows the training points to be corrupted by noise (equations (2.23) and (2.24)) then all points will contribute in general.
2. Computer exercise: write code to draw samples from the neural network covariance function, eq. (4.29), in 1-d and 2-d. Consider the cases when var(u₀) is either 0 or non-zero. Explain the form of the plots obtained when var(u₀) = 0.
3. Consider the random process f(x) = erf(u₀ + Σ_{j=1}^{D} u_j x_j), where u ~ N(0, Σ). Show that this non-linear transform of a process with an inhomogeneous linear covariance function has the same covariance function as the erf neural network. However, note that this process is not a Gaussian process. Draw samples from the given process and compare them to your results from exercise 4.5.2.
4. Derive Gibbs' non-stationary covariance function, eq. (4.32).
5. Computer exercise: write code to draw samples from Gibbs' non-stationary covariance function, eq. (4.32), in 1-d and 2-d. Investigate various forms of length-scale function ℓ(x).
6. Show that the SE process is infinitely MS differentiable and that the OU process is not MS differentiable.
7. Prove that the eigenfunctions of a symmetric kernel are orthogonal w.r.t. the measure μ.
8. Let k̃(x, x') = p^{1/2}(x) k(x, x') p^{1/2}(x'), and assume p(x) > 0 for all x. Show that the eigenproblem ∫ k̃(x, x') φ̃_i(x) dx = λ_i φ̃_i(x') has the same eigenvalues as ∫ k(x, x') p(x) φ_i(x) dx = λ_i φ_i(x'), and that the eigenfunctions are related by φ̃_i(x) = p^{1/2}(x) φ_i(x). Also give the matrix version of this problem (Hint: introduce a diagonal matrix P to take the role of p(x)). The significance of this connection is that it can be easier to find eigenvalues of symmetric matrices than general matrices.
9. Apply the construction in the previous exercise to the eigenproblem for the SE kernel and Gaussian density given in section 4.3.1, with p(x) = √(2a/π) exp(−2ax²). Thus consider the modified kernel given by k̃(x, x') = exp(−ax²) exp(−b(x − x')²) exp(−a(x')²). Using equation 7.374.8 in Gradshteyn and Ryzhik [1980]:

∫_{−∞}^{∞} exp(−(x − y)²) H_n(αx) dx = √π (1 − α²)^{n/2} H_n( αy / (1 − α²)^{1/2} ),

verify that φ̃_k(x) = exp(−cx²) H_k(√(2c) x), and thus confirm equations (4.39) and (4.40).
10. Computer exercise: the analytic forms of the eigenvalues and eigenfunctions for the SE kernel and Gaussian density are given in section 4.3.1. Compare these exact results to those obtained by the Nyström approximation for various values of n and choices of samples.
11. Let x ~ N(μ, σ²I). Consider the Fisher kernel derived from this model with respect to variation of μ (i.e. regard σ² as a constant). Show that

∂ log p(x|μ)/∂μ |_{μ=0} = x/σ²

and that F = σ⁻² I. Thus the Fisher kernel for this model with μ = 0 is the linear kernel k(x, x') = (1/σ²) xᵀx'.
12. Consider a k−1 order Markov model for strings on a finite alphabet. Let this model have parameters θ_{t|s₁,...,s_{k−1}} denoting the probability p(x_i = t | x_{i−1} = s₁, ..., x_{i−k+1} = s_{k−1}). Of course as these are probabilities they obey the constraint that Σ_{t'} θ_{t'|s₁,...,s_{k−1}} = 1. Enforcing this constraint can be achieved automatically by setting

θ_{t|s₁,...,s_{k−1}} = θ_{t,s₁,...,s_{k−1}} / Σ_{t'} θ_{t',s₁,...,s_{k−1}},

where the θ_{t,s₁,...,s_{k−1}} parameters are now independent, as suggested in [Jaakkola et al., 2000]. The current parameter values are denoted θ⁰. Let the current values of θ_{t,s₁,...,s_{k−1}} be set so that Σ_{t'} θ⁰_{t',s₁,...,s_{k−1}} = 1, i.e. that θ⁰_{t,s₁,...,s_{k−1}} = θ⁰_{t|s₁,...,s_{k−1}}.

Show that log p(x|θ) = Σ n_{t,s₁,...,s_{k−1}} log θ_{t|s₁,...,s_{k−1}}, where n_{t,s₁,...,s_{k−1}} is the number of instances of the substring s_{k−1}...s₁t in x. Thus, following Leslie et al. [2003], show that

∂ log p(x|θ)/∂θ_{t,s₁,...,s_{k−1}} |_{θ⁰} = n_{t,s₁,...,s_{k−1}} / θ⁰_{t|s₁,...,s_{k−1}} − n_{s₁,...,s_{k−1}},

where n_{s₁,...,s_{k−1}} is the number of instances of the substring s_{k−1}...s₁ in x. As n_{s₁,...,s_{k−1}} θ⁰_{t|s₁,...,s_{k−1}} is the expected number of occurrences of the string s_{k−1}...s₁t given the count n_{s₁,...,s_{k−1}}, the Fisher score captures the degree to which this string is over- or under-represented relative to the model. For the k-spectrum kernel the relevant feature is φ_{s_{k−1},...,s₁,t}(x) = n_{t,s₁,...,s_{k−1}}.