Chapter 2
Regression
Supervised learning can be divided into regression and classification problems. Whereas the outputs for classification are discrete class labels, regression is concerned with the prediction of continuous quantities. For example, in a financial application, one may attempt to predict the price of a commodity as a function of interest rates, currency exchange rates, availability and demand. In this chapter we describe Gaussian process methods for regression problems; classification problems are discussed in chapter 3.
There are several ways to interpret Gaussian process (GP) regression models. One can think of a Gaussian process as defining a distribution over functions, and inference taking place directly in the space of functions, the function-space view. Although this view is appealing it may initially be difficult to grasp, so we start our exposition in section 2.1 with the equivalent weight-space view which may be more familiar and accessible to many, and continue in section 2.2 with the function-space view. Gaussian processes often have characteristics that can be changed by setting certain parameters and in section 2.3 we discuss how the properties change as these parameters are varied. The predictions from a GP model take the form of a full predictive distribution; in section 2.4 we discuss how to combine a loss function with the predictive distributions using decision theory to make point predictions in an optimal way. A practical comparative example involving the learning of the inverse dynamics of a robot arm is presented in section 2.5. We give some theoretical analysis of Gaussian process regression in section 2.6, and discuss how to incorporate explicit basis functions into the models in section 2.7. As much of the material in this chapter can be considered fairly standard, we postpone most references to the historical overview in section 2.8.
2.1 Weight-space View
The simple linear regression model where the output is a linear combination of the inputs has been studied and used extensively. Its main virtues are simplicity
of implementation and interpretability. Its main drawback is that it only allows a limited flexibility; if the relationship between input and output cannot reasonably be approximated by a linear function, the model will give poor predictions.
In this section we first discuss the Bayesian treatment of the linear model. We then make a simple enhancement to this class of models by projecting the inputs into a high-dimensional feature space and applying the linear model there. We show that in some feature spaces one can apply the kernel trick to carry out computations implicitly in the high-dimensional space; this last step leads to computational savings when the dimensionality of the feature space is large compared to the number of data points.
We have a training set $\mathcal{D}$ of $n$ observations, $\mathcal{D} = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, where $x$ denotes an input vector (covariates) of dimension $D$ and $y$ denotes a scalar output or target (dependent variable); the column vector inputs for all $n$ cases are aggregated in the $D \times n$ design matrix1 $X$, and the targets are collected in the vector $y$, so we can write $\mathcal{D} = (X, y)$. In the regression setting the targets are real values. We are interested in making inferences about the relationship between inputs and targets, i.e. the conditional distribution of the targets given the inputs (but we are not interested in modelling the input distribution itself).
2.1.1 The Standard Linear Model
We will review the Bayesian analysis of the standard linear regression model with Gaussian noise
$$f(x) = x^\top w, \qquad y = f(x) + \varepsilon, \tag{2.1}$$
where $x$ is the input vector, $w$ is a vector of weights (parameters) of the linear model, $f$ is the function value and $y$ is the observed target value. Often a bias weight or offset is included, but as this can be implemented by augmenting the input vector $x$ with an additional element whose value is always one, we do not explicitly include it in our notation. We have assumed that the observed values $y$ differ from the function values $f(x)$ by additive noise, and we will further assume that this noise follows an independent, identically distributed Gaussian distribution with zero mean and variance $\sigma_n^2$
$$\varepsilon \sim \mathcal{N}(0, \sigma_n^2). \tag{2.2}$$
This noise assumption together with the model directly gives rise to the likelihood, the probability density of the observations given the parameters, which is
1In statistics texts the design matrix is usually taken to be the transpose of our definition, but our choice is deliberate and has the advantage that a data point is a standard column vector.
factored over cases in the training set because of the independence assumption to give
$$p(y|X, w) = \prod_{i=1}^{n} p(y_i|x_i, w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\Big(-\frac{(y_i - x_i^\top w)^2}{2\sigma_n^2}\Big) = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma_n^2}\,|y - X^\top w|^2\Big) = \mathcal{N}(X^\top w, \sigma_n^2 I), \tag{2.3}$$
where $|z|$ denotes the Euclidean length of vector $z$. In the Bayesian formalism we need to specify a prior over the parameters, expressing our beliefs about the parameters before we look at the observations. We put a zero mean Gaussian prior with covariance matrix $\Sigma_p$ on the weights
$$w \sim \mathcal{N}(0, \Sigma_p). \tag{2.4}$$
The role and properties of this prior will be discussed in section 2.2; for now we will continue the derivation with the prior as specified.
Inference in the Bayesian linear model is based on the posterior distribution over the weights, computed by Bayes' rule (see eq. A.3)2
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \qquad p(w|y, X) = \frac{p(y|X, w)\, p(w)}{p(y|X)}, \tag{2.5}$$
where the normalizing constant, also known as the marginal likelihood (see page 19), is independent of the weights and given by
$$p(y|X) = \int p(y|X, w)\, p(w)\, dw. \tag{2.6}$$
The posterior in eq. 2.5 combines the likelihood and the prior, and captures everything we know about the parameters. Writing only the terms from the likelihood and prior which depend on the weights, and completing the square we obtain
$$p(w|X, y) \propto \exp\Big(-\frac{1}{2\sigma_n^2}(y - X^\top w)^\top (y - X^\top w)\Big) \exp\Big(-\frac{1}{2}\, w^\top \Sigma_p^{-1} w\Big) \propto \exp\Big(-\frac{1}{2}(w - \bar{w})^\top \big(\tfrac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1}\big)(w - \bar{w})\Big), \tag{2.7}$$

where $\bar{w} = \sigma_n^{-2}(\sigma_n^{-2} X X^\top + \Sigma_p^{-1})^{-1} X y$, and we recognize the form of the posterior distribution as Gaussian with mean $\bar{w}$ and covariance matrix $A^{-1}$

$$p(w|X, y) \sim \mathcal{N}\big(\bar{w} = \tfrac{1}{\sigma_n^2} A^{-1} X y,\; A^{-1}\big), \tag{2.8}$$

where $A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$. Notice that for this model (and indeed for any Gaussian posterior) the mean of the posterior distribution $p(w|y, X)$ is also its mode, which is also called the maximum a posteriori (MAP) estimate of $w$.
2Often Bayes' rule is stated as $p(a|b) = p(b|a)p(a)/p(b)$; here we use it in a form where we additionally condition everywhere on the inputs $X$ (but neglect this extra conditioning for the prior which is independent of the inputs).
Figure 2.1: Example of Bayesian linear model $f(x) = w_1 + w_2 x$ with intercept $w_1$ and slope parameter $w_2$. Panel (a) shows the contours of the prior distribution $p(w) \sim \mathcal{N}(0, I)$, eq. 2.4. Panel (b) shows three training points marked by crosses. Panel (c) shows contours of the likelihood $p(y|X, w)$, eq. 2.3, assuming a noise level of $\sigma_n = 1$; note that the slope is much more well determined than the intercept. Panel (d) shows the posterior, $p(w|X, y)$, eq. 2.7; comparing the maximum of the posterior to the likelihood, we see that the intercept has been shrunk towards zero whereas the more well determined slope is almost unchanged. All contour plots give the 1 and 2 standard deviation equiprobability contours. Superimposed on the data in panel (b) are the predictive mean plus/minus two standard deviations of the noise-free predictive distribution $p(f_*|x_*, X, y)$, eq. 2.9.
In a non-Bayesian setting the negative log prior is sometimes thought of as a penalty term, and the MAP point is known as the penalized maximum likelihood estimate of the weights, and this may cause some confusion between the two approaches. Note, however, that in the Bayesian setting the MAP estimate plays no special role.3 The penalized maximum likelihood procedure
3In this case, due to symmetries in the model and posterior, it happens that the mean of the predictive distribution is the same as the prediction at the mean of the posterior. However, this is not the case in general.
is known in this case as ridge regression [Hoerl and Kennard, 1970] because of the effect of the quadratic penalty term $\frac{1}{2} w^\top \Sigma_p^{-1} w$ from the log prior.
To make predictions for a test case we average over all possible parameter values, weighted by their posterior probability. This is in contrast to non-Bayesian schemes, where a single parameter is typically chosen by some criterion. Thus the predictive distribution for $f_* \triangleq f(x_*)$ at $x_*$ is given by averaging the output of all possible linear models w.r.t. the Gaussian posterior

$$p(f_*|x_*, X, y) = \int p(f_*|x_*, w)\, p(w|X, y)\, dw = \mathcal{N}\Big(\frac{1}{\sigma_n^2}\, x_*^\top A^{-1} X y,\;\; x_*^\top A^{-1} x_*\Big). \tag{2.9}$$
The predictive distribution is again Gaussian, with a mean given by the posterior mean of the weights from eq. 2.8 multiplied by the test input, as one would expect from symmetry considerations. The predictive variance is a quadratic form of the test input with the posterior covariance matrix, showing that the predictive uncertainties grow with the magnitude of the test input, as one would expect for a linear model.
An example of Bayesian linear regression is given in Figure 2.1. Here we have chosen a 1-d input space so that the weight-space is two-dimensional and can be easily visualized. Contours of the Gaussian prior are shown in panel (a). The data are depicted as crosses in panel (b). This gives rise to the likelihood shown in panel (c) and the posterior distribution in panel (d). The predictive distribution and its error bars are also marked in panel (b).
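As a concrete illustration of eq. 2.8 and eq. 2.9, here is a minimal NumPy sketch (ours, not the book's code); the toy inputs, targets and the helper name blr_posterior_predict are hypothetical, and the intercept is handled by augmenting each input with a constant 1 as described above.

import numpy as np

def blr_posterior_predict(X, y, Sigma_p, sigma_n, x_star):
    """Posterior over the weights (eq. 2.8) and predictive distribution (eq. 2.9)
    for the Bayesian linear model. X is the D x n design matrix (inputs as columns),
    y the n-vector of targets, Sigma_p the prior covariance of the weights,
    sigma_n the noise standard deviation, x_star a D-vector test input."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)   # A = sigma_n^-2 X X^T + Sigma_p^-1
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma_n**2                  # posterior mean of the weights, eq. 2.8
    f_mean = x_star @ w_bar                             # predictive mean, eq. 2.9
    f_var = x_star @ A_inv @ x_star                     # predictive variance, eq. 2.9
    return w_bar, A_inv, f_mean, f_var

# Toy 1-d example in the spirit of Figure 2.1: f(x) = w1 + w2*x, prior w ~ N(0, I).
x = np.array([-1.0, 0.5, 2.0])                          # hypothetical training inputs
X = np.vstack([np.ones_like(x), x])                     # augment with a constant for the intercept
y = np.array([-0.9, 0.3, 1.8])                          # hypothetical targets
w_bar, S_post, m, v = blr_posterior_predict(X, y, np.eye(2), 0.1, np.array([1.0, 1.5]))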
2.1.2 Projections of Inputs into Feature Space
In the previous section we reviewed the Bayesian linear model which suffers from limited expressiveness. A very simple idea to overcome this problem is to first project the inputs into some high-dimensional space using a set of basis functions and then apply the linear model in this space instead of directly on the inputs themselves. For example, a scalar input $x$ could be projected into the space of powers of $x$: $\phi(x) = (1, x, x^2, x^3, \ldots)^\top$ to implement polynomial regression. As long as the projections are fixed functions (i.e. independent of the parameters $w$) the model is still linear in the parameters, and therefore analytically tractable.4 This idea is also used in classification, where a dataset which is not linearly separable in the original data space may become linearly separable in a high dimensional feature space, see section 3.3. Application of this idea begs the question of how to choose the basis functions? As we shall demonstrate in chapter 5, the Gaussian process formalism allows us to answer this question. For now, we assume that the basis functions are given.
Specifically, we introduce the function $\phi(x)$ which maps a $D$-dimensional input vector $x$ into an $N$-dimensional feature space. Further let the matrix $\Phi(X)$ be the aggregation of columns $\phi(x)$ for all cases in the training set.
4Models with adaptive basis functions, such as e.g. multilayer perceptrons, may at first seem like a useful extension, but they are much harder to treat, except in the limit of an infinite number of hidden units, see section 4.2.3.
Now the model is
$$f(x) = \phi(x)^\top w, \tag{2.10}$$
where the vector of parameters now has length $N$. The analysis for this model is analogous to the standard linear model, except that everywhere $\Phi(X)$ is substituted for $X$. Thus the predictive distribution becomes
$$f_*|x_*, X, y \sim \mathcal{N}\Big(\frac{1}{\sigma_n^2}\, \phi(x_*)^\top A^{-1} \Phi\, y,\;\; \phi(x_*)^\top A^{-1} \phi(x_*)\Big) \tag{2.11}$$

with $\Phi = \Phi(X)$ and $A = \sigma_n^{-2} \Phi \Phi^\top + \Sigma_p^{-1}$. To make predictions using this equation we need to invert the $A$ matrix of size $N \times N$ which may not be convenient if $N$, the dimension of the feature space, is large. However, we can rewrite the equation in the following way
$$f_*|x_*, X, y \sim \mathcal{N}\big(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y,\;\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*\big), \tag{2.12}$$

where we have used the shorthand $\phi(x_*) = \phi_*$ and defined $K = \Phi^\top \Sigma_p \Phi$.
To show this for the mean, first note that using the definitions of $A$ and $K$ we have $\sigma_n^{-2}\Phi(K + \sigma_n^2 I) = \sigma_n^{-2}\Phi(\Phi^\top \Sigma_p \Phi + \sigma_n^2 I) = A \Sigma_p \Phi$. Now multiplying through by $A^{-1}$ from the left and $(K + \sigma_n^2 I)^{-1}$ from the right gives $\sigma_n^{-2} A^{-1} \Phi = \Sigma_p \Phi (K + \sigma_n^2 I)^{-1}$, showing the equivalence of the mean expressions in eq. 2.11 and eq. 2.12. For the variance we use the matrix inversion lemma, eq. A.9, setting $Z^{-1} = \Sigma_p$, $W^{-1} = \sigma_n^2 I$ and $V = U = \Phi$ therein. In eq. 2.12 we need to invert matrices of size $n \times n$ which is more convenient when $n < N$. Geometrically, note that $n$ datapoints can span at most $n$ dimensions in the feature space.
Notice that in eq. 2.12 the feature space always enters in the form of $\Phi^\top \Sigma_p \Phi$, $\phi_*^\top \Sigma_p \Phi$, or $\phi_*^\top \Sigma_p \phi_*$; thus the entries of these matrices are invariably of the form $\phi(x)^\top \Sigma_p \phi(x')$ where $x$ and $x'$ are in either the training or the test sets. Let us define $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$. For reasons that will become clear later
we call $k(\cdot, \cdot)$ a covariance function or kernel. Notice that $\phi(x)^\top \Sigma_p \phi(x')$ is an inner product (with respect to $\Sigma_p$). As $\Sigma_p$ is positive definite we can define $\Sigma_p^{1/2}$ so that $(\Sigma_p^{1/2})^2 = \Sigma_p$; for example if the SVD (singular value decomposition) of $\Sigma_p = U D U^\top$, where $D$ is diagonal, then one form for $\Sigma_p^{1/2}$ is $U D^{1/2} U^\top$. Then defining $\psi(x) = \Sigma_p^{1/2} \phi(x)$ we obtain a simple dot product representation $k(x, x') = \psi(x) \cdot \psi(x')$.
If an algorithm is defined solely in terms of inner products in input space then it can be lifted into feature space by replacing occurrences of those inner products by $k(x, x')$; this is sometimes called the kernel trick. This technique is particularly valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves. As we will see in the coming sections, this often leads to considering the kernel as the object of primary interest, and its corresponding feature space as having secondary practical importance.
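The following small sketch (ours; the cubic feature map and the prior covariance values are arbitrary choices for illustration) checks numerically that the kernel $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$ can equally be written as a plain dot product of $\psi(x) = \Sigma_p^{1/2} \phi(x)$, here using a Cholesky factor as one valid square root of $\Sigma_p$.

import numpy as np

def phi(x):
    """Hypothetical polynomial feature map for a scalar input: phi(x) = (1, x, x^2)."""
    return np.array([1.0, x, x**2])

Sigma_p = np.diag([1.0, 1.0, 0.5])       # assumed prior covariance over the three weights

def k(x, xp):
    """Covariance function k(x, x') = phi(x)^T Sigma_p phi(x') induced by the feature map."""
    return phi(x) @ Sigma_p @ phi(xp)

# The same kernel as a plain dot product of psi(x) = Sigma_p^{1/2} phi(x);
# the Cholesky factor of Sigma_p serves as one valid square root.
L = np.linalg.cholesky(Sigma_p)
def k_dot(x, xp):
    return (L.T @ phi(x)) @ (L.T @ phi(xp))

assert np.isclose(k(0.3, -1.2), k_dot(0.3, -1.2))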
2.2 Function-space View
An alternative and equivalent way of reaching identical results to the previous section is possible by considering inference directly in function space. We use a Gaussian process (GP) to describe a distribution over functions. Formally:
Definition 2.1 A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A Gaussian process is completely specified by its mean function and covariance function. We define the mean function $m(x)$ and the covariance function $k(x, x')$ of a real process $f(x)$ as
$$m(x) = \mathbb{E}[f(x)],$$
$$k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big], \tag{2.13}$$

and will write the Gaussian process as

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big). \tag{2.14}$$
Usually, for notational simplicity we will take the mean function to be zero, although this need not be done, see section 2.7.
In our case the random variables represent the value of the function $f(x)$ at location $x$. Often, Gaussian processes are defined over time, i.e. where the index set of the random variables is time. This is not (normally) the case in our use of GPs; here the index set $\mathcal{X}$ is the set of possible inputs, which could be more general, e.g. $\mathbb{R}^D$. For notational convenience we use the (arbitrary) enumeration of the cases in the training set to identify the random variables such that $f_i \triangleq f(x_i)$ is the random variable corresponding to the case $(x_i, y_i)$ as would be expected.
A Gaussian process is defined as a collection of random variables. Thus, the definition automatically implies a consistency requirement, which is also sometimes known as the marginalization property. This property simply means that if the GP e.g. specifies $(y_1, y_2) \sim \mathcal{N}(\mu, \Sigma)$, then it must also specify $y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$ where $\Sigma_{11}$ is the relevant submatrix of $\Sigma$, see eq. A.6. In other words, examination of a larger set of variables does not change the distribution of the smaller set. Notice that the consistency requirement is automatically fulfilled if the covariance function specifies entries of the covariance matrix.5 The definition does not exclude Gaussian processes with finite index sets (which would be simply Gaussian distributions), but these are not particularly interesting for our purposes.
5Note, however, that if you instead specified e.g. a function for the entries of the inverse covariance matrix, then the marginalization property would no longer be fulfilled, and one could not think of this as a consistent collection of random variables; this would not qualify as a Gaussian process.
A simple example of a Gaussian process can be obtained from our Bayesian linear regression model $f(x) = \phi(x)^\top w$ with prior $w \sim \mathcal{N}(0, \Sigma_p)$. We have for the mean and covariance
$$\mathbb{E}[f(x)] = \phi(x)^\top \mathbb{E}[w] = 0,$$
$$\mathbb{E}[f(x) f(x')] = \phi(x)^\top \mathbb{E}[w w^\top]\, \phi(x') = \phi(x)^\top \Sigma_p\, \phi(x'). \tag{2.15}$$
Thus $f(x)$ and $f(x')$ are jointly Gaussian with zero mean and covariance given by $\phi(x)^\top \Sigma_p \phi(x')$. Indeed, the function values $f(x_1), \ldots, f(x_n)$ corresponding to any number of input points $n$ are jointly Gaussian, although if $N < n$ then this Gaussian is singular (as the joint covariance matrix will be of rank $N$).
In this chapter our running example of a covariance function will be the squared exponential6 (SE) covariance function; other covariance functions are discussed in chapter 4. The covariance function specifies the covariance between pairs of random variables
$$\operatorname{cov}\big(f(x_p), f(x_q)\big) = k(x_p, x_q) = \exp\big(-\tfrac{1}{2}\,|x_p - x_q|^2\big). \tag{2.16}$$

Note that the covariance between the outputs is written as a function of the inputs. For this particular covariance function, we see that the covariance is almost unity between variables whose corresponding inputs are very close, and decreases as their distance in the input space increases.
It can be shown (see section 4.3.1) that the squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions. Indeed for every positive definite covariance function $k(\cdot, \cdot)$, there exists a (possibly infinite) expansion in terms of basis functions (see Mercer's theorem in section 4.3). We can also obtain the SE covariance function from the linear combination of an infinite number of Gaussian-shaped basis functions, see eq. 4.13 and eq. 4.30.
The specification of the covariance function implies a distribution over functions. To see this, we can draw samples from the distribution of functions evaluated at any number of points; in detail, we choose a number of input points,7 $X_*$, and write out the corresponding covariance matrix using eq. 2.16 elementwise. Then we generate a random Gaussian vector with this covariance matrix
$$f_* \sim \mathcal{N}\big(0,\, K(X_*, X_*)\big), \tag{2.17}$$
and plot the generated values as a function of the inputs. Figure 2.2(a) shows three such samples. The generation of multivariate Gaussian samples is described in section A.2.
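A minimal sketch of this sampling procedure (ours, not the book's code) for the SE covariance of eq. 2.16; the small jitter added to the diagonal is a standard numerical device to keep the Cholesky factorization stable, not part of the model.

import numpy as np

def k_se(A, B):
    """Squared exponential covariance of eq. 2.16 (unit length-scale) between 1-d input vectors."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

x_star = np.linspace(-5, 5, 200)                             # grid of test inputs, as in Figure 2.2(a)
K_ss = k_se(x_star, x_star)                                  # covariance matrix K(X*, X*)
L = np.linalg.cholesky(K_ss + 1e-10 * np.eye(len(x_star)))   # jitter keeps the factorization stable
f_prior = L @ np.random.randn(len(x_star), 3)                # three draws from f* ~ N(0, K(X*, X*)), eq. 2.17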
In the example in Figure 2.2 the input values were equidistant, but this need not be the case. Notice that informally the functions look smooth. In fact the squared exponential covariance function is infinitely differentiable, leading to the process being infinitely mean-square differentiable (see section 4.1). We also see that the functions seem to have a characteristic length-scale,
6Sometimes this covariance function is called the Radial Basis Function (RBF) or Gaussian; here we prefer squared exponential.
7Technically, these input points play the role of test inputs and therefore carry a subscript asterisk; this will become clearer later when both training and test points are involved.
Figure 2.2: Panel (a) shows three functions drawn at random from a GP prior; the dots indicate values of $y$ actually generated; the two other functions have (less correctly) been drawn as lines by joining a large number of evaluated points. Panel (b) shows three random functions drawn from the posterior, i.e. the prior conditioned on the five noise free observations indicated. In both plots the shaded area represents the pointwise mean plus and minus two times the standard deviation for each input value (corresponding to the 95% confidence region), for the prior and posterior respectively.
which informally can be thought of as roughly the distance you have to move in input space before the function value can change significantly, see section 4.2.1. For eq. 2.16 the characteristic length-scale is around one unit. By replacing $|x_p - x_q|$ by $|x_p - x_q|/\ell$ in eq. 2.16 for some positive constant $\ell$ we could change the characteristic length-scale of the process. Also, the overall variance of the random function can be controlled by a positive pre-factor before the exp in eq. 2.16. We will discuss more about how such factors affect the predictions in section 2.3, and say more about how to set such scale parameters in chapter 5.
Prediction with Noise-free Observations
We are usually not primarily interested in drawing random functions from the prior, but want to incorporate the knowledge that the training data provides about the function. Initially, we will consider the simple special case where the observations are noise free, that is we know $\{(x_i, f_i) \mid i = 1, \ldots, n\}$. The joint distribution of the training outputs, $f$, and the test outputs $f_*$ according to the prior is

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\Big(0,\; \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\Big). \tag{2.18}$$
If there are $n$ training points and $n_*$ test points then $K(X, X_*)$ denotes the $n \times n_*$ matrix of the covariances evaluated at all pairs of training and test points, and similarly for the other entries $K(X, X)$, $K(X_*, X_*)$ and $K(X_*, X)$. To get the posterior distribution over functions we need to restrict this joint prior distribution to contain only those functions which agree with the observed data points. Graphically in Figure 2.2 you may think of generating functions from the prior, and rejecting the ones that disagree with the observations,
although this strategy would not be computationally very efficient. Fortunately, in probabilistic terms this operation is extremely simple, corresponding to conditioning the joint Gaussian prior distribution on the observations (see section A.2 for further details) to give

$$f_*|X_*, X, f \sim \mathcal{N}\big(K(X_*, X) K(X, X)^{-1} f,\;\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)\big). \tag{2.19}$$
Function values $f_*$ (corresponding to test inputs $X_*$) can be sampled from the joint posterior distribution by evaluating the mean and covariance matrix from eq. 2.19 and generating samples according to the method described in section A.2.
Figure 2.2(b) shows the results of these computations given the five data points marked with + symbols. Notice that it is trivial to extend these computations to multidimensional inputs: one simply needs to change the evaluation of the covariance function in accordance with eq. 2.16, although the resulting functions may be harder to display graphically.
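The following sketch (ours; the five training pairs are made-up stand-ins for the observations of Figure 2.2(b)) implements the noise-free conditioning of eq. 2.19 and draws functions from the resulting posterior.

import numpy as np

def k_se(A, B):
    """Squared exponential covariance (eq. 2.16) between 1-d input vectors."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

X = np.array([-4.0, -3.0, -1.0, 0.0, 2.0])      # hypothetical noise-free training inputs
f = np.array([-2.0, 0.0, 1.0, 2.0, -1.0])       # hypothetical observed function values
X_star = np.linspace(-5, 5, 200)                # test inputs

K = k_se(X, X)
K_s = k_se(X, X_star)
K_ss = k_se(X_star, X_star)

mean = K_s.T @ np.linalg.solve(K, f)                        # posterior mean, eq. 2.19
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)                # posterior covariance, eq. 2.19

L = np.linalg.cholesky(cov + 1e-10 * np.eye(len(X_star)))   # jitter for numerical stability
samples = mean[:, None] + L @ np.random.randn(len(X_star), 3)   # three posterior draws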
Prediction using Noisy Observations
It is typical for more realistic modelling situations that we do not have access to function values themselves, but only noisy versions thereof $y = f(x) + \varepsilon$.8 Assuming additive independent identically distributed Gaussian noise $\varepsilon$ with variance $\sigma_n^2$, the prior on the noisy observations becomes
$$\operatorname{cov}(y_p, y_q) = k(x_p, x_q) + \sigma_n^2 \delta_{pq} \quad \text{or} \quad \operatorname{cov}(y) = K(X, X) + \sigma_n^2 I, \tag{2.20}$$
where $\delta_{pq}$ is a Kronecker delta which is one iff $p = q$ and zero otherwise. It follows from the independence9 assumption about the noise, that a diagonal matrix10 is added, in comparison to the noise free case, eq. 2.16. Introducing the noise term in eq. 2.18 we can write the joint distribution of the observed target values and the function values at the test locations under the prior as
$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\Big(0,\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\Big). \tag{2.21}$$
Deriving the conditional distribution corresponding to eq. 2.19 we arrive at the key predictive equations for Gaussian process regression
$$f_*|X, y, X_* \sim \mathcal{N}\big(\bar{f}_*,\, \operatorname{cov}(f_*)\big), \quad \text{where} \tag{2.22}$$
$$\bar{f}_* \triangleq \mathbb{E}[f_*|X, y, X_*] = K(X_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} y, \tag{2.23}$$
$$\operatorname{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*). \tag{2.24}$$
8There are some situations where it is reasonable to assume that the observations are noise-free, for example for computer simulations, see e.g. Sacks et al. [1989].
9More complicated noise models with nontrivial covariance structure can also be handled, see section 9.2.
10Notice that the Kronecker delta is on the index of the cases, not the value of the input; for the signal part of the covariance function the input value is the index set to the random variables describing the function, for the noise part it is the identity of the point.
Figure 2.3: Graphical model (chain graph) for a GP for regression. Squares represent observed variables and circles represent unknowns. The thick horizontal bar represents a set of fully connected nodes. Note that an observation $y_i$ is conditionally independent of all other nodes given the corresponding latent variable, $f_i$. Because of the marginalization property of GPs addition of further inputs, $x_*$, latent variables, $f_*$, and unobserved targets, $y_*$, does not change the distribution of any other variables.
Notice that we now have exact correspondence with the weight-space view in eq. 2.12 when identifying $K(C, D) = \Phi(C)^\top \Sigma_p \Phi(D)$, where $C, D$ stand for either $X$ or $X_*$. For any set of basis functions, we can compute the corresponding covariance function as $k(x_p, x_q) = \phi(x_p)^\top \Sigma_p \phi(x_q)$; conversely, for every (positive definite) covariance function $k$, there exists a (possibly infinite) expansion in terms of basis functions, see section 4.3.
The expressions involving $K(X, X)$, $K(X, X_*)$ and $K(X_*, X_*)$ etc. can look rather unwieldy, so we now introduce a compact form of the notation setting $K = K(X, X)$ and $K_* = K(X, X_*)$. In the case that there is only one test point $x_*$ we write $k(x_*) = k_*$ to denote the vector of covariances between the test point and the $n$ training points. Using this compact notation and for a single test point $x_*$, equations 2.23 and 2.24 reduce to
$$\bar{f}_* = k_*^\top (K + \sigma_n^2 I)^{-1} y, \tag{2.25}$$
$$\mathbb{V}[f_*] = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*. \tag{2.26}$$
Let us examine the predictive distribution as given by equations 2.25 and 2.26. Note first that the mean prediction eq. 2.25 is a linear combination of observations $y$; this is sometimes referred to as a linear predictor. Another way to look at this equation is to see it as a linear combination of $n$ kernel functions, each one centered on a training point, by writing
$$\bar{f}(x_*) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x_*), \tag{2.27}$$
where $\alpha = (K + \sigma_n^2 I)^{-1} y$. The fact that the mean prediction for $f(x_*)$ can be written as eq. 2.27 despite the fact that the GP can be represented in terms of a (possibly infinite) number of basis functions is one manifestation of the representer theorem; see section 6.2 for more on this point. We can understand this result intuitively because although the GP defines a joint Gaussian distribution over all of the $y$ variables, one for each point in the index set $\mathcal{X}$, for
Figure 2.4: Panel (a) is identical to Figure 2.2(b), showing three random functions drawn from the posterior. Panel (b) shows the posterior covariance between $f(x)$ and $f(x')$ for the same data for three different values of $x'$. Note that the covariance at close points is high, falling to zero at the training points (where there is no variance, since it is a noise-free process), then becomes negative, etc. This happens because if the smooth function happens to be less than the mean on one side of the data point, it tends to exceed the mean on the other side, causing a reversal of the sign of the covariance at the data points. Note for contrast that the prior covariance is simply of Gaussian shape and never negative.
making predictions at $x_*$ we only care about the $(n+1)$-dimensional distribution defined by the $n$ training points and the test point. As a Gaussian distribution is marginalized by just taking the relevant block of the joint covariance matrix (see section A.2) it is clear that conditioning this $(n+1)$-dimensional distribution on the observations gives us the desired result. A graphical model representation of a GP is given in Figure 2.3.
Note also that the variance in eq. 2.24 does not depend on the observed targets, but only on the inputs; this is a property of the Gaussian distribution. The variance is the difference between two terms: the first term $K(X_*, X_*)$ is simply the prior covariance; from that is subtracted a (positive) term, representing the information the observations give us about the function. We can very simply compute the predictive distribution of test targets $y_*$ by adding $\sigma_n^2 I$ to the variance in the expression for $\operatorname{cov}(f_*)$.
The predictive distribution for the GP model gives more than just pointwise error-bars of the simplified eq. 2.26. Although not stated explicitly, eq. 2.24 holds unchanged when $X_*$ denotes multiple test inputs; in this case the covariance of the test targets is computed (whose diagonal elements are the pointwise variances). In fact, eq. 2.23 is the mean function and eq. 2.24 the covariance function of the (Gaussian) posterior process; recall the definition of Gaussian process from page 13. The posterior covariance is illustrated in Figure 2.4(b).
It will be useful (particularly for chapter 5) to introduce the marginal likelihood (or evidence) $p(y|X)$ at this point. The marginal likelihood is the integral of the likelihood times the prior
input: $X$ (inputs), $y$ (targets), $k$ (covariance function), $\sigma_n^2$ (noise level), $x_*$ (test input)
2:  $L := \operatorname{cholesky}(K + \sigma_n^2 I)$
    $\alpha := L^\top \backslash (L \backslash y)$
4:  $\bar{f}_* := k_*^\top \alpha$   { predictive mean, eq. 2.25 }
    $v := L \backslash k_*$
6:  $\mathbb{V}[f_*] := k(x_*, x_*) - v^\top v$   { predictive variance, eq. 2.26 }
    $\log p(y|X) := -\tfrac{1}{2}\, y^\top \alpha - \sum_i \log L_{ii} - \tfrac{n}{2} \log 2\pi$   { eq. 2.30 }
8:  return: $\bar{f}_*$ (mean), $\mathbb{V}[f_*]$ (variance), $\log p(y|X)$ (log marginal likelihood)
Algorithm 2.1: Predictions and log marginal likelihood for Gaussian process regression. The implementation addresses the matrix inversion required by eq. 2.25 and 2.26 using Cholesky factorization, see section A.4. For multiple test cases lines 4-6 are repeated. The log determinant required in eq. 2.30 is computed from the Cholesky factor (for large $n$ it may not be possible to represent the determinant itself). The computational complexity is $n^3/6$ for the Cholesky decomposition in line 2, and $n^2/2$ for solving triangular systems in line 3 and (for each test case) in line 5.
$$p(y|X) = \int p(y|f, X)\, p(f|X)\, df. \tag{2.28}$$
The term marginal likelihood refers to the marginalization over the function values $f$. Under the Gaussian process model the prior is Gaussian, $f|X \sim \mathcal{N}(0, K)$, or
$$\log p(f|X) = -\tfrac{1}{2}\, f^\top K^{-1} f - \tfrac{1}{2} \log|K| - \tfrac{n}{2} \log 2\pi, \tag{2.29}$$
and the likelihood is a factorized Gaussian $y|f \sim \mathcal{N}(f, \sigma_n^2 I)$ so we can make use of equations A.7 and A.8 to perform the integration yielding the log marginal likelihood
$$\log p(y|X) = -\tfrac{1}{2}\, y^\top (K + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log|K + \sigma_n^2 I| - \tfrac{n}{2} \log 2\pi. \tag{2.30}$$
This result can also be obtained directly by observing that $y \sim \mathcal{N}(0, K + \sigma_n^2 I)$.
A practical implementation of Gaussian process regression (GPR) is shown in Algorithm 2.1. The algorithm uses Cholesky decomposition, instead of directly inverting the matrix, since it is faster and numerically more stable, see section A.4. The algorithm returns the predictive mean and variance for noise-free test data; to compute the predictive distribution for noisy test data $y_*$, simply add the noise variance $\sigma_n^2$ to the predictive variance of $f_*$.
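A direct transcription of Algorithm 2.1 into Python might look as follows (a sketch of ours using NumPy/SciPy, not the book's reference implementation); it assumes a covariance function k(A, B) that returns the matrix of pairwise covariances between two sets of inputs, such as the k_se helper below.

import numpy as np
from scipy.linalg import cholesky, cho_solve, solve_triangular

def gp_regression(X, y, k, sigma_n, x_star):
    """GP prediction and log marginal likelihood via Cholesky factorization (Algorithm 2.1).
    X: (n, D) training inputs, y: (n,) targets, k: covariance function returning a matrix of
    pairwise covariances, sigma_n: noise standard deviation, x_star: (D,) test input."""
    n = len(y)
    K = k(X, X)
    L = cholesky(K + sigma_n**2 * np.eye(n), lower=True)          # line 2
    alpha = cho_solve((L, True), y)                               # alpha = L^T \ (L \ y), line 3
    k_star = k(X, x_star[None, :]).ravel()
    f_mean = k_star @ alpha                                       # predictive mean, eq. 2.25
    v = solve_triangular(L, k_star, lower=True)                   # line 5
    f_var = k(x_star[None, :], x_star[None, :]).item() - v @ v    # predictive variance, eq. 2.26
    log_ml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)  # eq. 2.30
    return f_mean, f_var, log_ml

def k_se(A, B):
    """Squared exponential covariance (eq. 2.16) for multi-dimensional inputs."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)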
2.3 Varying the Hyperparameters
Typically the covariance functions that we use will have some free parameters. For example, the squared-exponential covariance function in one dimension has the following form

$$k_y(x_p, x_q) = \sigma_f^2 \exp\Big(-\frac{1}{2\ell^2}(x_p - x_q)^2\Big) + \sigma_n^2 \delta_{pq}. \tag{2.31}$$
Figure 2.5: (a) Data is generated from a GP with hyperparameters $(\ell, \sigma_f, \sigma_n) = (1, 1, 0.1)$, as shown by the + symbols. Using Gaussian process prediction with these hyperparameters we obtain a 95% confidence region for the underlying function $f$ (shown in grey). Panels (b) and (c) again show the 95% confidence region, but this time for hyperparameter values $(0.3, 1.08, 0.00005)$ and $(3.0, 1.16, 0.89)$ respectively.
The covariance is denoted $k_y$ as it is for the noisy targets $y$ rather than for the underlying function $f$. Observe that the length-scale $\ell$, the signal variance $\sigma_f^2$ and the noise variance $\sigma_n^2$ can be varied. In general we call the free parameters hyperparameters.11
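For reference, a short sketch (ours) of the covariance of eq. 2.31 with explicit hyperparameters; note that the noise term is added as an identity matrix over the training cases, since the Kronecker delta is on the case index rather than on the input value (cf. footnote 10).

import numpy as np

def k_se(A, B, ell=1.0, sigma_f=1.0):
    """Signal part of the covariance in eq. 2.31 between 1-d input vectors A and B."""
    return sigma_f**2 * np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

# Covariance of the noisy training targets: K_y = K_f + sigma_n^2 I. The noise enters as an
# identity matrix over the training cases, since the Kronecker delta in eq. 2.31 is on the
# case index, not on the input value (cf. footnote 10).
x = np.linspace(-5, 5, 20)          # hypothetical training inputs
sigma_n = 0.1
K_y = k_se(x, x, ell=1.0, sigma_f=1.0) + sigma_n**2 * np.eye(len(x))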
In chapter 5 we will consider various methods for determining the hyperparameters from training data. However, in this section our aim is more simply to explore the effects of varying the hyperparameters on GP prediction. Consider the data shown by + signs in Figure 2.5(a). This was generated from a GP with the SE kernel with $(\ell, \sigma_f, \sigma_n) = (1, 1, 0.1)$. The figure also shows the $\pm 2$ standard-deviation error bars for the predictions obtained using these values of the hyperparameters, as per eq. 2.24. Notice how the error bars get larger for input values that are distant from any training points. Indeed if the $x$-axis
11We refer to the parameters of the covariance function as hyperparameters to emphasize that they are parameters of a nonparametric model; in accordance with the weightspace view, section 2.1, the parameters weights of the underlying parametric model have been integrated out.
were extended one would see the error bars reflect the prior standard deviation of the process $\sigma_f$ away from the data.
If we set the length-scale shorter so that $\ell = 0.3$ and kept the other parameters the same, then generating from this process we would expect to see plots like those in Figure 2.5(a) except that the $x$-axis should be rescaled by a factor of 0.3; equivalently if the same $x$-axis was kept as in Figure 2.5(a) then a sample function would look much more wiggly.
If we make predictions with a process with $\ell = 0.3$ on the data generated from the $\ell = 1$ process then we obtain the result in Figure 2.5(b). The remaining two parameters were set by optimizing the marginal likelihood, as explained in chapter 5. In this case the noise parameter is reduced to $\sigma_n = 0.00005$ as the greater flexibility of the signal means that the noise level can be reduced. This can be observed at the two datapoints near $x = 2.5$ in the plots. In Figure 2.5(a) ($\ell = 1$) these are essentially explained as a similar function value with differing noise. However, in Figure 2.5(b) ($\ell = 0.3$) the noise level is very low, so these two points have to be explained by a sharp variation in the value of the underlying function $f$. Notice also that the short length-scale means that the error bars in Figure 2.5(b) grow rapidly away from the datapoints.
In contrast, we can set the length-scale longer, for example to $\ell = 3$, as shown in Figure 2.5(c). Again the remaining two parameters were set by optimizing the marginal likelihood. In this case the noise level has been increased to $\sigma_n = 0.89$ and we see that the data is now explained by a slowly varying function with a lot of noise.
Of course we can take the position of a quickly-varying signal with low noise, or a slowly-varying signal with high noise to extremes; the former would give rise to a white-noise process model for the signal, while the latter would give rise to a constant signal with added white noise. Under both these models the datapoints produced should look like white noise. However, studying Figure 2.5(a) we see that white noise is not a convincing model of the data, as the sequence of $y$'s does not alternate sufficiently quickly but has correlations due to the variability of the underlying function. Of course this is relatively easy to see in one dimension, but methods such as the marginal likelihood discussed in chapter 5 generalize to higher dimensions and allow us to score the various models. In this case the marginal likelihood gives a clear preference for $(\ell, \sigma_f, \sigma_n) = (1, 1, 0.1)$ over the other two alternatives.
2.4 Decision Theory for Regression
In the previous sections we have shown how to compute predictive distributions for the outputs $y_*$ corresponding to the novel test input $x_*$. The predictive distribution is Gaussian with mean and variance given by eq. 2.25 and eq. 2.26. In practical applications, however, we are often forced to make a decision about how to act, i.e. we need a point-like prediction which is optimal in some sense. To this end we need a loss function, $\mathcal{L}(y_{\text{true}}, y_{\text{guess}})$, which specifies the loss (or penalty) incurred by guessing the value $y_{\text{guess}}$ when the true value is $y_{\text{true}}$.
For example, the loss function could equal the absolute deviation between the guess and the truth.
Notice that we computed the predictive distribution without reference to the loss function. In non-Bayesian paradigms, the model is typically trained by minimizing the empirical risk (or loss). In contrast, in the Bayesian setting there is a clear separation between the likelihood function (used for training, in addition to the prior) and the loss function. The likelihood function describes how the noisy measurements are assumed to deviate from the underlying noise-free function. The loss function, on the other hand, captures the consequences of making a specific choice, given an actual true state. The likelihood and loss function need not have anything in common.12
Our goal is to make the point prediction $y_{\text{guess}}$ which incurs the smallest loss, but how can we achieve that when we don't know $y_{\text{true}}$? Instead, we minimize the expected loss or risk, by averaging w.r.t. our model's opinion as to what the truth might be

$$\tilde{R}_{\mathcal{L}}(y_{\text{guess}}|x_*) = \int \mathcal{L}(y_*, y_{\text{guess}})\, p(y_*|x_*, \mathcal{D})\, dy_*. \tag{2.32}$$

Thus our best guess, in the sense that it minimizes the expected loss, is

$$y_{\text{optimal}}|x_* = \underset{y_{\text{guess}}}{\operatorname{argmin}}\; \tilde{R}_{\mathcal{L}}(y_{\text{guess}}|x_*). \tag{2.33}$$
In general the value of $y_{\text{guess}}$ that minimizes the risk for the loss function $|y_{\text{guess}} - y_*|$ is the median of $p(y_*|x_*, \mathcal{D})$, while for the squared loss $(y_{\text{guess}} - y_*)^2$ it is the mean of this distribution. When the predictive distribution is Gaussian the mean and the median coincide, and indeed for any symmetric loss function and symmetric predictive distribution we always get $y_{\text{guess}}$ as the mean of the predictive distribution. However, in many practical problems the loss functions can be asymmetric, e.g. in safety critical applications, and point predictions may be computed directly from eq. 2.32 and eq. 2.33. A comprehensive treatment of decision theory can be found in Berger [1985].
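When the loss is asymmetric, the minimization in eq. 2.33 can simply be carried out numerically under the Gaussian predictive distribution. The sketch below (ours; the 10:1 asymmetric loss is an arbitrary illustration, not from the book) approximates the risk of eq. 2.32 on a grid and minimizes it with a 1-d optimizer.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def optimal_point_prediction(mu, sigma, loss):
    """Minimize the risk of eq. 2.32 under a Gaussian predictive distribution N(mu, sigma^2),
    approximating the integral on a grid; loss(y_true, y_guess) may be asymmetric."""
    grid = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 2001)
    p = norm.pdf(grid, mu, sigma)
    def risk(y_guess):
        return np.trapz(loss(grid, y_guess) * p, grid)
    return minimize_scalar(risk, bounds=(grid[0], grid[-1]), method="bounded").x

# An illustrative asymmetric loss penalizing over-prediction ten times more heavily than
# under-prediction; the optimal guess then lies below the predictive mean.
asym = lambda y, g: np.where(g > y, 10.0 * (g - y), y - g)
print(optimal_point_prediction(0.0, 1.0, asym))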
2.5 An Example Application
In this section we use Gaussian process regression to learn the inverse dynamics of a seven degrees-of-freedom SARCOS anthropomorphic robot arm. The task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. This task has previously been used to study regression algorithms by Vijayakumar and Schaal [2000], Vijayakumar et al. [2002] and Vijayakumar et al. [2005].13 Following
12Beware of fallacious arguments like: a Gaussian likelihood implies a squared error loss function.
13We thank Sethu Vijayakumar for providing us with the data.
this previous work we present results below on just one of the seven mappings, from the 21 input variables to the first of the seven torques.
One might ask why it is necessary to learn this mapping; indeed there exist physics-based rigid-body-dynamics models which allow us to obtain the torques from the position, velocity and acceleration variables. However, the real robot arm is actuated hydraulically and is rather lightweight and compliant, so the assumptions of the rigid-body-dynamics model are violated (as we see below). It is worth noting that the rigid-body-dynamics model is non-linear, involving trigonometric functions and squares of the input variables.
An inverse dynamics model can be used in the following manner: a planning module decides on a trajectory that takes the robot from its start to goal states, and this specifies the desired positions, velocities and accelerations at each time. The inverse dynamics model is used to compute the torques needed to achieve this trajectory and errors are corrected using a feedback controller.
The dataset consists of 48,933 inputoutput pairs, of which 44,484 were used as a training set and the remaining 4,449 were used as a test set. The inputs were linearly rescaled to have zero mean and unit variance on the training set. The outputs were centered so as to have zero mean on the training set.
Given a prediction method, we can evaluate the quality of predictions in several ways. Perhaps the simplest is the squared error loss, where we compute the squared residual $(y_* - \bar{f}(x_*))^2$ between the mean prediction and the target at each test point. This can be summarized by the mean squared error (MSE), by averaging over the test set. However, this quantity is sensitive to the overall scale of the target values, so it makes sense to normalize by the variance of the targets of the test cases to obtain the standardized mean squared error (SMSE). This causes the trivial method of guessing the mean of the training targets to have a SMSE of approximately 1.
Additionally if we produce a predictive distribution at each test input we can evaluate the negative log probability of the target under the model.14 As GPR produces a Gaussian predictive density, one obtains
$$-\log p(y_*|\mathcal{D}, x_*) = \tfrac{1}{2} \log(2\pi\sigma_*^2) + \frac{(y_* - \bar{f}(x_*))^2}{2\sigma_*^2}, \tag{2.34}$$
where the predictive variance $\sigma_*^2$ for GPR is computed as $\sigma_*^2 = \mathbb{V}[f_*] + \sigma_n^2$, where $\mathbb{V}[f_*]$ is given by eq. 2.26; we must include the noise variance $\sigma_n^2$ as we are predicting the noisy target $y_*$. This loss can be standardized by subtracting the loss that would be obtained under the trivial model which predicts using a Gaussian with the mean and variance of the training data. We denote this the standardized log loss (SLL). The mean SLL is denoted MSLL. Thus the MSLL will be approximately zero for simple methods and negative for better methods.
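The two scores are straightforward to compute from a set of predictions; a small sketch (ours, with hypothetical function names) follows, where the MSLL standardizes against a trivial Gaussian model fitted to the training targets.

import numpy as np

def smse(y_test, mean_pred):
    """Standardized mean squared error: MSE normalized by the variance of the test targets."""
    return np.mean((y_test - mean_pred) ** 2) / np.var(y_test)

def msll(y_test, mean_pred, var_pred, y_train):
    """Mean standardized log loss. var_pred must include the noise variance sigma_n^2,
    since we predict noisy targets (eq. 2.34); the loss is standardized by that of a
    trivial Gaussian model with the mean and variance of the training targets."""
    nll = 0.5 * np.log(2 * np.pi * var_pred) + (y_test - mean_pred) ** 2 / (2 * var_pred)
    mu0, var0 = y_train.mean(), y_train.var()
    nll0 = 0.5 * np.log(2 * np.pi * var0) + (y_test - mu0) ** 2 / (2 * var0)
    return np.mean(nll - nll0)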
A number of models were tested on the data. A linear regression (LR) model provides a simple baseline for the SMSE. By estimating the noise level from the
14It makes sense to use the negative log probability so as to obtain a loss, not a utility.
Table 2.1: Test results on the inverse dynamics problem for a number of different methods. The dash denotes a missing entry, caused by two methods not producing full predictive distributions, so MSLL could not be evaluated.

  Method    SMSE     MSLL
  LR        0.075    -1.29
  RBD       0.104      -
  LWPR      0.040      -
  GPR       0.011    -2.25
residuals on the training set one can also obtain a predictive variance and thus get a MSLL value for LR. The rigid-body-dynamics (RBD) model has a number of free parameters; these were estimated by Vijayakumar et al. [2005] using a least-squares fitting procedure. We also give results for the locally weighted projection regression (LWPR) method of Vijayakumar et al. [2005] which is an on-line method that cycles through the dataset multiple times. For the GP models it is computationally expensive to make use of all 44,484 training cases due to the $O(n^3)$ scaling of the basic algorithm. In chapter 8 we present several different approximate GP methods for large datasets. The result given in Table 2.1 was obtained with the subset of regressors (SR) approximation with a subset size of 4096. This result is taken from Table 8.1, which gives full results of the various approximation methods applied to the inverse dynamics problem. The squared exponential covariance function was used with a separate length-scale parameter for each of the 21 input dimensions, plus the signal and noise variance parameters $\sigma_f^2$ and $\sigma_n^2$. These parameters were set by optimizing the marginal likelihood eq. 2.30 on a subset of the data (see also chapter 5).
The results for the various methods are presented in Table 2.1. Notice that the problem is quite nonlinear, so the linear regression model does poorly in comparison to nonlinear methods.15 The nonlinear method LWPR improves over linear regression, but is outperformed by GPR.
2.6 Smoothing, Weight Functions and Equivalent Kernels
Gaussian process regression aims to reconstruct the underlying signal $f$ by removing the contaminating noise $\varepsilon$. To do this it computes a weighted average of the noisy observations $y$ as $\bar{f}(x_*) = k(x_*)^\top (K + \sigma_n^2 I)^{-1} y$; as $\bar{f}(x_*)$ is a linear combination of the $y$ values, Gaussian process regression is a linear smoother (see Hastie and Tibshirani [1990, sec. 2.8] for further details). In this section we study smoothing first in terms of a matrix analysis of the predictions at the training points, and then in terms of the equivalent kernel.
15It is perhaps surprising that RBD does worse than linear regression. However, Stefan Schaal (pers. comm., 2004) states that the RBD parameters were optimized on a very large dataset, of which the training data used here is a subset, and if the RBD model were optimized w.r.t. this training set one might well expect it to outperform linear regression.
The predicted mean values $\bar{f}$ at the training points are given by

$$\bar{f} = K(K + \sigma_n^2 I)^{-1} y. \tag{2.35}$$
Let $K$ have the eigendecomposition $K = \sum_{i=1}^{n} \lambda_i u_i u_i^\top$, where $\lambda_i$ is the $i$th eigenvalue and $u_i$ is the corresponding eigenvector. As $K$ is real and symmetric positive semidefinite, its eigenvalues are real and non-negative, and its eigenvectors are mutually orthogonal. Let $y = \sum_{i=1}^{n} \gamma_i u_i$ for some coefficients $\gamma_i = u_i^\top y$. Then
$$\bar{f} = \sum_{i=1}^{n} \frac{\gamma_i \lambda_i}{\lambda_i + \sigma_n^2}\, u_i. \tag{2.36}$$
Notice that if $\lambda_i/(\lambda_i + \sigma_n^2) \ll 1$ then the component in $y$ along $u_i$ is effectively eliminated. For most covariance functions that are used in practice the eigenvalues are larger for more slowly varying eigenvectors (e.g. fewer zero-crossings) so that this means that high-frequency components in $y$ are smoothed out. The effective number of parameters or degrees of freedom of the smoother is defined as $\operatorname{tr}\big(K(K + \sigma_n^2 I)^{-1}\big) = \sum_{i=1}^{n} \lambda_i/(\lambda_i + \sigma_n^2)$, see Hastie and Tibshirani [1990, sec. 3.5]. Notice that this counts the number of eigenvectors which are not eliminated.
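The eigen-analysis above translates directly into code; the following sketch (ours) computes the smoothed values of eq. 2.36 and the effective degrees of freedom from the eigendecomposition of $K$.

import numpy as np

def smoother_analysis(K, y, sigma_n):
    """Eigen-analysis of the GP linear smoother at the training points: returns the smoothed
    values of eq. 2.36 and the effective degrees of freedom."""
    lam, U = np.linalg.eigh(K)               # K = sum_i lam_i u_i u_i^T
    gamma = U.T @ y                          # coefficients of y in the eigenbasis
    shrink = lam / (lam + sigma_n**2)        # per-component shrinkage factors
    f_bar = U @ (shrink * gamma)             # eq. 2.36; equals K (K + sigma_n^2 I)^{-1} y
    dof = shrink.sum()                       # tr[K (K + sigma_n^2 I)^{-1}]
    return f_bar, dof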
We can define a vector of functions $h(x_*) = (K + \sigma_n^2 I)^{-1} k(x_*)$. Thus we
have $\bar{f}(x_*) = h(x_*)^\top y$, making it clear that the mean prediction at a point $x_*$ is a linear combination of the target values $y$. For a fixed test point $x_*$, $h(x_*)$ gives the vector of weights applied to targets $y$. $h(x_*)$ is called the weight function [Silverman, 1984]. As Gaussian process regression is a linear smoother, the weight function does not depend on $y$. Note the difference between a linear model, where the prediction is a linear combination of the inputs, and a linear smoother, where the prediction is a linear combination of the training set targets.
Understanding the form of the weight function is made complicated by the matrix inversion of $K + \sigma_n^2 I$ and the fact that $K$ depends on the specific locations of the $n$ datapoints. Idealizing the situation one can consider the observations to be smeared out in $x$-space at some density of observations. In this case analytic tools can be brought to bear on the problem, as shown in section 7.1. By analogy to kernel smoothing, Silverman [1984] called the idealized weight function the equivalent kernel; see also Girosi et al. [1995, sec. 2.1].
A kernel smoother centres a kernel function $\kappa$16 on $x_*$ and then computes $\kappa_i = \kappa\big((x_i - x_*)/\ell\big)$ for each data point $(x_i, y_i)$, where $\ell$ is a length-scale. The Gaussian is a commonly used kernel function. The prediction for $f(x_*)$ is computed as $\hat{f}(x_*) = \sum_{i=1}^{n} w_i y_i$ where $w_i = \kappa_i / \sum_{j=1}^{n} \kappa_j$. This is also known as the Nadaraya-Watson estimator, see e.g. Scott [1992, sec. 8.1].
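For comparison with the GP weight function, a minimal sketch (ours) of the Nadaraya-Watson kernel smoother with a Gaussian kernel:

import numpy as np

def nadaraya_watson(x_star, x, y, ell=0.1):
    """Kernel smoother (Nadaraya-Watson estimator) with a Gaussian kernel: the prediction is a
    convex combination of the targets with weights kappa_i / sum_j kappa_j."""
    kappa = np.exp(-0.5 * ((x - x_star) / ell) ** 2)
    return (kappa / kappa.sum()) @ y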
The weight function and equivalent kernel for a Gaussian process are illustrated in Figure 2.6 for a one-dimensional input variable $x$. We have used the squared exponential covariance function and have set the length-scale $\ell = 0.0632$ (so that $\ell^2 = 0.004$). There are $n = 50$ training points spaced randomly along
16Note that this kernel function does not need to be a valid covariance function.
Figure 2.6: Panels (a)-(c) show the weight function $h(x_*)$ (dots) corresponding to the $n = 50$ training points, the equivalent kernel (solid) and the original squared exponential kernel (dashed). Panel (d) shows the equivalent kernels for two different data densities. See text for further details. The small cross at the test point is to scale in all four plots.
the $x$-axis. Figures 2.6(a) and 2.6(b) show the weight function and equivalent kernel for $x_* = 0.5$ and $x_* = 0.05$ respectively, for $\sigma_n^2 = 0.1$. Figure 2.6(c) is also for $x_* = 0.5$ but uses $\sigma_n^2 = 10$. In each case the dots correspond to the weight function $h(x_*)$ and the solid line is the equivalent kernel, whose construction is explained below. The dashed line shows a squared exponential kernel centered on the test point, scaled to have the same height as the maximum value in the equivalent kernel. Figure 2.6(d) shows the variation in the equivalent kernel as a function of $n$, the number of datapoints in the unit interval.
Many interesting observations can be made from these plots. Observe that the equivalent kernel has (in general) a shape quite different to the original SE kernel. In Figure 2.6(a) the equivalent kernel is clearly oscillatory (with negative sidelobes) and has a higher spatial frequency than the original kernel. Figure 2.6(b) shows similar behaviour although due to edge effects the equivalent kernel is truncated relative to that in Figure 2.6(a). In Figure 2.6(c) we see that at higher noise levels the negative sidelobes are reduced and the width of the equivalent kernel is similar to the original kernel. Also note that the overall height of the equivalent kernel in (c) is reduced compared to that in (a) and
(b); it averages over a wider area. The more oscillatory equivalent kernel for lower noise levels can be understood in terms of the eigenanalysis above; at higher noise levels only the large, slowly varying components of $y$ remain, while for smaller noise levels the more oscillatory components are also retained.
In Figure 2.6(d) we have plotted the equivalent kernel for $n = 10$ and $n = 250$ datapoints in $[0, 1]$; notice how the width of the equivalent kernel decreases as $n$ increases. We discuss this behaviour further in section 7.1.
The plots of equivalent kernels in Figure 2.6 were made by using a dense grid of $n_{\text{grid}}$ points on $[0, 1]$ and then computing the smoother matrix $K(K + \sigma_{\text{grid}}^2 I)^{-1}$. Each row of this matrix is the equivalent kernel at the appropriate grid location. However, in order to get the scaling right one has to set $\sigma_{\text{grid}}^2 = \sigma_n^2\, n_{\text{grid}}/n$; for $n_{\text{grid}} > n$ this means that the effective variance at each of the $n_{\text{grid}}$ points is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation if there were $n_{\text{grid}}/n$ independent Gaussian observations with variance $\sigma_{\text{grid}}^2$ at a single $x$-position; this would be equivalent to one Gaussian observation with variance $\sigma_n^2$. In effect the $n$ observations have been smoothed out uniformly along the interval. The form of the equivalent kernel can be obtained analytically if we go to the continuum limit and look to smooth a noisy function. The relevant theory and some example equivalent kernels are given in section 7.1.
2.7 Incorporating Explicit Basis Functions
It is common but by no means necessary to consider GPs with a zero mean function. Note that this is not necessarily a drastic limitation, since the mean of the posterior process is not confined to be zero. Yet there are several reasons why one might wish to explicitly model a mean function, including interpretability of the model, convenience of expressing prior information and a number of analytical limits which we will need in subsequent chapters. The use of explicit basis functions is a way to specify a non-zero mean over functions, but as we will see in this section, one can also use them to achieve other interesting effects.
Using a fixed (deterministic) mean function $m(x)$ is trivial: simply apply the usual zero mean GP to the difference between the observations and the fixed mean function. With

$$f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \tag{2.37}$$

the predictive mean becomes

$$\bar{f}_* = m(X_*) + K(X_*, X) K_y^{-1}\big(y - m(X)\big), \tag{2.38}$$

where $K_y = K + \sigma_n^2 I$, and the predictive variance remains unchanged from eq. 2.24.
However, in practice it can often be difficult to specify a fixed mean function. In many cases it may be more convenient to specify a few fixed basis functions,
whose coefficients, $\beta$, are to be inferred from the data. Consider

$$g(x) = f(x) + h(x)^\top \beta, \quad \text{where } f(x) \sim \mathcal{GP}\big(0, k(x, x')\big), \tag{2.39}$$
here $f(x)$ is a zero mean GP, $h(x)$ are a set of fixed basis functions, and $\beta$ are additional parameters. This formulation expresses that the data is close to a global linear model with the residuals being modelled by a GP. This idea was explored explicitly as early as 1975 by Blight and Ott [1975], who used the GP to model the residuals from a polynomial regression, i.e. $h(x) = (1, x, x^2, \ldots)^\top$. When fitting the model, one could optimize over the parameters $\beta$ jointly with the hyperparameters of the covariance function. Alternatively, if we take the prior on $\beta$ to be Gaussian, $\beta \sim \mathcal{N}(b, B)$, we can also integrate out these parameters. Following O'Hagan [1978] we obtain another GP
$$g(x) \sim \mathcal{GP}\big(h(x)^\top b,\;\; k(x, x') + h(x)^\top B\, h(x')\big), \tag{2.40}$$
now with an added contribution in the covariance function caused by the uncertainty in the parameters of the mean. Predictions are made by plugging the mean and covariance functions of $g(x)$ into eq. 2.39 and eq. 2.24. After rearranging, we obtain
g X H KK1yH fX R, y
2.41 where the H matrix collects the hx vectors for all training and H all test
cases,B1HK1H1HK1yB1b, and RHHK1K . yyy
Notice the nice interpretation of the mean expression, eq. 2.41 top line: is the mean of the global linear model parameters, being a compromise between the data term and prior, and the predictive mean is simply the mean linear output plus what the GP model predicts from the residuals. The covariance is the sum of the usual covariance term and a new nonnegative contribution.
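The expressions in eq. (2.41) translate almost line by line into code. The sketch below uses hypothetical function and argument names and leaves the construction of $K$, $K_*$, $K_{**}$, $H$ and $H_*$ to the caller; a production implementation would typically use a Cholesky factorization of $K_y$ rather than repeated calls to a generic solver.

```python
import numpy as np

def predict_with_basis(K, Ks, Kss, H, Hs, y, b, B, sigma2_n):
    """GP prediction with explicit basis functions and a Gaussian prior
    beta ~ N(b, B), following eq. (2.41).  H is m x n (training), Hs is m x n_*."""
    n = len(y)
    Ky = K + sigma2_n * np.eye(n)
    Ky_inv_y = np.linalg.solve(Ky, y)
    Ky_inv_Ks = np.linalg.solve(Ky, Ks)      # K_y^{-1} K_*
    Ky_inv_Ht = np.linalg.solve(Ky, H.T)     # K_y^{-1} H^T

    A = np.linalg.inv(B) + H @ Ky_inv_Ht     # B^{-1} + H K_y^{-1} H^T
    beta_bar = np.linalg.solve(A, H @ Ky_inv_y + np.linalg.solve(B, b))
    R = Hs - H @ Ky_inv_Ks                   # R = H_* - H K_y^{-1} K_*

    f_mean = Ks.T @ Ky_inv_y                 # zero-mean GP predictive mean
    f_cov = Kss - Ks.T @ Ky_inv_Ks           # predictive covariance, eq. (2.24)

    g_mean = f_mean + R.T @ beta_bar
    g_cov = f_cov + R.T @ np.linalg.solve(A, R)
    return g_mean, g_cov
```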
Exploring the limit of the above expressions as the prior on the $\boldsymbol\beta$ parameters becomes vague, $B^{-1} \to O$ (where $O$ is the matrix of zeros), we obtain a predictive distribution which is independent of $\mathbf{b}$
\[
\begin{aligned}
\bar{\mathbf{g}}(X_*) &= \bar{\mathbf{f}}(X_*) + R^\top \bar{\boldsymbol\beta}, \\
\operatorname{cov}(\mathbf{g}_*) &= \operatorname{cov}(\mathbf{f}_*) + R^\top \big(H K_y^{-1} H^\top\big)^{-1} R,
\end{aligned} \tag{2.42}
\]
where the limiting $\bar{\boldsymbol\beta} = \big(H K_y^{-1} H^\top\big)^{-1} H K_y^{-1}\mathbf{y}$. Notice that predictions under the limit $B^{-1} \to O$ should not be implemented naïvely by plugging the modified covariance function from eq. (2.40) into the standard prediction equations, since the entries of the covariance function tend to infinity, thus making it unsuitable for numerical implementation. Instead eq. (2.42) must be used. Even if the non-limiting case is of interest, eq. (2.41) is numerically preferable to a direct implementation based on eq. (2.40), since the global linear part will often add some very large eigenvalues to the covariance matrix, affecting its condition number.
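In the same spirit, the vague-prior limit of eq. (2.42) can be computed directly, avoiding the numerically ill-behaved route of plugging a very broad $B$ into eq. (2.40). Again, the function name and signature below are illustrative only:

```python
import numpy as np

def predict_with_basis_vague(K, Ks, Kss, H, Hs, y, sigma2_n):
    """Vague-prior limit B^{-1} -> O, eq. (2.42): the prior mean b drops out
    and A reduces to H K_y^{-1} H^T."""
    n = len(y)
    Ky = K + sigma2_n * np.eye(n)
    Ky_inv_y = np.linalg.solve(Ky, y)
    Ky_inv_Ks = np.linalg.solve(Ky, Ks)
    Ky_inv_Ht = np.linalg.solve(Ky, H.T)

    A = H @ Ky_inv_Ht                        # H K_y^{-1} H^T
    beta_bar = np.linalg.solve(A, H @ Ky_inv_y)
    R = Hs - H @ Ky_inv_Ks

    f_mean = Ks.T @ Ky_inv_y
    f_cov = Kss - Ks.T @ Ky_inv_Ks
    return f_mean + R.T @ beta_bar, f_cov + R.T @ np.linalg.solve(A, R)
```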
2.7.1 Marginal Likelihood
In this short section we briefly discuss the marginal likelihood for the model with a Gaussian prior $\boldsymbol\beta \sim \mathcal{N}(\mathbf{b}, B)$ on the explicit parameters from eq. (2.40), as this will be useful later, particularly in section 6.3.1. We can express the marginal likelihood from eq. (2.30) as
\[
\log p(\mathbf{y}\,|\,X, \mathbf{b}, B) = -\tfrac{1}{2}\big(H^\top\mathbf{b} - \mathbf{y}\big)^\top \big(K_y + H^\top B H\big)^{-1}\big(H^\top\mathbf{b} - \mathbf{y}\big) - \tfrac{1}{2}\log\big|K_y + H^\top B H\big| - \tfrac{n}{2}\log 2\pi, \tag{2.43}
\]
where we have included the explicit mean. We are interested in exploring the limit where $B^{-1} \to O$, i.e. when the prior is vague. In this limit the mean of the prior is irrelevant (as was the case in eq. (2.42)), so without loss of generality (for the limiting case) we assume for now that the mean is zero, $\mathbf{b} = \mathbf{0}$, giving
\[
\begin{aligned}
\log p(\mathbf{y}\,|\,X, \mathbf{b}=\mathbf{0}, B) = {}& -\tfrac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} + \tfrac{1}{2}\mathbf{y}^\top C\,\mathbf{y} \\
& -\tfrac{1}{2}\log|K_y| - \tfrac{1}{2}\log|B| - \tfrac{1}{2}\log|A| - \tfrac{n}{2}\log 2\pi,
\end{aligned} \tag{2.44}
\]
where $A = B^{-1} + H K_y^{-1} H^\top$ and $C = K_y^{-1} H^\top A^{-1} H K_y^{-1}$ and we have used the matrix inversion lemma, eq. (A.9) and eq. (A.10).
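For reference, eq. (2.43) is simply the log density of a Gaussian in $\mathbf{y}$ with mean $H^\top\mathbf{b}$ and covariance $K_y + H^\top B H$. A direct, illustrative transcription follows (hypothetical names; in practice one would use a Cholesky factorization rather than a generic solve and slogdet):

```python
import numpy as np

def log_marglik_basis(K, H, y, b, B, sigma2_n):
    """Log marginal likelihood with explicit basis functions and beta ~ N(b, B),
    eq. (2.43): a Gaussian in y with mean H^T b and covariance K_y + H^T B H."""
    n = len(y)
    Ky = K + sigma2_n * np.eye(n)
    C = Ky + H.T @ B @ H               # covariance of y after integrating out beta
    r = H.T @ b - y
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * r @ np.linalg.solve(C, r) - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)
```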
We now explore the behaviour of the log marginal likelihood in the limit of vague priors on $\boldsymbol\beta$. In this limit the variances of the Gaussian in the directions spanned by the columns of $H^\top$ will become infinite, and it is clear that this will require special treatment. The log marginal likelihood consists of three terms: a quadratic form in $\mathbf{y}$, a log determinant term, and a term involving $\log 2\pi$. Performing an eigendecomposition of the covariance matrix we see that the contributions to the quadratic form term from the infinite-variance directions will be zero. However, the log determinant term will tend to minus infinity. The standard solution [Wahba, 1985, Ansley and Kohn, 1985] in this case is to project $\mathbf{y}$ onto the directions orthogonal to the span of $H^\top$ and compute the marginal likelihood in this subspace. Let the rank of $H^\top$ be $m$. Then as shown in Ansley and Kohn [1985] this means that we must discard the terms $-\tfrac{1}{2}\log|B| - \tfrac{m}{2}\log 2\pi$ from eq. (2.44) to give
\[
\log p(\mathbf{y}\,|\,X) = -\tfrac{1}{2}\mathbf{y}^\top K_y^{-1}\mathbf{y} + \tfrac{1}{2}\mathbf{y}^\top C\,\mathbf{y} - \tfrac{1}{2}\log|K_y| - \tfrac{1}{2}\log|A| - \tfrac{n-m}{2}\log 2\pi, \tag{2.45}
\]
where $A = H K_y^{-1} H^\top$ and $C = K_y^{-1} H^\top A^{-1} H K_y^{-1}$.
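A corresponding sketch of eq. (2.45), with the discarded terms simply absent (again the names and signature are assumptions, and a careful implementation would factor $K_y$ once and reuse it):

```python
import numpy as np

def log_marglik_vague(K, H, y, sigma2_n):
    """Limiting log marginal likelihood, eq. (2.45), after projecting out the
    directions spanned by the basis functions (terms -1/2 log|B| and
    -(m/2) log 2*pi discarded)."""
    n, m = len(y), H.shape[0]
    Ky = K + sigma2_n * np.eye(n)
    Ky_inv_y = np.linalg.solve(Ky, y)
    Ky_inv_Ht = np.linalg.solve(Ky, H.T)
    A = H @ Ky_inv_Ht                                   # A = H K_y^{-1} H^T
    Cy = Ky_inv_Ht @ np.linalg.solve(A, H @ Ky_inv_y)   # C y, with C = K_y^{-1} H^T A^{-1} H K_y^{-1}
    _, logdet_Ky = np.linalg.slogdet(Ky)
    _, logdet_A = np.linalg.slogdet(A)
    return (-0.5 * y @ Ky_inv_y + 0.5 * y @ Cy
            - 0.5 * logdet_Ky - 0.5 * logdet_A
            - 0.5 * (n - m) * np.log(2 * np.pi))
```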
2.8 History and Related Work
Prediction with Gaussian processes is certainly not a very recent topic, especially for time series analysis; the basic theory goes back at least as far as the work of Wiener [1949] and Kolmogorov [1941] in the 1940s. Indeed Lauritzen [1981] discusses relevant work by the Danish astronomer T. N. Thiele dating from 1880.
Gaussian process prediction is also well known in the geostatistics field (see, e.g. Matheron, 1973; Journel and Huijbregts, 1978) where it is known as kriging,17 and in meteorology [Thompson, 1956, Daley, 1991] although this literature naturally has focussed mostly on two- and three-dimensional input spaces. Whittle [1963, sec. 5.4] also suggests the use of such methods for spatial prediction. Ripley [1981] and Cressie [1993] provide useful overviews of Gaussian process prediction in spatial statistics.
Gradually it was realized that Gaussian process prediction could be used in a general regression context. For example O'Hagan [1978] presents the general theory as given in our equations 2.23 and 2.24, and applies it to a number of one-dimensional regression problems. Sacks et al. [1989] describe GPR in the context of computer experiments where the observations y are noise free and discuss a number of interesting directions such as the optimization of parameters in the covariance function (see our chapter 5) and experimental design, i.e. the choice of x-points that provide most information on f. The authors describe a number of computer simulations that were modelled, including an example where the response variable was the clock asynchronization in a circuit and the inputs were six transistor widths. Santner et al. [2003] is a recent book on the use of GPs for the design and analysis of computer experiments.
Williams and Rasmussen [1996] described Gaussian process regression in a machine learning context, and described optimization of the parameters in the covariance function, see also Rasmussen [1996]. They were inspired to use Gaussian processes by the connection to infinite neural networks as described in section 4.2.3 and in Neal [1996]. The kernelization of linear ridge regression described above is also known as kernel ridge regression, see e.g. Saunders et al. [1998].
Relationships between Gaussian process prediction and regularization theory, splines, support vector machines (SVMs) and relevance vector machines (RVMs) are discussed in chapter 6.
2.9 Exercises
1. Replicate the generation of random functions from Figure 2.2. Use a regular or random grid of scalar inputs and the covariance function from eq. 2.16. Hints on how to generate random samples from multivariate Gaussian distributions are given in section A.2. Invent some training data points, and make random draws from the resulting GP posterior using eq. 2.19.
2. In eq. (2.11) we saw that the predictive variance at $\mathbf{x}_*$ under the feature space regression model was $\operatorname{var}(f(\mathbf{x}_*)) = \boldsymbol\phi(\mathbf{x}_*)^\top A^{-1} \boldsymbol\phi(\mathbf{x}_*)$. Show that $\operatorname{cov}(f(\mathbf{x}_*), f(\mathbf{x}_*')) = \boldsymbol\phi(\mathbf{x}_*)^\top A^{-1} \boldsymbol\phi(\mathbf{x}_*')$. Check that this is compatible with the expression given in eq. (2.24).
17 Matheron named the method after the South African mining engineer D. G. Krige.
3. The Wiener process is defined for $x \geq 0$ and has $f(0) = 0$. (See section B.2.1 for further details.) It has mean zero and a non-stationary covariance function $k(x, x') = \min(x, x')$. If we condition on the Wiener process passing through $f(1) = 0$ we obtain a process known as the Brownian bridge (or tied-down Wiener process). Show that this process has covariance $k(x, x') = \min(x, x') - x x'$ for $0 \leq x, x' \leq 1$ and mean 0. Write a computer program to draw samples from this process at a finite grid of $x$ points in $[0, 1]$. (One possible sampler is sketched after the exercises.)
4. Let $\operatorname{var}_n(f(\mathbf{x}_*))$ be the predictive variance of a Gaussian process regression model at $\mathbf{x}_*$ given a dataset of size $n$. The corresponding predictive variance using a dataset of only the first $n-1$ training points is denoted $\operatorname{var}_{n-1}(f(\mathbf{x}_*))$. Show that $\operatorname{var}_n(f(\mathbf{x}_*)) \leq \operatorname{var}_{n-1}(f(\mathbf{x}_*))$, i.e. that the predictive variance at $\mathbf{x}_*$ cannot increase as more training data is obtained. One way to approach this problem is to use the partitioned matrix equations given in section A.3 to decompose $\operatorname{var}_n(f(\mathbf{x}_*)) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K_n + \sigma_n^2 I)^{-1} \mathbf{k}_*$. An alternative information theoretic argument is given in Williams and Vivarelli [2000]. Note that while this conclusion is true for Gaussian process priors and Gaussian noise models it does not hold generally, see Barber and Saad [1996].
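A minimal sampler for exercise 3, assuming the Brownian bridge covariance $k(x, x') = \min(x, x') - x x'$ derived there (grid size, jitter and seed are arbitrary illustrative choices):

```python
import numpy as np

def sample_brownian_bridge(n_grid=200, n_samples=3, jitter=1e-10, seed=0):
    """Draw samples of the Brownian bridge (tied-down Wiener process) on a grid
    of interior points of (0, 1), using k(x, x') = min(x, x') - x x'."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid + 2)[1:-1]           # strictly inside (0, 1)
    K = np.minimum(x[:, None], x[None, :]) - np.outer(x, x)
    L = np.linalg.cholesky(K + jitter * np.eye(n_grid))   # jitter for numerical stability
    return x, L @ rng.standard_normal((n_grid, n_samples))
```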