C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. © 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml
Chapter 3
Classification
In chapter 2 we have considered regression problems, where the targets are real valued. Another important class of problems is classification¹ problems, where we wish to assign an input pattern x to one of C classes, C_1, …, C_C. Practical examples of classification problems are handwritten digit recognition (where we wish to classify a digitized image of a handwritten digit into one of ten classes, 0, …, 9), and the classification of objects detected in astronomical sky surveys into stars or galaxies. (Information on the distribution of galaxies in the universe is important for theories of the early universe.) These examples nicely illustrate that classification problems can either be binary (or two-class, C = 2) or multi-class (C > 2).
We will focus attention on probabilistic classification, where test predictions take the form of class probabilities; this contrasts with methods which provide only a guess at the class label, and this distinction is analogous to the difference between predictive distributions and point predictions in the regression setting. Since generalization to test cases inherently involves some level of uncertainty, it seems natural to attempt to make predictions in a way that reflects these uncertainties. In a practical application one may well seek a class guess, which can be obtained as the solution to a decision problem, involving the predictive probabilities as well as a specification of the consequences of making specific predictions (the loss function).
Both classification and regression can be viewed as function approximation problems. Unfortunately, the solution of classification problems using Gaussian processes is rather more demanding than for the regression problems considered in chapter 2. This is because we assumed in the previous chapter that the likelihood function was Gaussian; a Gaussian process prior combined with a Gaussian likelihood gives rise to a posterior Gaussian process over functions, and everything remains analytically tractable. For classification models, where the targets are discrete class labels, the Gaussian likelihood is inappropriate;2
1In the statistics literature classification is often called discrimination.
2One may choose to ignore the discreteness of the target values, and use a regression treatment, where all targets happen to be say ±1 for binary classification. This is known as least-squares classification, see section 6.5.

in this chapter we treat methods of approximate inference for classification, where exact inference is not feasible.3
Section 3.1 provides a general discussion of classification problems, and describes the generative and discriminative approaches to these problems. In section 2.1 we saw how Gaussian process regression (GPR) can be obtained by generalizing linear regression. In section 3.2 we describe an analogue of linear regression in the classification case, logistic regression. In section 3.3 logistic regression is generalized to yield Gaussian process classification (GPC) using again the ideas behind the generalization of linear regression to GPR. For GPR the combination of a GP prior with a Gaussian likelihood gives rise to a posterior which is again a Gaussian process. In the classification case the likelihood is non-Gaussian but the posterior process can be approximated by a GP. The Laplace approximation for GPC is described in section 3.4 for binary classification and in section 3.5 for multi-class classification, and the expectation propagation algorithm for binary classification is described in section 3.6. Both of these methods make use of a Gaussian approximation to the posterior. Experimental results for GPC are given in section 3.7, and a discussion of these results is provided in section 3.8.
3.1 Classification Problems
The natural starting point for discussing approaches to classification is the joint probability p(y, x), where y denotes the class label. Using Bayes' theorem this joint probability can be decomposed either as p(y)p(x|y) or as p(x)p(y|x). This gives rise to two different approaches to classification problems. The first, which we call the generative approach, models the class-conditional distributions p(x|y) for y = C_1, …, C_C and also the prior probabilities of each class, and then computes the posterior probability for each class using
$$p(y|\mathbf{x}) = \frac{p(y)\,p(\mathbf{x}|y)}{\sum_{c=1}^{C} p(\mathcal{C}_c)\,p(\mathbf{x}|\mathcal{C}_c)}. \qquad (3.1)$$
The alternative approach, which we call the discriminative approach, focusses on modelling p(y|x) directly. Dawid [1976] calls the generative and discriminative approaches the sampling and diagnostic paradigms, respectively.
To turn both the generative and discriminative approaches into practical methods we will need to create models for either p(x|y), or p(y|x), respectively.⁴ These could either be of parametric form, or non-parametric models such as those based on nearest neighbours. For the generative case a simple, common
3Note that the important distinction is between Gaussian and non-Gaussian likelihoods; regression with a non-Gaussian likelihood requires a similar treatment, but since classification defines an important conceptual and application area, we have chosen to treat it in a separate chapter; for non-Gaussian likelihoods in general, see section 9.3.
4For the generative approach inference for p(y) is generally straightforward, being estimation of a binomial probability in the binary case, or a multinomial probability in the multi-class case.

choice would be to model the class-conditional densities with Gaussians: p(x|C_c) = N(μ_c, Σ_c). A Bayesian treatment can be obtained by placing appropriate priors on the mean and covariance of each of the Gaussians. However, note that this Gaussian model makes a strong assumption on the form of the class-conditional density and if this is inappropriate the model may perform poorly.
For the binary discriminative case one simple idea is to turn the output of a regression model into a class probability using a response function (the inverse of a link function), which "squashes" its argument, which can lie in the domain (−∞, ∞), into the range [0, 1], guaranteeing a valid probabilistic interpretation.
One example is the linear logistic regression model

$$p(\mathcal{C}_1|\mathbf{x}) = \lambda(\mathbf{x}^\top\mathbf{w}), \quad \text{where}\;\; \lambda(z) = \frac{1}{1+\exp(-z)}, \qquad (3.2)$$
which combines the linear model with the logistic response function. Another common choice of response function is the cumulative density function of a standard normal distribution Φ(z) = ∫_{−∞}^{z} N(x|0, 1) dx. This approach is known
as probit regression. Just as we gave a Bayesian approach to linear regression in chapter 2 we can take a parallel approach to logistic regression, as discussed in section 3.2. As in the regression case, this model is an important step towards the Gaussian process classifier.
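For concreteness, here is a minimal sketch (not from the book) of the two response functions just described, the logistic λ(z) of eq. (3.2) and the cumulative Gaussian Φ(z), using NumPy/SciPy:

```python
# Minimal sketch of the two response functions; both map (-inf, inf) to [0, 1].
import numpy as np
from scipy.stats import norm

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))    # lambda(z), eq. (3.2)

def probit(z):
    return norm.cdf(z)                 # Phi(z), used in probit regression

z = np.linspace(-4, 4, 9)
print(logistic(z))
print(probit(z))
```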
Given that there are the generative and discriminative approaches, which one should we prefer? This is perhaps the biggest question in classification, and we do not believe that there is a right answer, as both ways of writing the joint p(y, x) are correct. However, it is possible to identify some strengths and weaknesses of the two approaches. The discriminative approach is appealing in that it is directly modelling what we want, p(y|x). Also, density estimation for the class-conditional distributions is a hard problem, particularly when x is high dimensional, so if we are just interested in classification then the generative approach may mean that we are trying to solve a harder problem than we need to. However, to deal with missing input values, outliers and unlabelled data points in a principled fashion it is very helpful to have access to p(x), and this can be obtained from marginalizing out the class label y from the joint as p(x) = Σ_y p(y)p(x|y) in the generative approach. A further factor in the choice of a generative or discriminative approach could also be which one is most conducive to the incorporation of any prior information which is available. See Ripley [1996, sec. 2.1] for further discussion of these issues. The Gaussian process classifiers developed in this chapter are discriminative.
3.1.1 Decision Theory for Classification
The classifiers described above provide predictive probabilities p(y|x) for a test input x. However, sometimes one actually needs to make a decision and to do this we need to consider decision theory. Decision theory for the regression problem was considered in section 2.4; here we discuss decision theory for classification problems. A comprehensive treatment of decision theory can be found in Berger [1985].

Let L(c, c′) be the loss incurred by making decision c′ if the true class is C_c. Usually L(c, c) = 0 for all c. The expected loss⁵ (or risk) of taking decision c′ given x is R_L(c′|x) = Σ_c L(c, c′) p(C_c|x) and the optimal decision c′ is the one that minimizes R_L(c′|x). One common choice of loss function is the zero-one loss, where a penalty of one unit is paid for an incorrect classification, and 0 for a correct one. In this case the optimal decision rule is to choose the class C_c that maximizes⁶ p(C_c|x), as this minimizes the expected error at x. However, the zero-one loss is not always appropriate. A classic example of this is the difference in loss of failing to spot a disease when carrying out a medical test compared to the cost of a false positive on the test, so that L(c, c′) ≠ L(c′, c).
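A minimal sketch (an assumed toy example, not from the book) of this decision rule: with an asymmetric loss matrix the risk-minimizing decision can differ from the most probable class.

```python
# Choosing the decision that minimizes the expected loss
# R_L(c'|x) = sum_c L(c, c') p(C_c|x).
import numpy as np

# Hypothetical asymmetric loss: rows = true class, columns = decision.
# Missing a disease (true 1, decide 0) costs 10; a false positive costs 1.
L = np.array([[0.0, 1.0],
              [10.0, 0.0]])

p_class = np.array([0.7, 0.3])   # predictive probabilities p(C_c|x)
risk = p_class @ L               # expected loss of each decision
print(risk)                      # [3.0, 0.7]
print(int(np.argmin(risk)))      # decide class 1, although class 0 is more probable
```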
The optimal classifier using zero-one loss is known as the Bayes classifier. By this construction the feature space is divided into decision regions R_1, …, R_C such that a pattern falling in decision region R_c is assigned to class C_c. (There can be more than one decision region corresponding to a single class.) The boundaries between the decision regions are known as decision surfaces or decision boundaries.
One would expect misclassification errors to occur in regions where the maximum class probability max_j p(C_j|x) is relatively low. This could be due to either a region of strong overlap between classes, or lack of training examples within this region. Thus one sensible strategy is to add a reject option so that if max_j p(C_j|x) ≥ θ for a threshold θ in [0, 1] then we go ahead and classify the pattern, otherwise we reject it and leave the classification task to a more sophisticated system. For multi-class classification we could alternatively require the gap between the most probable and the second most probable class to exceed θ, and otherwise reject. As θ is varied from 0 to 1 one obtains an error-reject curve, plotting the percentage of patterns classified incorrectly against the percentage rejected. Typically the error rate will fall as the rejection rate increases. Hansen et al. [1997] provide an analysis of the error-reject trade-off.
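A minimal sketch (assumed setup, not from the book) of the reject option: classify only when the largest predictive probability clears the threshold θ.

```python
# Reject option: return the predicted class index, or -1 to signal "reject".
import numpy as np

def classify_with_reject(probs, theta):
    """probs: (n, C) array of predictive probabilities p(C_j|x)."""
    best = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    return np.where(conf >= theta, best, -1)

probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.20, 0.80]])
print(classify_with_reject(probs, theta=0.6))   # -> [0, -1, 1]
```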
We have focused above on the probabilistic approach to classification, which involves a two-stage approach of first computing a posterior distribution over functions and then combining this with the loss function to produce a decision. However, it is worth noting that some authors argue that if our goal is to eventually make a decision then we should aim to approximate the classification function that minimizes the risk (expected loss), which is defined as
$$R_L(c) = \int L\big(y, c(\mathbf{x})\big)\, p(y, \mathbf{x})\, dy\, d\mathbf{x}, \qquad (3.3)$$
where p(y, x) is the joint distribution of inputs and targets and c(x) is a classification function that assigns an input pattern x to one of C classes (see e.g. Vapnik [1995, ch. 1]). As p(y, x) is unknown, in this approach one often then seeks to minimize an objective function which includes the empirical risk Σ_{i=1}^{n} L(y_i, c(x_i)) as well as a regularization term. While this is a reasonable
5In Economics one usually talks of maximizing expected utility rather than minimizing expected loss; loss is negative utility. This suggests that statisticians are pessimists while economists are optimists.
6If more than one class has equal posterior probability then ties can be broken arbitrarily.

method, we note that the probabilistic approach allows the same inference stage to be reused with different loss functions, it can help us to incorporate prior knowledge on the function and/or noise model, and has the advantage of giving probabilistic predictions which can be helpful, e.g., for the reject option.
3.2 Linear Models for Classification
In this section we briefly review linear models for binary classification, which form the foundation of Gaussian process classification models in the next section. We follow the SVM literature and use the labels y = +1 and y = −1 to distinguish the two classes, although for the multi-class case in section 3.5 we use 0/1 labels. The likelihood is
$$p(y = +1\,|\,\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{x}^\top\mathbf{w}), \qquad (3.4)$$
given the weight vector w, where σ(z) can be any sigmoid⁷ function. When using the logistic, σ(z) = λ(z) from eq. (3.2), the model is usually called simply logistic regression, but to emphasize the parallels to linear regression we prefer the term linear logistic regression. When using the cumulative Gaussian, σ(z) = Φ(z), we call the model linear probit regression.
As the probability of the two classes must sum to 1, we have p(y = −1|x, w) = 1 − p(y = +1|x, w). Thus for a data point (x_i, y_i) the likelihood is given by σ(x_i⊤w) if y_i = +1, and 1 − σ(x_i⊤w) if y_i = −1. For symmetric likelihood functions, such as the logistic or probit where σ(−z) = 1 − σ(z), this can be written more concisely as
$$p(y_i\,|\,\mathbf{x}_i, \mathbf{w}) = \sigma(y_i f_i), \qquad (3.5)$$
where f_i ≜ f(x_i) = x_i⊤w. Defining the logit transformation as logit(x) = log[p(y = +1|x)/p(y = −1|x)] we see that the logistic regression model can be written as logit(x) = x⊤w. The logit(x) function is also called the log odds ratio. Generalized linear modelling [McCullagh and Nelder, 1983] deals with the issue of extending linear models to non-Gaussian data scenarios; the logit transformation is the canonical link function for binary data and this choice simplifies the algebra and algorithms.
Given a dataset D = {(x_i, y_i) | i = 1, …, n}, we assume that the labels are generated independently, conditional on f(x). Using the same Gaussian prior w ∼ N(0, Σ_p) as for regression in eq. (2.4) we then obtain the unnormalized log posterior

$$\log p(\mathbf{w}|X, \mathbf{y}) \;\overset{c}{=}\; -\tfrac{1}{2}\mathbf{w}^\top\Sigma_p^{-1}\mathbf{w} + \sum_{i=1}^{n}\log\sigma(y_i f_i). \qquad (3.6)$$

In the linear regression case with Gaussian noise the posterior was Gaussian with mean and covariance as given in eq. (2.8). For classification the posterior

7A sigmoid function is a monotonically increasing function mapping from R to [0, 1]. It derives its name from being shaped like a letter S.

does not have a simple analytic form. However, it is easy to show that for some sigmoid functions, such as the logistic and cumulative Gaussian, the log likelihood is a concave function of w for fixed D. As the quadratic penalty on w is also concave then the log posterior is a concave function, which means that it is relatively easy to find its unique maximum. The concavity can also be derived from the fact that the Hessian of log p(w|X, y) is negative definite (see section A.9 for further details). The standard algorithm for finding the maximum is Newton's method, which in this context is usually called the iteratively reweighted least squares (IRLS) algorithm, as described e.g. in McCullagh and Nelder [1983]. However, note that Minka [2003] provides evidence that other optimization methods (e.g. conjugate gradient ascent) may be faster than IRLS.
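As an illustration, here is a minimal sketch (not the book's code) of Newton's method (IRLS) for finding the MAP weights of penalized linear logistic regression, i.e. maximizing eq. (3.6); the helper name and the fixed iteration count are assumptions.

```python
# MAP weights for linear logistic regression by Newton's method (IRLS).
import numpy as np

def map_logistic(X, y, Sigma_p, n_iter=20):
    """X: (n, d) inputs, y: (n,) labels in {-1, +1}, Sigma_p: (d, d) prior cov."""
    n, d = X.shape
    Sp_inv = np.linalg.inv(Sigma_p)
    w = np.zeros(d)
    t = (y + 1) / 2.0                       # targets in {0, 1}
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ w))   # p(y=+1 | x_i, w)
        grad = X.T @ (t - pi) - Sp_inv @ w  # gradient of the log posterior
        W = pi * (1.0 - pi)                 # negative Hessian of the log likelihood
        H = X.T @ (W[:, None] * X) + Sp_inv
        w = w + np.linalg.solve(H, grad)    # Newton update
    return w

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X @ np.array([1.0, -1.0]) + 0.3 * rng.normal(size=100) > 0, 1, -1)
print(map_logistic(X, y, np.eye(2)))
```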
Notice that a maximum likelihood treatment (corresponding to an unpenalized version of eq. (3.6)) may result in some undesirable outcomes. If the dataset is linearly separable (i.e. if there exists a hyperplane which separates the positive and negative examples) then maximizing the (unpenalized) likelihood will cause |w| to tend to infinity. However, this will still give predictions in [0, 1] for p(y = +1|x, w), although these predictions will be "hard" (i.e. zero or one). If the problem is ill-conditioned, e.g. due to duplicate (or linearly dependent) input dimensions, there will be no unique solution.
As an example, consider linear logistic regression in the case where x-space is two dimensional and there is no bias weight so that w is also two-dimensional. The prior in weight space is Gaussian and for simplicity we have set Σ_p = I. Contours of the prior p(w) are illustrated in Figure 3.1(a). If we have a data set D as shown in Figure 3.1(b) then this induces a posterior distribution in weight space as shown in Figure 3.1(c). Notice that the posterior is non-Gaussian and unimodal, as expected. The dataset is not linearly separable but a weight vector in the direction (1, 1) is clearly a reasonable choice, as the posterior distribution shows. To make predictions based on the training set D for a test point x∗ we have

$$p(y_* = +1\,|\,\mathbf{x}_*, D) = \int p(y_* = +1\,|\,\mathbf{w}, \mathbf{x}_*)\, p(\mathbf{w}|D)\, d\mathbf{w}, \qquad (3.7)$$
integrating the prediction p(y∗ = +1|w, x∗) = σ(x∗⊤w) over the posterior distribution of weights. This leads to contours of the predictive distribution as shown in Figure 3.1(d). Notice how the contours are bent, reflecting the integration of many different but plausible w's.
In the multi-class case we use the multiple logistic (or softmax) function

$$p(y = \mathcal{C}_c\,|\,\mathbf{x}, W) = \frac{\exp(\mathbf{x}^\top\mathbf{w}_c)}{\sum_{c'}\exp(\mathbf{x}^\top\mathbf{w}_{c'})}, \qquad (3.8)$$

where w_c is the weight vector for class c, and all weight vectors are collected into the matrix W. The corresponding log likelihood is of the form $\sum_{i=1}^{n}\sum_{c=1}^{C}\delta_{c, y_i}\big[\mathbf{x}_i^\top\mathbf{w}_c - \log\sum_{c'}\exp(\mathbf{x}_i^\top\mathbf{w}_{c'})\big]$. As in the binary case the log likelihood is a concave function of W.
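A minimal sketch (assumed setup) of a numerically stable evaluation of this softmax log likelihood for a given weight matrix W:

```python
# Softmax (multiple logistic) log likelihood, eq. (3.8), evaluated stably.
import numpy as np
from scipy.special import logsumexp

def softmax_loglik(X, y, W):
    """X: (n, d) inputs, y: (n,) integer class labels in {0,...,C-1}, W: (d, C)."""
    F = X @ W                                           # latent values x_i^T w_c
    return np.sum(F[np.arange(len(y)), y] - logsumexp(F, axis=1))

rng = np.random.default_rng(1)
X, y = rng.normal(size=(5, 3)), rng.integers(0, 4, size=5)
print(softmax_loglik(X, y, rng.normal(size=(3, 4))))
```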
It is interesting to note that in a generative approach where the class-conditional distributions p(x|y) are Gaussian with the same covariance matrix,

Figure 3.1: Linear logistic regression: Panel (a) shows contours of the prior distribution p(w) = N(0, I). Panel (b) shows the dataset, with circles indicating class +1 and crosses denoting class −1. Panel (c) shows contours of the posterior distribution p(w|D). Panel (d) shows contours of the predictive distribution p(y∗ = +1|x∗).
p(y|x) has the form given by eq. (3.4) and eq. (3.8) for the two- and multi-class cases respectively (when the constant function 1 is included in x).
3.3 Gaussian Process Classification
For binary classification the basic idea behind Gaussian process prediction is very simple: we place a GP prior over the latent function f(x) and then "squash" this through the logistic function to obtain a prior on π(x) ≜ p(y = +1|x) = σ(f(x)). Note that π is a deterministic function of f, and since f is stochastic, so is π. This construction is illustrated in Figure 3.2 for a one-dimensional input space. It is a natural generalization of the linear logistic

Figure 3.2: Panel (a) shows a sample latent function f(x) drawn from a Gaussian process as a function of x. Panel (b) shows the result of squashing this sample function through the logistic (logit) function, λ(z) = (1 + exp(−z))⁻¹, to obtain the class probability π(x) = λ(f(x)).
regression model and parallels the development from linear regression to GP regression that we explored in section 2.1. Specifically, we replace the linear f(x) function from the linear logistic model in eq. (3.6) by a Gaussian process, and correspondingly the Gaussian prior on the weights by a GP prior.
The latent function f plays the role of a nuisance function: we do not observe values of f itself (we observe only the inputs X and the class labels y) and we are not particularly interested in the values of f, but rather in π, in particular for test cases π(x∗). The purpose of f is solely to allow a convenient formulation of the model, and the computational goal pursued in the coming sections will be to remove (integrate out) f.
We have tacitly assumed that the latent Gaussian process is noise-free, and combined it with smooth likelihood functions, such as the logistic or probit. However, one can equivalently think of adding independent noise to the latent process in combination with a step-function likelihood. In particular, assuming Gaussian noise and a step-function likelihood is exactly equivalent to a noise-free⁸ latent process and probit likelihood, see exercise 3.10.1.
Inference is naturally divided into two steps: first computing the distribution of the latent variable corresponding to a test case

$$p(f_*|X, \mathbf{y}, \mathbf{x}_*) = \int p(f_*|X, \mathbf{x}_*, \mathbf{f})\, p(\mathbf{f}|X, \mathbf{y})\, d\mathbf{f}, \qquad (3.9)$$
where p(f|X, y) = p(y|f)p(f|X)/p(y|X) is the posterior over the latent variables, and subsequently using this distribution over the latent f∗ to produce a probabilistic prediction

$$\bar{\pi}_* \;\triangleq\; p(y_* = +1\,|\,X, \mathbf{y}, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_*|X, \mathbf{y}, \mathbf{x}_*)\, df_*. \qquad (3.10)$$
8This equivalence explains why no numerical problems arise from considering a noisefree process if care is taken with the implementation, see also comment at the end of section 3.4.3.

In the regression case with Gaussian likelihood computation of predictions was straightforward as the relevant integrals were Gaussian and could be computed analytically. In classification the non-Gaussian likelihood in eq. (3.9) makes the integral analytically intractable. Similarly, eq. (3.10) can be intractable analytically for certain sigmoid functions, although in the binary case it is only a one-dimensional integral so simple numerical techniques are generally adequate.
Thus we need to use either analytic approximations of integrals, or solutions based on Monte Carlo sampling. In the coming sections, we describe two analytic approximations which both approximate the non-Gaussian joint posterior with a Gaussian one: the first is the straightforward Laplace approximation method [Williams and Barber, 1998], and the second is the more sophisticated expectation propagation (EP) method due to Minka [2001]. The cavity TAP approximation of Opper and Winther [2000] is closely related to the EP method. A number of other approximations have also been suggested, see e.g. Gibbs and MacKay [2000], Jaakkola and Haussler [1999], and Seeger [2000]. Neal [1999] describes the use of Markov chain Monte Carlo (MCMC) approximations. All of these methods will typically scale as O(n³); for large datasets there has been much work on further approximations to reduce computation time, as discussed in chapter 8.
The Laplace approximation for the binary case is described in section 3.4, and for the multi-class case in section 3.5. The EP method for binary classification is described in section 3.6. Relationships between Gaussian process classifiers and other techniques such as spline classifiers, support vector machines and least-squares classification are discussed in sections 6.3, 6.4 and 6.5 respectively.
3.4 The Laplace Approximation for the Binary GP Classifier
Laplace's method utilizes a Gaussian approximation q(f|X, y) to the posterior p(f|X, y) in the integral (3.9). Doing a second order Taylor expansion of log p(f|X, y) around the maximum of the posterior, we obtain a Gaussian approximation

$$q(\mathbf{f}|X, \mathbf{y}) = N(\mathbf{f}\,|\,\hat{\mathbf{f}}, A^{-1}) \;\propto\; \exp\!\big(-\tfrac{1}{2}(\mathbf{f}-\hat{\mathbf{f}})^\top A\,(\mathbf{f}-\hat{\mathbf{f}})\big), \qquad (3.11)$$

where $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} p(\mathbf{f}|X, \mathbf{y})$ and $A = -\nabla\nabla\log p(\mathbf{f}|X, \mathbf{y})\big|_{\mathbf{f}=\hat{\mathbf{f}}}$ is the Hessian of the negative log posterior at that point.
The structure of the rest of this section is as follows: In section 3.4.1 we describe how to find f̂ and A. Section 3.4.2 explains how to make predictions having obtained q(f|y), and section 3.4.3 gives more implementation details for the Laplace GP classifier. The Laplace approximation for the marginal likelihood is described in section 3.4.4.

Figure 3.3: The log likelihood and its derivatives for a single case as a function of z_i = y_i f_i, for (a) the logistic, and (b) the cumulative Gaussian likelihood. The two likelihood functions are fairly similar, the main qualitative difference being that for large negative arguments the log logistic behaves linearly whereas the log cumulative Gaussian has a quadratic penalty. Both likelihoods are log concave.
3.4.1 Posterior
By Bayes' rule the posterior over the latent variables is given by p(f|X, y) = p(y|f)p(f|X)/p(y|X), but as p(y|X) is independent of f, we need only consider the unnormalized posterior when maximizing w.r.t. f. Taking the logarithm and introducing expression eq. (2.29) for the GP prior gives

$$\Psi(\mathbf{f}) \;\triangleq\; \log p(\mathbf{y}|\mathbf{f}) + \log p(\mathbf{f}|X)
= \log p(\mathbf{y}|\mathbf{f}) - \tfrac{1}{2}\mathbf{f}^\top K^{-1}\mathbf{f} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi. \qquad (3.12)$$

Differentiating eq. (3.12) w.r.t. f we obtain

$$\nabla\Psi(\mathbf{f}) = \nabla\log p(\mathbf{y}|\mathbf{f}) - K^{-1}\mathbf{f}, \qquad (3.13)$$

$$\nabla\nabla\Psi(\mathbf{f}) = \nabla\nabla\log p(\mathbf{y}|\mathbf{f}) - K^{-1} = -W - K^{-1}, \qquad (3.14)$$

where W ≜ −∇∇ log p(y|f) is diagonal, since the likelihood factorizes over cases (the distribution for y_i depends only on f_i, not on f_{j≠i}). Note that if the likelihood p(y|f) is log concave, the diagonal elements of W are non-negative, and the Hessian in eq. (3.14) is negative definite, so that Ψ(f) is concave and has a unique maximum (see section A.9 for further details).

There are many possible functional forms of the likelihood, which gives the target class probability as a function of the latent variable f. Two commonly used likelihood functions are the logistic, and the cumulative Gaussian, see Figure 3.3. The expressions for the log likelihood for these likelihood functions and their first and second derivatives w.r.t. the latent variable are given in the

following table:

For the logistic likelihood:
$$\log p(y_i|f_i) = -\log\big(1 + \exp(-y_i f_i)\big), \quad
\frac{\partial}{\partial f_i}\log p(y_i|f_i) = t_i - \pi_i, \quad
\frac{\partial^2}{\partial f_i^2}\log p(y_i|f_i) = -\pi_i(1 - \pi_i), \qquad (3.15)$$

and for the cumulative Gaussian likelihood:
$$\log p(y_i|f_i) = \log\Phi(y_i f_i), \quad
\frac{\partial}{\partial f_i}\log p(y_i|f_i) = \frac{y_i\, N(f_i)}{\Phi(y_i f_i)}, \quad
\frac{\partial^2}{\partial f_i^2}\log p(y_i|f_i) = -\frac{N(f_i)^2}{\Phi(y_i f_i)^2} - \frac{y_i f_i\, N(f_i)}{\Phi(y_i f_i)}, \qquad (3.16)$$

where we have defined π_i = p(y_i = 1|f_i) and t = (y + 1)/2. At the maximum of Ψ(f) we have

$$\nabla\Psi = \mathbf{0} \;\Longrightarrow\; \hat{\mathbf{f}} = K\big(\nabla\log p(\mathbf{y}|\hat{\mathbf{f}})\big), \qquad (3.17)$$

as a self-consistent equation for f̂ (but since ∇log p(y|f̂) is a non-linear function of f̂, eq. (3.17) cannot be solved directly). To find the maximum of Ψ we use Newton's method, with the iteration
$$\mathbf{f}^{\text{new}} = \mathbf{f} - (\nabla\nabla\Psi)^{-1}\nabla\Psi
= \mathbf{f} + (K^{-1} + W)^{-1}\big(\nabla\log p(\mathbf{y}|\mathbf{f}) - K^{-1}\mathbf{f}\big)
= (K^{-1} + W)^{-1}\big(W\mathbf{f} + \nabla\log p(\mathbf{y}|\mathbf{f})\big). \qquad (3.18)$$
To gain more intuition about this update, let us consider what happens to datapoints that are well-explained under f so that ∂ log p(y_i|f_i)/∂f_i and W_ii are close to zero for these points. As an approximation, break f into two subvectors, f₁ that corresponds to points that are not well-explained, and f₂ to those that are. Then it is easy to show (see exercise 3.10.4) that
$$\mathbf{f}_1^{\text{new}} = K_{11}\big(I_{11} + W_{11}K_{11}\big)^{-1}\big(W_{11}\mathbf{f}_1 + \nabla\log p(\mathbf{y}_1|\mathbf{f}_1)\big), \qquad
\mathbf{f}_2^{\text{new}} = K_{21}K_{11}^{-1}\mathbf{f}_1^{\text{new}}, \qquad (3.19)$$

where K₂₁ denotes the n₂ × n₁ block of K containing the covariance between the two groups of points, etc. This means that f₁^new is computed by ignoring entirely the well-explained points, and f₂^new is predicted from f₁^new using the usual GP prediction methods (i.e. treating these points like test points). Of course, if the predictions of f₂^new fail to match the targets correctly they would cease to be well-explained and so be updated on the next iteration.

Having found the maximum posterior f̂, we can now specify the Laplace approximation to the posterior as a Gaussian with mean f̂ and covariance matrix given by the negative inverse Hessian of Ψ from eq. (3.14)

$$q(\mathbf{f}|X, \mathbf{y}) = N\big(\hat{\mathbf{f}},\, (K^{-1} + W)^{-1}\big). \qquad (3.20)$$

One problem with the Laplace approximation is that it is essentially uncontrolled, in that the Hessian evaluated at f̂ may give a poor approximation to the true shape of the posterior. The peak could be much broader or narrower than the Hessian indicates, or it could be a skew peak, while the Laplace approximation assumes it has elliptical contours.

3.4.2 Predictions

The posterior mean for f∗ under the Laplace approximation can be expressed by combining the GP predictive mean eq. (2.25) with eq. (3.17) into

$$\mathbb{E}_q[f_*|X, \mathbf{y}, \mathbf{x}_*] = \mathbf{k}(\mathbf{x}_*)^\top K^{-1}\hat{\mathbf{f}} = \mathbf{k}(\mathbf{x}_*)^\top\nabla\log p(\mathbf{y}|\hat{\mathbf{f}}). \qquad (3.21)$$

Compare this with the exact mean, given by Opper and Winther [2000] as

$$\mathbb{E}_p[f_*|X, \mathbf{y}, \mathbf{x}_*] = \int \mathbb{E}[f_*|\mathbf{f}, X, \mathbf{x}_*]\, p(\mathbf{f}|X, \mathbf{y})\, d\mathbf{f}
= \int \mathbf{k}(\mathbf{x}_*)^\top K^{-1}\mathbf{f}\; p(\mathbf{f}|X, \mathbf{y})\, d\mathbf{f}
= \mathbf{k}(\mathbf{x}_*)^\top K^{-1}\,\mathbb{E}[\mathbf{f}|X, \mathbf{y}], \qquad (3.22)$$

where we have used the fact that for a GP E[f∗|f, X, x∗] = k(x∗)⊤K⁻¹f and have let E[f|X, y] denote the posterior mean of f given X and y. Notice the similarity between the middle expression of eq. (3.21) and eq. (3.22), where the exact (intractable) average E[f|X, y] has been replaced with the modal value f̂ = E_q[f|X, y].

A simple observation from eq. (3.21) is that positive training examples will give rise to a positive coefficient for their kernel function (as ∇_i log p(y_i|f̂_i) > 0 in this case), while negative examples will give rise to a negative coefficient; this is analogous to the solution to the support vector machine, see eq. (6.34). Also note that training points which have ∇_i log p(y_i|f̂_i) ≃ 0 (i.e. that are well-explained under f̂) do not contribute strongly to predictions at novel test points; this is similar to the behaviour of non-support vectors in the support vector machine (see section 6.4).

We can also compute V_q[f∗|X, y], the variance of f∗|X, y under the Gaussian approximation. This comprises of two terms, i.e.

$$\mathbb{V}_q[f_*|X, \mathbf{y}, \mathbf{x}_*] = \mathbb{E}_{p(f_*|X, \mathbf{x}_*, \mathbf{f})}\big[(f_* - \mathbb{E}[f_*|X, \mathbf{x}_*, \mathbf{f}])^2\big]
+ \mathbb{E}_{q(\mathbf{f}|X, \mathbf{y})}\big[(\mathbb{E}[f_*|X, \mathbf{x}_*, \mathbf{f}] - \mathbb{E}[f_*|X, \mathbf{y}, \mathbf{x}_*])^2\big]. \qquad (3.23)$$

The first term is due to the variance of f∗ if we condition on a particular value of f, and is given by k(x∗, x∗) − k(x∗)⊤K⁻¹k(x∗), cf. eq. (2.19). The second term in eq. (3.23) is due to the fact that E[f∗|X, x∗, f] = k(x∗)⊤K⁻¹f depends on f and thus there is an additional term of k(x∗)⊤K⁻¹ cov(f|X, y) K⁻¹k(x∗). Under the Gaussian approximation cov(f|X, y) = (K⁻¹ + W)⁻¹, and thus

$$\mathbb{V}_q[f_*|X, \mathbf{y}, \mathbf{x}_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^\top K^{-1}\mathbf{k}(\mathbf{x}_*)
+ \mathbf{k}(\mathbf{x}_*)^\top K^{-1}(K^{-1} + W)^{-1}K^{-1}\mathbf{k}(\mathbf{x}_*)
= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^\top(K + W^{-1})^{-1}\mathbf{k}(\mathbf{x}_*), \qquad (3.24)$$

where the last line is obtained using the matrix inversion lemma eq. (A.9).

Given the mean and variance of f∗, we make predictions by computing

$$\bar{\pi}_* \simeq \mathbb{E}_q[\pi_*|X, \mathbf{y}, \mathbf{x}_*] = \int \sigma(f_*)\, q(f_*|X, \mathbf{y}, \mathbf{x}_*)\, df_*, \qquad (3.25)$$
where q(f∗|X, y, x∗) is Gaussian with mean and variance given by equations (3.21) and (3.24) respectively. Notice that because of the non-linear form of the sigmoid the predictive probability from eq. (3.25) is different from the sigmoid of the expectation of f∗: π̄∗ ≠ σ(E_q[f∗|y]). We will call the latter the MAP prediction to distinguish it from the averaged predictions from eq. (3.25).
In fact, as shown in Bishop [1995, sec. 10.3], the predicted test labels given by choosing the class of highest probability obtained by averaged and MAP predictions are identical for binary⁹ classification. To see this, note that the decision boundary using the MAP value E_q[f∗|X, y, x∗] corresponds to σ(E_q[f∗|X, y, x∗]) = 1/2 or E_q[f∗|X, y, x∗] = 0. The decision boundary of the averaged prediction, E_q[π∗|X, y, x∗] = 1/2, also corresponds to E_q[f∗|X, y, x∗] = 0. This follows from the fact that σ(f∗) − 1/2 is antisymmetric while q(f∗|X, y, x∗) is symmetric.
Thus if we are concerned only about the most probable classification, it is not necessary to compute predictions using eq. (3.25). However, as soon as we also need a confidence in the prediction (e.g. if we are concerned about a reject option) we need E_q[π∗|X, y, x∗]. If σ(z) is the cumulative Gaussian function then eq. (3.25) can be computed analytically, as shown in section 3.9. On the other hand if σ is the logistic function then we need to resort to sampling methods or analytical approximations to compute this one-dimensional integral.

One attractive method is to note that the logistic function λ(z) is the c.d.f. (cumulative density function) corresponding to the p.d.f. (probability density function) p(z) = sech²(z/2)/4; this is known as the logistic or sech-squared distribution, see Johnson et al. [1995, ch. 23]. Then by approximating p(z) as a mixture of Gaussians, one can approximate λ(z) by a linear combination of error functions. This approximation was used by Williams and Barber [1998, app. A] and Wood and Kohn [1998]. Another approximation suggested in MacKay [1992d] is π̄∗ ≈ σ(κ(f∗|y) f̄∗), where κ²(f∗|y) = (1 + π V_q[f∗|X, y, x∗]/8)⁻¹. The effect of the latent predictive variance is, as the approximation suggests, to "soften" the prediction that would be obtained using the MAP prediction σ(f̄∗), i.e. to move it towards 1/2.
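A minimal sketch (assumed helper names) of two ways to evaluate the averaged prediction of eq. (3.25) given the Gaussian latent mean and variance: the analytic result for the cumulative Gaussian, and MacKay's κ-approximation (with a brute-force quadrature check) for the logistic.

```python
import numpy as np
from scipy.stats import norm

def probit_average(mean, var):
    # exact for sigma = Phi: int Phi(f) N(f|mean, var) df = Phi(mean / sqrt(1 + var))
    return norm.cdf(mean / np.sqrt(1.0 + var))

def logistic_average_mackay(mean, var):
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mean))

def logistic_average_quadrature(mean, var, n=2000):
    # brute-force one-dimensional quadrature for comparison
    s = np.sqrt(var)
    f = np.linspace(mean - 8*s, mean + 8*s, n)
    w = f[1] - f[0]
    return np.sum(1.0 / (1.0 + np.exp(-f)) * norm.pdf(f, mean, s)) * w

print(probit_average(1.0, 4.0))
print(logistic_average_mackay(1.0, 4.0), logistic_average_quadrature(1.0, 4.0))
```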
3.4.3 Implementation
We give implementations for finding the Laplace approximation in Algorithm 3.1 and for making predictions in Algorithm 3.2. Care is taken to avoid numerically unstable computations while minimizing the computational effort; both can be achieved simultaneously. It turns out that several of the desired terms can be expressed in terms of the symmetric positive definite matrix

$$B = I + W^{\frac{1}{2}} K W^{\frac{1}{2}}, \qquad (3.26)$$

computation of which costs only O(n²), since W is diagonal. The B matrix has eigenvalues bounded below by 1 and bounded above by 1 + n max_ij(K_ij)/4, so for many covariance functions B is guaranteed to be well-conditioned, and it is
9For multiclass predictions discussed in section 3.5 the situation is more complicated.

input: K (covariance matrix), y (±1 targets), p(y|f) (likelihood function)
 2: f := 0                                              initialization
    repeat                                              Newton iteration
 4:   W := −∇∇ log p(y|f)                               evaluate W, e.g. using eq. (3.15) or (3.16)
      L := cholesky(I + W^(1/2) K W^(1/2))              B = I + W^(1/2) K W^(1/2)
 6:   b := W f + ∇ log p(y|f)
      a := b − W^(1/2) L⊤\(L\(W^(1/2) K b))             eq. (3.18) using eq. (3.27)
 8:   f := K a
    until convergence                                   objective: −½ a⊤f + log p(y|f)
10: log q(y|X, θ) := −½ a⊤f + log p(y|f) − Σ_i log L_ii      eq. (3.32)
return: f̂ := f (post. mode), log q(y|X, θ) (approx. log marginal likelihood)
Algorithm 3.1: Mode-finding for binary Laplace GPC. Commonly used convergence criteria depend on the difference in successive values of the objective function Ψ(f) from eq. (3.12), the magnitude of the gradient vector ∇Ψ(f) from eq. (3.13) and/or the magnitude of the difference in successive values of f. In a practical implementation one needs to secure against divergence by checking that each iteration leads to an increase in the objective (and trying a smaller step size if not). The computational complexity is dominated by the Cholesky decomposition in line 5 which takes n³/6 operations (times the number of Newton iterations); all other operations are at most quadratic in n.
thus numerically safe to compute its Cholesky decomposition LL⊤ = B, which is useful in computing terms involving B⁻¹ and |B|.
The mode-finding procedure uses the Newton iteration given in eq. (3.18), involving the matrix (K⁻¹ + W)⁻¹. Using the matrix inversion lemma eq. (A.9) we get

$$(K^{-1} + W)^{-1} = K - K W^{\frac{1}{2}} B^{-1} W^{\frac{1}{2}} K, \qquad (3.27)$$

where B is given in eq. (3.26). The advantage is that whereas K may have eigenvalues arbitrarily close to zero (and thus be numerically unstable to invert), we can safely work with B. In addition, Algorithm 3.1 keeps the vector a = K⁻¹f in addition to f, as this allows evaluation of the part of the objective Ψ(f) in eq. (3.12) which depends on f without explicit reference to K⁻¹ (again to avoid possible numerical problems).
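A minimal sketch (a hedged translation of Algorithm 3.1 for the logistic likelihood, not the authors' code); the fixed iteration count stands in for a proper convergence test.

```python
# Newton mode-finding for binary Laplace GPC (logistic likelihood).
import numpy as np
from scipy.linalg import cholesky, cho_solve

def laplace_mode(K, y, n_iter=30):
    """K: (n, n) covariance matrix, y: (n,) targets in {-1, +1}."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))                     # sigma(f_i)
        grad = (y + 1) / 2.0 - pi                         # grad log p(y|f), eq. (3.15)
        W = pi * (1.0 - pi)                               # -hess log p(y|f)
        sW = np.sqrt(W)
        B = np.eye(n) + sW[:, None] * K * sW[None, :]     # eq. (3.26)
        L = cholesky(B, lower=True)
        b = W * f + grad                                  # eq. (3.18)
        a = b - sW * cho_solve((L, True), sW * (K @ b))   # via eq. (3.27)
        f = K @ a
    loglik = -np.sum(np.log(1.0 + np.exp(-y * f)))
    log_q = -0.5 * a @ f + loglik - np.sum(np.log(np.diag(L)))   # eq. (3.32)
    return f, log_q
```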
Similarly, for the computation of the predictive variance V_q[f∗|y] from eq. (3.24) we need to evaluate a quadratic form involving the matrix (K + W⁻¹)⁻¹. Rewriting this as

$$(K + W^{-1})^{-1} = W^{\frac{1}{2}} W^{-\frac{1}{2}}(K + W^{-1})^{-1} W^{-\frac{1}{2}} W^{\frac{1}{2}}
= W^{\frac{1}{2}}\big(W^{\frac{1}{2}} K W^{\frac{1}{2}} + I\big)^{-1} W^{\frac{1}{2}}
= W^{\frac{1}{2}} B^{-1} W^{\frac{1}{2}} \qquad (3.28)$$

achieves numerical stability (as opposed to inverting W itself, which may have arbitrarily small eigenvalues). Thus the predictive variance from eq. (3.24) can be computed as

$$\mathbb{V}_q[f_*|\mathbf{y}] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^\top W^{\frac{1}{2}}(LL^\top)^{-1} W^{\frac{1}{2}}\mathbf{k}(\mathbf{x}_*)
= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{v}^\top\mathbf{v}, \quad \text{where}\; \mathbf{v} = L\backslash\big(W^{\frac{1}{2}}\mathbf{k}(\mathbf{x}_*)\big), \qquad (3.29)$$

which was also used by Seeger [2003, p. 27].

input: f̂ (mode), X (inputs), y (±1 targets), k (covariance function), p(y|f) (likelihood function), x∗ (test input)
 2: W := −∇∇ log p(y|f̂)
    L := cholesky(I + W^(1/2) K W^(1/2))                B = I + W^(1/2) K W^(1/2)
 4: f̄∗ := k(x∗)⊤ ∇ log p(y|f̂)                           eq. (3.21)
    v := L\(W^(1/2) k(x∗))                              eq. (3.24) using eq. (3.29)
 6: V[f∗] := k(x∗, x∗) − v⊤v
    π̄∗ := ∫ σ(z) N(z|f̄∗, V[f∗]) dz                       eq. (3.25)
 8: return: π̄∗ (predictive class probability for class 1)
Algorithm 3.2: Predictions for binary Laplace GPC. The posterior mode f̂ (which can be computed using Algorithm 3.1) is input. For multiple test inputs lines 4–7 are applied to each test input. Computational complexity is n³/6 operations once (line 3) plus n² operations per test case (line 5). The one-dimensional integral in line 7 can be done analytically for cumulative Gaussian likelihood, otherwise it is computed using an approximation or numerical quadrature.
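Continuing the sketch of the mode-finding code above (hypothetical helper names, logistic likelihood), a minimal implementation of Algorithm 3.2 that uses MacKay's approximation for the final one-dimensional integral:

```python
# Laplace predictions for a single test point, following Algorithm 3.2.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def laplace_predict(K, k_star, k_ss, y, f_hat):
    """K: train covariance, k_star: (n,) cov(train, test), k_ss: k(x*, x*),
    f_hat: posterior mode from the mode-finding sketch above."""
    pi = 1.0 / (1.0 + np.exp(-f_hat))
    grad = (y + 1) / 2.0 - pi                        # grad log p(y|f) at the mode
    W = pi * (1.0 - pi)
    sW = np.sqrt(W)
    n = len(y)
    L = cholesky(np.eye(n) + sW[:, None] * K * sW[None, :], lower=True)
    f_mean = k_star @ grad                           # eq. (3.21)
    v = solve_triangular(L, sW * k_star, lower=True)
    f_var = k_ss - v @ v                             # eq. (3.29)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * f_var / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * f_mean))     # approximates eq. (3.25)
```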
In practice we compute the Cholesky decomposition LL⊤ = B during the Newton steps in Algorithm 3.1, which can be reused to compute the predictive variance by doing backsubstitution with L as discussed above. In addition, L may again be reused to compute |I_n + W^(1/2)KW^(1/2)| = |B| (needed for the computation of the marginal likelihood eq. (3.32)) as log|B| = 2 Σ_i log L_ii. To save computation, one could use an incomplete Cholesky factorization in the Newton steps, as suggested by Fine and Scheinberg [2002].
Sometimes it is suggested that it can be useful to replace K by K + εI where ε is a small constant, to improve the numerical conditioning¹⁰ of K. However, by taking care with the implementation details as above this should not be necessary.
3.4.4 Marginal Likelihood
It will also be useful (particularly for chapter 5) to compute the Laplace approximation of the marginal likelihood p(y|X). (For the regression case with Gaussian noise the marginal likelihood can again be calculated analytically, see eq. (2.30).) We have

$$p(\mathbf{y}|X) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|X)\, d\mathbf{f} = \int \exp\big(\Psi(\mathbf{f})\big)\, d\mathbf{f}. \qquad (3.30)$$

Using a Taylor expansion of Ψ(f) locally around f̂ we obtain Ψ(f) ≃ Ψ(f̂) − ½(f − f̂)⊤A(f − f̂) and thus an approximation q(y|X) to the marginal likelihood as

$$p(\mathbf{y}|X) \simeq q(\mathbf{y}|X) = \exp\big(\Psi(\hat{\mathbf{f}})\big)\int \exp\!\big(-\tfrac{1}{2}(\mathbf{f}-\hat{\mathbf{f}})^\top A(\mathbf{f}-\hat{\mathbf{f}})\big)\, d\mathbf{f}. \qquad (3.31)$$
10Neal 1999 refers to this as adding jitter in the context of Markov chain Monte Carlo MCMC based inference; in his work the latent variables f are explicitly represented in the Markov chain which makes addition of jitter difficult to avoid. Within the analytical approximations of the distribution of f considered here, jitter is unnecessary.

This Gaussian integral can be evaluated analytically to obtain an approximation to the log marginal likelihood
$$\log q(\mathbf{y}|X, \boldsymbol{\theta}) = -\tfrac{1}{2}\hat{\mathbf{f}}^\top K^{-1}\hat{\mathbf{f}} + \log p(\mathbf{y}|\hat{\mathbf{f}}) - \tfrac{1}{2}\log|B|, \qquad (3.32)$$

where $|B| = |K|\cdot|K^{-1} + W| = |I_n + W^{\frac{1}{2}}KW^{\frac{1}{2}}|$, and θ is a vector of hyperparameters of the covariance function (which have previously been suppressed from the notation for brevity).
3.5 Multiclass Laplace Approximation
Our presentation follows Williams and Barber [1998]. We first introduce the vector of latent function values at all n training points and for all C classes

$$\mathbf{f} = \big(f_1^1, \ldots, f_n^1,\; f_1^2, \ldots, f_n^2,\; \ldots,\; f_1^C, \ldots, f_n^C\big)^\top. \qquad (3.33)$$

Thus f has length Cn. In the following we will generally refer to quantities pertaining to a particular class with superscript c, and to a particular case by subscript i (as usual); thus e.g. the vector of C latents for a particular case is f_i. However, as an exception, vectors or matrices formed from the covariance function for class c will have a subscript c. The prior over f has the form f ∼ N(0, K). As we have assumed that the C latent processes are uncorrelated, the covariance matrix K is block diagonal in the matrices K₁, …, K_C. Each individual matrix K_c expresses the correlations of the latent function values within the class c. Note that the covariance functions pertaining to the different classes can be different. Let y be a vector of the same length as f which for each i = 1, …, n has an entry of 1 for the class which is the label for example i and 0 for the other C − 1 entries.
Let π_i^c denote the output of the softmax at training point i, i.e.

$$p(y_i^c|\mathbf{f}_i) = \pi_i^c = \frac{\exp(f_i^c)}{\sum_{c'}\exp(f_i^{c'})}. \qquad (3.34)$$

Then π is a vector of the same length as f with entries π_i^c. The multi-class analogue of eq. (3.12) is the log of the unnormalized posterior

$$\Psi(\mathbf{f}) \;\triangleq\; -\tfrac{1}{2}\mathbf{f}^\top K^{-1}\mathbf{f} + \mathbf{y}^\top\mathbf{f} - \sum_{i=1}^{n}\log\Big(\sum_{c=1}^{C}\exp f_i^c\Big) - \tfrac{1}{2}\log|K| - \tfrac{Cn}{2}\log 2\pi. \qquad (3.35)$$
As in the binary case we seek the MAP value f̂ of p(f|X, y). By differentiating eq. (3.35) w.r.t. f we obtain

$$\nabla\Psi = -K^{-1}\mathbf{f} + \mathbf{y} - \boldsymbol{\pi}. \qquad (3.36)$$

Thus at the maximum we have $\hat{\mathbf{f}} = K(\mathbf{y} - \hat{\boldsymbol{\pi}})$. Differentiating again, and using

$$\frac{\partial^2}{\partial f_i^c\,\partial f_i^{c'}}\log\sum_j \exp(f_i^j) = \pi_i^c\,\delta_{cc'} - \pi_i^c\,\pi_i^{c'}, \qquad (3.37)$$
we obtain¹¹

$$\nabla\nabla\Psi = -K^{-1} - W, \quad \text{where}\;\; W \triangleq \mathrm{diag}(\boldsymbol{\pi}) - \Pi\Pi^\top, \qquad (3.38)$$

where Π is a Cn × n matrix obtained by stacking vertically the diagonal matrices diag(π^c), and π^c is the subvector of π pertaining to class c. As in the binary case notice that −∇∇Ψ is positive definite, thus Ψ(f) is concave and the maximum is unique (see also exercise 3.10.2).
As in the binary case we use Newton's method to search for the mode of Ψ, giving

$$\mathbf{f}^{\text{new}} = (K^{-1} + W)^{-1}\big(W\mathbf{f} + \mathbf{y} - \boldsymbol{\pi}\big). \qquad (3.39)$$

This update if coded naïvely would take O(C³n³) as matrices of size Cn have to be inverted. However, as described in section 3.5.1, we can utilize the structure of W to bring down the computational load to O(Cn³).
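For illustration, a minimal sketch of one such naive update (an O((Cn)³) implementation of eq. (3.39) written for clarity, not the structured computation of section 3.5.1; the helper name and the one-hot target layout are assumptions):

```python
# One naive Newton update for the multi-class Laplace approximation, eq. (3.39).
import numpy as np

def multiclass_newton_step(K, Y, f):
    """K: (Cn, Cn) block-diagonal prior covariance, Y: (n, C) one-hot targets,
    f: (Cn,) current latent vector ordered class-by-class as in eq. (3.33)."""
    n, C = Y.shape
    F = f.reshape(C, n).T                         # f_i^c as an (n, C) array
    P = np.exp(F - F.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)             # softmax pi_i^c, eq. (3.34)
    pi = P.T.reshape(-1)                          # stacked class-by-class
    y = Y.T.reshape(-1)
    # W = diag(pi) - Pi Pi^T, eq. (3.38), built densely (naive)
    Pi = np.vstack([np.diag(P[:, c]) for c in range(C)])   # (Cn, n)
    W = np.diag(pi) - Pi @ Pi.T
    # f_new = (K^{-1} + W)^{-1} (W f + y - pi)
    A = np.linalg.inv(K) + W
    return np.linalg.solve(A, W @ f + y - pi)
```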
The Laplace approximation gives us a Gaussian approximation q(f|X, y) to the posterior p(f|X, y). To make predictions at a test point x∗ we need to compute the posterior distribution q(f∗|X, y, x∗) where f(x∗) ≜ f∗ = (f∗^1, …, f∗^C)⊤. In general we have

$$q(\mathbf{f}_*|X, \mathbf{y}, \mathbf{x}_*) = \int p(\mathbf{f}_*|X, \mathbf{x}_*, \mathbf{f})\, q(\mathbf{f}|X, \mathbf{y})\, d\mathbf{f}. \qquad (3.40)$$

As p(f∗|X, x∗, f) and q(f|X, y) are both Gaussian, q(f∗|X, y, x∗) will also be Gaussian and we need only compute its mean and covariance. The predictive mean for class c is given by

$$\mathbb{E}_q[f_*^c(\mathbf{x}_*)|X, \mathbf{y}, \mathbf{x}_*] = \mathbf{k}_c(\mathbf{x}_*)^\top K_c^{-1}\hat{\mathbf{f}}^c = \mathbf{k}_c(\mathbf{x}_*)^\top(\mathbf{y}^c - \hat{\boldsymbol{\pi}}^c), \qquad (3.41)$$

where k_c(x∗) is the vector of covariances between the test point and each of the training points for the cth covariance function, and f̂^c is the subvector of f̂ pertaining to class c. The last equality comes from using eq. (3.36) at the maximum f̂. Note the close correspondence to eq. (3.21). This can be put into a vector form $\mathbb{E}_q[\mathbf{f}_*|\mathbf{y}] = Q_*^\top(\mathbf{y} - \hat{\boldsymbol{\pi}})$ by defining the Cn × C matrix

$$Q_* = \begin{pmatrix} \mathbf{k}_1(\mathbf{x}_*) & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{k}_2(\mathbf{x}_*) & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{k}_C(\mathbf{x}_*) \end{pmatrix}. \qquad (3.42)$$

Using a similar argument to eq. (3.23) we obtain

$$\mathrm{cov}_q(\mathbf{f}_*|X, \mathbf{y}, \mathbf{x}_*) = \Sigma + Q_*^\top K^{-1}(K^{-1} + W)^{-1}K^{-1}Q_*
= \mathrm{diag}\big(\mathbf{k}(\mathbf{x}_*, \mathbf{x}_*)\big) - Q_*^\top(K + W^{-1})^{-1}Q_*, \qquad (3.43)$$

where Σ is a diagonal C × C matrix with $\Sigma_{cc} = k_c(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_c(\mathbf{x}_*)^\top K_c^{-1}\mathbf{k}_c(\mathbf{x}_*)$, and k(x∗, x∗) is a vector of covariances, whose cth element is k_c(x∗, x∗).
11There is a sign error in equation 23 of Williams and Barber 1998 but not in their implementation.

input: K (covariance matrix), y (0/1 targets)
 2: f := 0                                                  initialization
    repeat                                                  Newton iteration
 4:   compute π and Π from f with eq. (3.34) and the definition of Π under eq. (3.38)
      for c := 1 … C do
 6:     L := cholesky(I_n + D_c^(1/2) K_c D_c^(1/2))
        E_c := D_c^(1/2) L⊤\(L\D_c^(1/2))                    E is block diag., D^(1/2)(I_Cn + D^(1/2) K D^(1/2))⁻¹ D^(1/2)
 8:     z_c := Σ_i log L_ii                                  compute ½ log determinant
      end for
10:   M := cholesky(Σ_c E_c)
      b := (D − ΠΠ⊤)f + y − π                                b = Wf + y − π from eq. (3.39)
12:   c := EKb
      a := b − c + ER(M⊤\(M\(R⊤c)))                          eq. (3.39) using eq. (3.45) and (3.47)
14:   f := Ka
    until convergence                                        objective: −½ a⊤f + y⊤f − Σ_i log(Σ_c exp f_i^c)
16: log q(y|X, θ) := −½ a⊤f + y⊤f − Σ_i log(Σ_c exp f_i^c) − Σ_c z_c     eq. (3.44)
return: f̂ := f (post. mode), log q(y|X, θ) (approx. log marginal likelihood)
Algorithm 3.3: Mode-finding for multi-class Laplace GPC, where D = diag(π), R is a matrix of stacked identity matrices and a subscript c on a block diagonal matrix indicates the n × n submatrix pertaining to class c. The computational complexity is dominated by the Cholesky decompositions in lines 6 and 10 and the forward and backward substitutions in line 7, with total complexity O((C + 1)n³) (times the number of Newton iterations); all other operations are at most O(Cn²) when exploiting diagonal and block diagonal structures. The memory requirement is O(Cn²). For comments on convergence criteria for line 15 and avoiding divergence, refer to the caption of Algorithm 3.1 on page 46.
We now need to consider the predictive distribution q(π∗|y) which is obtained by softmaxing the Gaussian q(f∗|y). In the binary case we saw that the predicted classification could be obtained by thresholding the mean value of the Gaussian. In the multi-class case one does need to take the variability around the mean into account as it can affect the overall classification (see exercise 3.10.3). One simple way (which will be used in Algorithm 3.4) to estimate the mean prediction E_q[π∗|y] is to draw samples from the Gaussian q(f∗|y), softmax them and then average.
The Laplace approximation to the marginal likelihood can be obtained in the same way as for the binary case, yielding

$$\log p(\mathbf{y}|X, \boldsymbol{\theta}) \simeq \log q(\mathbf{y}|X, \boldsymbol{\theta}) = -\tfrac{1}{2}\hat{\mathbf{f}}^\top K^{-1}\hat{\mathbf{f}} + \mathbf{y}^\top\hat{\mathbf{f}} - \sum_{i=1}^{n}\log\Big(\sum_{c=1}^{C}\exp \hat{f}_i^c\Big) - \tfrac{1}{2}\log\big|I_{Cn} + W^{\frac{1}{2}}KW^{\frac{1}{2}}\big|. \qquad (3.44)$$

As for the inversion of K⁻¹ + W, the determinant term can be computed efficiently by exploiting the structure of W, see section 3.5.1.

In this section we have described the Laplace approximation for multi-class classification. However, there has also been some work on EP-type methods for the multi-class case, see Seeger and Jordan [2004].

input: K (covariance matrix), f̂ (posterior mode), x∗ (test input)
 2: compute π̂ and Π from f̂ using eq. (3.34) and the definition of Π under eq. (3.38)
    for c := 1 … C do
 4:   L := cholesky(I_n + D_c^(1/2) K_c D_c^(1/2))
      E_c := D_c^(1/2) L⊤\(L\D_c^(1/2))                      E is block diag., D^(1/2)(I_Cn + D^(1/2) K D^(1/2))⁻¹ D^(1/2)
 6: end for
    M := cholesky(Σ_c E_c)
 8: for c := 1 … C do
      μ∗^c := (y^c − π̂^c)⊤ k_c(x∗)                           latent test mean from eq. (3.41)
10:   b := E_c k_c(x∗)
      c := E_c R(M⊤\(M\(R⊤b)))
12:   for c′ := 1 … C do
        Σ_{cc′} := c⊤ k_{c′}(x∗)                             latent test covariance from eq. (3.43)
14:   end for
      Σ_{cc} := Σ_{cc} + k_c(x∗, x∗) − b⊤ k_c(x∗)
16: end for
    π∗ := 0                                                  initialize Monte Carlo loop to estimate predictive
18: for i := 1 … S do                                        class probabilities using S samples
      f∗ ∼ N(μ∗, Σ∗)                                          sample latent values from joint Gaussian posterior
20:   π∗ := π∗ + exp(f∗^c)/Σ_{c′} exp(f∗^{c′})                accumulate probability, eq. (3.34)
    end for
22: π∗ := π∗/S                                               normalize MC estimate of prediction vector
return: E_q[π(f∗(x∗))|x∗, X, y] := π∗ (predicted class probability vector)
Algorithm 3.4: Predictions for multi-class Laplace GPC, where D = diag(π), R is a matrix of stacked identity matrices and a subscript c on a block diagonal matrix indicates the n × n submatrix pertaining to class c. The computational complexity is dominated by the Cholesky decompositions in lines 4 and 7 with a total complexity O((C + 1)n³); the memory requirement is O(Cn²). For multiple test cases repeat from line 8 for each test case (in practice, for multiple test cases one may reorder the computations in lines 8–16 to avoid referring to all E_c matrices repeatedly).
3.5.1 Implementation
The implementation follows closely the implementation for the binary case detailed in section 3.4.3, with the slight complications that K is now a block diagonal matrix of size Cn × Cn and the W matrix is no longer diagonal, see eq. (3.38). Care has to be taken to exploit the structure of these matrices to reduce the computational burden.
The Newton iteration from eq. (3.39) requires the inversion of K⁻¹ + W, which we first rewrite as

$$(K^{-1} + W)^{-1} = K - K(K + W^{-1})^{-1}K, \qquad (3.45)$$

using the matrix inversion lemma, eq. (A.9). In the following the inversion of the above matrix K + W⁻¹ is our main concern. First, however, we apply the

matrix inversion lemma, eq. (A.9), to the W matrix:¹²

$$W^{-1} = (D - \Pi\Pi^\top)^{-1} = D^{-1} + R\big(I_n - R^\top D R\big)^{-1}R^\top = D^{-1} + R\,O^{-1}R^\top, \qquad (3.46)$$

where D = diag(π), R = D⁻¹Π is a Cn × n matrix of stacked I_n unit matrices, we use the fact that π normalizes over classes: R⊤DR = Σ_c D_c = I_n, and O is the zero matrix. Introducing the above in K + W⁻¹ and applying the matrix inversion lemma, eq. (A.9), again we have

$$(K + W^{-1})^{-1} = \big(K + D^{-1} + R\,O^{-1}R^\top\big)^{-1}
= E - ER\big(O + R^\top E R\big)^{-1}R^\top E = E - ER\big(\textstyle\sum_c E_c\big)^{-1}R^\top E, \qquad (3.47)$$

where $E \triangleq (K + D^{-1})^{-1} = D^{\frac{1}{2}}\big(I + D^{\frac{1}{2}}KD^{\frac{1}{2}}\big)^{-1}D^{\frac{1}{2}}$ is a block diagonal matrix and $R^\top E R = \sum_c E_c$. The Newton iterations can now be computed by inserting eq. (3.47) and (3.45) in eq. (3.39), as detailed in Algorithm 3.3. The predictions use an equivalent route to compute the Gaussian posterior, and the final step of deriving predictive class probabilities is done by Monte Carlo, as shown in Algorithm 3.4.
3.6 Expectation Propagation
The expectation propagation (EP) algorithm [Minka, 2001] is a general approximation tool with a wide range of applications. In this section we present only its application to the specific case of a GP model for binary classification. We note that Opper and Winther [2000] presented a similar method for binary GPC based on the fixed-point equations of the Thouless-Anderson-Palmer (TAP) type of mean-field approximation from statistical physics. The fixed points for the two methods are the same, although the precise details of the two algorithms are different. The EP algorithm naturally lends itself to sparse approximations, which will not be discussed in detail here, but touched upon in section 8.4.
The object of central importance is the posterior distribution over the latent variables, p(f|X, y). In the following notation we suppress the explicit dependence on hyperparameters, see section 3.6.2 for their treatment. The posterior is given by Bayes' rule, as the product of a normalization term, the prior and the likelihood

$$p(\mathbf{f}|X, \mathbf{y}) = \frac{1}{Z}\, p(\mathbf{f}|X)\prod_{i=1}^{n} p(y_i|f_i), \qquad (3.48)$$

where the prior p(f|X) is Gaussian and we have utilized the fact that the likelihood factorizes over the training cases. The normalization term is the marginal likelihood

$$Z = p(\mathbf{y}|X) = \int p(\mathbf{f}|X)\prod_{i=1}^{n} p(y_i|f_i)\, d\mathbf{f}. \qquad (3.49)$$
12Readers who are disturbed by our sloppy treatment of the inverse of singular matrices are invited to insert the matrix ε⁻¹ I_n between Π and Π⊤ in eq. (3.46) and verify that eq. (3.47) coincides with the limit ε → 0.

So far, everything is exactly as in the regression case discussed in chapter 2. However, in the case of classification the likelihood p(y_i|f_i) is not Gaussian, a property that was used heavily in arriving at analytical solutions for the regression framework. In this section we use the probit likelihood (see page 35) for binary classification

$$p(y_i|f_i) = \Phi(f_i\, y_i), \qquad (3.50)$$
and this makes the posterior in eq. (3.48) analytically intractable. To overcome this hurdle in the EP framework we approximate the likelihood by a local likelihood approximation¹³ in the form of an unnormalized Gaussian function in the latent variable f_i

$$p(y_i|f_i) \;\simeq\; t_i(f_i|\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \;\triangleq\; \tilde{Z}_i\, N(f_i|\tilde{\mu}_i, \tilde{\sigma}_i^2), \qquad (3.51)$$
which defines the site parameters Z̃_i, μ̃_i and σ̃_i². Remember that the notation N is used for a normalized Gaussian distribution. Notice that we are approximating the likelihood, i.e. a probability distribution which normalizes over the targets y_i, by an unnormalized Gaussian distribution over the latent variables f_i. This is reasonable, because we are interested in how the likelihood behaves as a function of the latent f_i. In the regression setting we utilized the Gaussian shape of the likelihood, but more to the point, the Gaussian distribution for the outputs y_i also implied a Gaussian shape as a function of the latent variable f_i. In order to compute the posterior we are of course primarily interested in how the likelihood behaves as a function of f_i.¹⁴ The property that the likelihood should normalize over y_i for any value of f_i is not simultaneously achievable with the desideratum of Gaussian dependence on f_i; in the EP approximation we abandon exact normalization for tractability. The product of the (independent) local likelihoods t_i is
$$\prod_{i=1}^{n} t_i(f_i|\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = N(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})\prod_i \tilde{Z}_i, \qquad (3.52)$$
where μ̃ is the vector of the μ̃_i and Σ̃ is diagonal with Σ̃_ii = σ̃_i². We approximate the posterior p(f|X, y) by q(f|X, y)
$$q(\mathbf{f}|X, \mathbf{y}) \;\triangleq\; \frac{1}{Z_{\mathrm{EP}}}\, p(\mathbf{f}|X)\prod_{i=1}^{n} t_i(f_i|\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = N(\boldsymbol{\mu}, \Sigma),
\quad \text{with}\;\; \boldsymbol{\mu} = \Sigma\tilde{\Sigma}^{-1}\tilde{\boldsymbol{\mu}}, \;\; \Sigma = \big(K^{-1} + \tilde{\Sigma}^{-1}\big)^{-1}, \qquad (3.53)$$
where we have used eq. (A.7) to compute the product (and by definition, we know that the distribution must normalize correctly over f). Notice that we use the tilde-parameters μ̃, σ̃² and Z̃ for the local likelihood approximations,
13Note that although each likelihood approximation is local, the posterior approximation produced by the EP algorithm is global because the latent variables are coupled through the prior.
14However, for computing the marginal likelihood normalization becomes crucial, see section 3.6.2.

and plain μ and Σ for the parameters of the approximate posterior. The normalizing term of eq. (3.53), Z_EP = q(y|X), is the EP algorithm's approximation to the normalizing term Z from eq. (3.48) and eq. (3.49).
How do we choose the parameters of the local approximating distributions t_i? One of the most obvious ideas would be to minimize the Kullback-Leibler (KL) divergence (see section A.5) between the posterior and its approximation: KL(p(f|X, y) || q(f|X, y)). Direct minimization of this KL divergence for the joint distribution on f turns out to be intractable. One can alternatively choose to minimize the reversed KL divergence KL(q(f|X, y) || p(f|X, y)) with respect to the distribution q(f|X, y); this has been used to carry out variational inference for GPC, see, e.g., Seeger [2000].
Instead, the key idea in the EP algorithm is to update the individual t_i approximations sequentially. Conceptually this is done by iterating the following four steps: we start from some current approximate posterior, from which we leave out the current t_i, giving rise to a marginal cavity distribution. Secondly, we combine the cavity distribution with the exact likelihood p(y_i|f_i) to get the desired non-Gaussian marginal. Thirdly, we choose a Gaussian approximation to the non-Gaussian marginal, and in the final step we compute the t_i which makes the posterior have the desired marginal from step three. These four steps are iterated until convergence.
In more detail, we optimize the t_i approximations sequentially, using the approximation so far for all the other variables. In particular the approximate posterior for f_i contains three kinds of terms:

1. the prior p(f|X)
2. the local approximate likelihoods t_j for all cases j ≠ i
3. the exact likelihood for case i, p(y_i|f_i) = Φ(y_i f_i)

Our goal is to combine these sources of information and choose parameters of t_i such that the marginal posterior is as accurate as possible. We will first combine the prior and the local likelihood approximations into the cavity distribution

$$q_{-i}(f_i) \;\propto\; \int p(\mathbf{f}|X)\prod_{j\neq i} t_j(f_j|\tilde{Z}_j, \tilde{\mu}_j, \tilde{\sigma}_j^2)\, df_j, \qquad (3.54)$$

and subsequently combine this with the exact likelihood for case i. Conceptually, one can think of the combination of prior and the n − 1 approximate likelihoods in eq. (3.54) in two ways, either by explicitly multiplying out the terms, or equivalently by removing approximate likelihood i from the approximate posterior in eq. (3.53). Here we will follow the latter approach. The marginal for f_i from q(f|X, y) is obtained by using eq. (A.6) in eq. (3.53) to give

$$q(f_i|X, \mathbf{y}) = N(f_i|\mu_i, \sigma_i^2), \qquad (3.55)$$

where $\sigma_i^2 = \Sigma_{ii}$. This marginal eq. (3.55) contains one approximate term

(namely $\tilde t_i$) too many, so we need to divide it by $\tilde t_i$ to get the cavity distribution
\[ q_{-i}(f_i) \;\triangleq\; \mathcal N(f_i \mid \mu_{-i}, \sigma_{-i}^2), \tag{3.56} \]
\[ \text{where}\quad \mu_{-i} = \sigma_{-i}^2\big(\sigma_i^{-2}\mu_i - \tilde\sigma_i^{-2}\tilde\mu_i\big), \quad\text{and}\quad \sigma_{-i}^2 = \big(\sigma_i^{-2} - \tilde\sigma_i^{-2}\big)^{-1}. \]
Note that the cavity distribution and its parameters carry the subscript $-i$, indicating that they include all cases except number $i$. The easiest way to verify eq. (3.56) is to multiply the cavity distribution by the local likelihood approximation $\tilde t_i$ from eq. (3.51) using eq. (A.7) to recover the marginal in eq. (3.55). Notice that despite the appearance of eq. (3.56), the cavity mean and variance are of course not dependent on $\tilde\mu_i$ and $\tilde\sigma_i^2$, see exercise 3.10.5.
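In practice the division is done in natural (precision) parameters, which makes the cavity computation a two-line operation. The following is a minimal numpy sketch of eq. (3.56); the function name and argument layout are ours, not taken from any published implementation, and the parameters are assumed to be given in variance form.

```python
import numpy as np

def cavity_params(mu_i, sigma2_i, mu_tilde_i, sigma2_tilde_i):
    """Cavity mean and variance of eq. (3.56): subtract the natural
    parameters of the site t_i from those of the marginal q(f_i)."""
    tau_cav = 1.0 / sigma2_i - 1.0 / sigma2_tilde_i        # cavity precision
    nu_cav = mu_i / sigma2_i - mu_tilde_i / sigma2_tilde_i
    return nu_cav / tau_cav, 1.0 / tau_cav                 # mu_{-i}, sigma^2_{-i}
```

A site that has not yet been updated can be represented by $\tilde\sigma_i^2 = \infty$ (i.e. $\tilde\tau_i = 0$), in which case the cavity simply equals the current marginal.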
To proceed, we need to find the new (un-normalized) Gaussian marginal which best approximates the product of the cavity distribution and the exact likelihood
\[ \hat q(f_i) \;\triangleq\; \hat Z_i\, \mathcal N(\hat\mu_i, \hat\sigma_i^2) \;\simeq\; q_{-i}(f_i)\, p(y_i \mid f_i). \tag{3.57} \]
It is well known that when $q(x)$ is Gaussian, the distribution $q(x)$ which minimizes $\mathrm{KL}(p(x)\,\|\,q(x))$ is the one whose first and second moments match those of $p(x)$, see eq. (A.24). As $\hat q(f_i)$ is un-normalized we choose additionally to impose the condition that the zeroth moments (normalizing constants) should match when choosing the parameters of $\hat q(f_i)$ to match the right hand side of eq. (3.57). This process is illustrated in Figure 3.4.
The derivation of the moments is somewhat lengthy, so we have moved the details to section 3.9. The desired posterior marginal moments are
\[ \hat Z_i = \Phi(z_i), \qquad \hat\mu_i = \mu_{-i} + \frac{y_i\,\sigma_{-i}^2\, \mathcal N(z_i)}{\Phi(z_i)\sqrt{1+\sigma_{-i}^2}}, \]
\[ \hat\sigma_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4\, \mathcal N(z_i)}{(1+\sigma_{-i}^2)\,\Phi(z_i)}\Big(z_i + \frac{\mathcal N(z_i)}{\Phi(z_i)}\Big), \qquad \text{where}\quad z_i = \frac{y_i\,\mu_{-i}}{\sqrt{1+\sigma_{-i}^2}}. \tag{3.58} \]
The final step is to compute the parameters of the approximation $\tilde t_i$ which achieves a match with the desired moments. In particular, the product of the cavity distribution and the local approximation must have the desired moments, leading to
\[ \tilde\mu_i = \tilde\sigma_i^2\big(\hat\sigma_i^{-2}\hat\mu_i - \sigma_{-i}^{-2}\mu_{-i}\big), \qquad \tilde\sigma_i^2 = \big(\hat\sigma_i^{-2} - \sigma_{-i}^{-2}\big)^{-1}, \]
\[ \tilde Z_i = \hat Z_i \sqrt{2\pi}\sqrt{\sigma_{-i}^2 + \tilde\sigma_i^2}\, \exp\!\Big(\tfrac{1}{2}\,\frac{(\mu_{-i}-\tilde\mu_i)^2}{\sigma_{-i}^2 + \tilde\sigma_i^2}\Big), \tag{3.59} \]
which is easily verified by multiplying the cavity distribution by the local approximation using eq. (A.7) to obtain eq. (3.58). Note that the desired marginal posterior variance $\hat\sigma_i^2$ given by eq. (3.58) is guaranteed to be smaller than the cavity variance, such that $\tilde\sigma_i^2 > 0$ is always satisfied.¹⁵
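The moment computation of eq. (3.58) together with the site update of eq. (3.59) is the heart of a single EP step. Below is a minimal numpy/scipy sketch of these two equations for the probit likelihood; the function names are ours, and the code assumes the cavity parameters have already been computed as in eq. (3.56).

```python
import numpy as np
from scipy.stats import norm

def tilted_moments(y_i, mu_cav, sigma2_cav):
    """Zeroth, first and second moments of q_{-i}(f_i) * Phi(y_i f_i), eq. (3.58)."""
    z = y_i * mu_cav / np.sqrt(1.0 + sigma2_cav)
    Z_hat = norm.cdf(z)
    ratio = np.exp(norm.logpdf(z) - norm.logcdf(z))   # N(z)/Phi(z), computed stably
    mu_hat = mu_cav + y_i * sigma2_cav * ratio / np.sqrt(1.0 + sigma2_cav)
    sigma2_hat = sigma2_cav - sigma2_cav**2 * ratio * (z + ratio) / (1.0 + sigma2_cav)
    return Z_hat, mu_hat, sigma2_hat

def site_from_moments(Z_hat, mu_hat, sigma2_hat, mu_cav, sigma2_cav):
    """Site parameters that reproduce the desired marginal moments, eq. (3.59)."""
    sigma2_tilde = 1.0 / (1.0 / sigma2_hat - 1.0 / sigma2_cav)
    mu_tilde = sigma2_tilde * (mu_hat / sigma2_hat - mu_cav / sigma2_cav)
    Z_tilde = Z_hat * np.sqrt(2.0 * np.pi * (sigma2_cav + sigma2_tilde)) \
              * np.exp(0.5 * (mu_cav - mu_tilde) ** 2 / (sigma2_cav + sigma2_tilde))
    return Z_tilde, mu_tilde, sigma2_tilde
```

Note that $\sqrt{2\pi}\sqrt{\sigma_{-i}^2+\tilde\sigma_i^2}$ has been combined into a single square root; otherwise the code follows eqs. (3.58)–(3.59) term by term.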
This completes the update for a local likelihood approximation $\tilde t_i$. We then have to update the approximate posterior using eq. (3.53), but since only a single site has changed one can do this with a computationally efficient rank-one update, see section 3.6.3.
¹⁵ In cases where the likelihood is log-concave, one can show that $\tilde\sigma_i^2 > 0$, but for a general likelihood there may be no such guarantee.
Figure 3.4: Approximating a single likelihood term by a Gaussian. Panel (a), dash-dotted: the exact likelihood, $\Phi(f_i)$ (the corresponding target being $y_i = 1$) as a function of the latent $f_i$; dotted: Gaussian cavity distribution $\mathcal N(f_i \mid \mu_{-i}=1, \sigma_{-i}^2=9)$; solid: posterior; dashed: posterior approximation. Panel (b) shows an enlargement of panel (a).
The EP algorithm is used iteratively, updating each local approximation in turn. It is clear that several passes over the data are required, since an update of one local approximation potentially influences all of the approximate marginal posteriors.
3.6.1 Predictions
The procedure for making predictions in the EP framework closely resembles the algorithm for the Laplace approximation in section 3.4.2. EP gives a Gaussian approximation to the posterior distribution, eq. (3.53). The approximate predictive mean for the latent variable $f_*$ becomes
\[ \mathbb E_q[f_* \mid X, \mathbf y, \mathbf x_*] \;=\; \mathbf k_*^\top K^{-1}\boldsymbol\mu \;=\; \mathbf k_*^\top K^{-1}\big(K^{-1}+\tilde\Sigma^{-1}\big)^{-1}\tilde\Sigma^{-1}\tilde{\boldsymbol\mu} \;=\; \mathbf k_*^\top\big(K+\tilde\Sigma\big)^{-1}\tilde{\boldsymbol\mu}. \tag{3.60} \]
The approximate latent predictive variance is analogous to the derivation from eq. (3.23) and eq. (3.24), with $\tilde\Sigma^{-1}$ playing the role of $W$
\[ \mathbb V_q[f_* \mid X, \mathbf y, \mathbf x_*] \;=\; k(\mathbf x_*, \mathbf x_*) - \mathbf k_*^\top\big(K+\tilde\Sigma\big)^{-1}\mathbf k_*. \tag{3.61} \]
The approximate predictive distribution for the binary target becomes
\[ q(y_* = 1 \mid X, \mathbf y, \mathbf x_*) \;=\; \mathbb E_q[\pi_* \mid X, \mathbf y, \mathbf x_*] \;=\; \int \Phi(f_*)\, q(f_* \mid X, \mathbf y, \mathbf x_*)\, df_*, \tag{3.62} \]
where $q(f_* \mid X, \mathbf y, \mathbf x_*)$ is the approximate latent predictive Gaussian with mean and variance given by eq. (3.60) and eq. (3.61). This integral is readily evaluated using eq. (3.80), giving the predictive probability
\[ q(y_* = 1 \mid X, \mathbf y, \mathbf x_*) \;=\; \Phi\!\left( \frac{\mathbf k_*^\top (K+\tilde\Sigma)^{-1}\tilde{\boldsymbol\mu}}{\sqrt{1 + k(\mathbf x_*, \mathbf x_*) - \mathbf k_*^\top (K+\tilde\Sigma)^{-1}\mathbf k_*}} \right). \tag{3.63} \]
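Equations (3.60), (3.61) and (3.63) translate directly into a few lines of linear algebra. The sketch below is a deliberately naive version that factorizes $K+\tilde\Sigma$ directly; the numerically preferable formulation in terms of the matrix $B$ and its Cholesky factor is given in section 3.6.3 and Algorithm 3.6. All names here are ours.

```python
import numpy as np
from scipy.stats import norm

def ep_predict(K, k_star, k_ss, mu_tilde, sigma2_tilde):
    """EP predictions for one test input.
    K: n x n training covariance, k_star: n-vector k(X, x_*),
    k_ss: scalar k(x_*, x_*), mu_tilde/sigma2_tilde: site parameters."""
    A = K + np.diag(sigma2_tilde)                        # K + Sigma_tilde
    f_mean = k_star @ np.linalg.solve(A, mu_tilde)       # eq. (3.60)
    f_var = k_ss - k_star @ np.linalg.solve(A, k_star)   # eq. (3.61)
    p_class1 = norm.cdf(f_mean / np.sqrt(1.0 + f_var))   # eq. (3.63)
    return f_mean, f_var, p_class1
```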
3.6.2 Marginal Likelihood
The EP approximation to the marginal likelihood can be found from the normalization of eq. (3.53)
\[ Z_{\mathrm{EP}} = q(\mathbf y \mid X) \;=\; \int p(\mathbf f \mid X) \prod_{i=1}^{n} \tilde t_i(f_i \mid \tilde Z_i, \tilde\mu_i, \tilde\sigma_i^2)\, d\mathbf f. \tag{3.64} \]
Using eq. (A.7) and eq. (A.8) in an analogous way to the treatment of the regression setting in equations (2.28) and (2.30) we arrive at
\[ \log Z_{\mathrm{EP}} \;=\; -\tfrac{1}{2}\log\big|K + \tilde\Sigma\big| \;-\; \tfrac{1}{2}\tilde{\boldsymbol\mu}^\top \big(K + \tilde\Sigma\big)^{-1}\tilde{\boldsymbol\mu} \;+\; \sum_{i=1}^{n}\log\Phi\!\Big(\frac{y_i\,\mu_{-i}}{\sqrt{1+\sigma_{-i}^2}}\Big) \;+\; \tfrac{1}{2}\sum_{i=1}^{n}\log\big(\sigma_{-i}^2 + \tilde\sigma_i^2\big) \;+\; \sum_{i=1}^{n}\frac{(\mu_{-i}-\tilde\mu_i)^2}{2\big(\sigma_{-i}^2 + \tilde\sigma_i^2\big)}, \tag{3.65} \]
where $\theta$ denotes the hyperparameters of the covariance function. This expression has a nice intuitive interpretation: the first two terms are the marginal likelihood for a regression model for $\tilde{\boldsymbol\mu}$, each component of which has independent Gaussian noise of variance $\tilde\Sigma_{ii}$ (as $\tilde\Sigma$ is diagonal), cf. eq. (2.30). The remaining three terms come from the normalization constants $\tilde Z_i$. The first of these penalizes the cavity (or leave-one-out) distributions for not agreeing with the classification labels, see eq. (3.82). In other words, we can see that the marginal likelihood combines two desiderata: (1) the means of the local likelihood approximations should be well predicted by a GP, and (2) the corresponding latent function, when ignoring a particular training example, should be able to predict the corresponding classification label well.
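Written out term by term, eq. (3.65) is cheap to evaluate once a sweep of EP has produced site and cavity parameters. The following is a direct, un-optimized sketch of the five terms; section 3.6.3 rewrites them in a numerically safer form (eqs. (3.73)–(3.75)), which is what one would use in practice.

```python
import numpy as np
from scipy.stats import norm

def log_Z_ep(K, y, mu_tilde, sigma2_tilde, mu_cav, sigma2_cav):
    """Direct evaluation of the five terms of eq. (3.65)."""
    A = K + np.diag(sigma2_tilde)
    _, logdet = np.linalg.slogdet(A)
    t1 = -0.5 * logdet
    t2 = -0.5 * mu_tilde @ np.linalg.solve(A, mu_tilde)
    t3 = np.sum(norm.logcdf(y * mu_cav / np.sqrt(1.0 + sigma2_cav)))
    t4 = 0.5 * np.sum(np.log(sigma2_cav + sigma2_tilde))
    t5 = np.sum((mu_cav - mu_tilde) ** 2 / (2.0 * (sigma2_cav + sigma2_tilde)))
    return t1 + t2 + t3 + t4 + t5
```

The third term is the leave-one-out style fit penalty discussed above; the fourth and fifth terms, together with the first two, account for the Gaussian bookkeeping of the site approximations.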
3.6.3 Implementation
The implementation for the EP algorithm follows the derivation in the previous section closely, except that care has to be taken to achieve numerical stability, in similar ways to the considerations for Laplace's method in section 3.4.3. In addition, we wish to be able to specifically handle the case where some site variances $\tilde\sigma_i^2$ may tend to infinity; this corresponds to ignoring the corresponding likelihood terms, and can form the basis of sparse approximations, touched upon in section 8.4. In this limit, everything remains well-defined, although this is not obvious e.g. from looking at eq. (3.65). It turns out to be slightly more convenient to use natural parameters $\tilde\tau_i, \tilde\nu_i$ and $\tau_{-i}, \nu_{-i}$ for the site and cavity parameters
\[ \tilde\tau_i = \tilde\sigma_i^{-2}, \quad \tilde\nu_i = \tilde\tau_i\tilde\mu_i, \quad \tilde S = \mathrm{diag}(\tilde{\boldsymbol\tau}), \quad \tau_{-i} = \sigma_{-i}^{-2}, \quad \nu_{-i} = \tau_{-i}\mu_{-i}, \tag{3.66} \]
rather than $\tilde\sigma_i^2, \tilde\mu_i$ and $\sigma_{-i}^2, \mu_{-i}$ themselves. The symmetric matrix of central importance is
\[ B \;=\; I + \tilde S^{\frac12} K \tilde S^{\frac12}, \tag{3.67} \]
which plays a role equivalent to eq. (3.26). Expressions involving the inverse of $B$ are computed via Cholesky factorization, which is numerically stable since the eigenvalues of $B$ are bounded below by one.
    input: K (covariance matrix), y (±1 targets)
 2: ν̃ := 0, τ̃ := 0, Σ := K, μ := 0                            (initialization and eq. (3.53))
    repeat
 4:   for i := 1 to n do
        τ_{-i} := σ_i^{-2} − τ̃_i                               (compute approximate cavity parameters
 6:     ν_{-i} := σ_i^{-2} μ_i − ν̃_i                             ν_{-i} and τ_{-i} using eq. (3.56))
        compute the marginal moments μ̂_i and σ̂_i^2 using eq. (3.58)
 8:     Δτ̃ := σ̂_i^{-2} − τ_{-i} − τ̃_i  and  τ̃_i := τ̃_i + Δτ̃    (update the site parameters
        ν̃_i := σ̂_i^{-2} μ̂_i − ν_{-i}                             τ̃_i and ν̃_i using eq. (3.59))
10:     Σ := Σ − ((Δτ̃)^{-1} + Σ_ii)^{-1} s_i s_i^⊤               (update Σ and μ by eq. (3.70) and
        μ := Σ ν̃                                                eq. (3.53); s_i is column i of Σ)
12:   end for
      L := cholesky(I_n + S̃^{1/2} K S̃^{1/2})                    (re-compute the approximate posterior
14:   V := L \ (S̃^{1/2} K)                                       parameters Σ and μ using
      Σ := K − V^⊤ V  and  μ := Σ ν̃                              eq. (3.53) and eq. (3.68))
16: until convergence
    compute log Z_EP using eq. (3.65), (3.73) and (3.74) and the existing L
18: return: ν̃, τ̃ (natural site parameters), log Z_EP (approx. log marginal likelihood)

Algorithm 3.5: Expectation Propagation for binary classification. The targets y are used only in line 7. In lines 13–15 the parameters of the approximate posterior are re-computed (although they already exist); this is done because of the large number of rank-one updates in line 10 which would eventually cause loss of numerical precision in Σ. The computational complexity is dominated by the rank-one updates in line 10, which take O(n²) per variable, i.e. O(n³) for an entire sweep over all variables. Similarly re-computing Σ in lines 13–15 is O(n³).
The parameters of the Gaussian approximate posterior from eq. (3.53) are computed as
\[ \Sigma \;=\; \big(K^{-1} + \tilde S\big)^{-1} \;=\; K - K\big(K + \tilde S^{-1}\big)^{-1}K \;=\; K - K\tilde S^{\frac12} B^{-1} \tilde S^{\frac12} K. \tag{3.68} \]
After updating the parameters of a site, we need to update the approximate posterior eq. (3.53) taking the new site parameters into account. For the inverse covariance matrix of the approximate posterior we have from eq. (3.53)
\[ \Sigma^{-1} = K^{-1} + \tilde S, \quad\text{and thus}\quad \Sigma_{\mathrm{new}}^{-1} = K^{-1} + \tilde S_{\mathrm{old}} + \big(\tilde\tau_i^{\mathrm{new}} - \tilde\tau_i^{\mathrm{old}}\big)\mathbf e_i\mathbf e_i^\top, \tag{3.69} \]
where $\mathbf e_i$ is a unit vector in direction $i$, and we have used that $\tilde S = \mathrm{diag}(\tilde{\boldsymbol\tau})$. Using the matrix inversion lemma eq. (A.9) on eq. (3.69) we obtain the new $\Sigma$
\[ \Sigma^{\mathrm{new}} \;=\; \Sigma^{\mathrm{old}} - \Big(\big(\tilde\tau_i^{\mathrm{new}} - \tilde\tau_i^{\mathrm{old}}\big)^{-1} + \Sigma_{ii}^{\mathrm{old}}\Big)^{-1} \mathbf s_i\mathbf s_i^\top, \tag{3.70} \]
in time $O(n^2)$, where $\mathbf s_i$ is the $i$-th column of $\Sigma^{\mathrm{old}}$. The posterior mean is then calculated from eq. (3.53).
In the EP algorithm each site is updated in turn, and several passes over all sites are required. Pseudocode for the EP-GPC algorithm is given in Algorithm 3.5.
    input: ν̃, τ̃ (natural site parameters), X (inputs), y (±1 targets),
           k (covariance function), x_* (test input)
 2: L := cholesky(I_n + S̃^{1/2} K S̃^{1/2})                      (B = I_n + S̃^{1/2} K S̃^{1/2})
    z := S̃^{1/2} L^⊤\(L\(S̃^{1/2} K ν̃))
 4: f̄_* := k(x_*)^⊤ (ν̃ − z)                                     (eq. (3.60) using eq. (3.71))
    v := L\(S̃^{1/2} k(x_*))
 6: V[f_*] := k(x_*, x_*) − v^⊤ v                                (eq. (3.61) using eq. (3.72))
    π̄_* := Φ(f̄_* / √(1 + V[f_*]))                                (eq. (3.63))
 8: return: π̄_* (predictive class probability for class 1)

Algorithm 3.6: Predictions for expectation propagation. The natural site parameters ν̃ and τ̃ of the posterior (which can be computed using Algorithm 3.5) are input. For multiple test inputs lines 4–7 are applied to each test input. Computational complexity is n³/6 + n² operations once (lines 2 and 3) plus n² operations per test case (line 5), although the Cholesky decomposition in line 2 could be avoided by storing it in Algorithm 3.5. Note the close similarity to Algorithm 3.2 on page 47.
There is no formal guarantee of convergence, but several authors have reported that EP for Gaussian process models works relatively well.¹⁶
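To make the structure of Algorithm 3.5 concrete, here is a compact numpy/scipy sketch of sequential EP for binary GPC with the probit likelihood. It is deliberately simple: after every site update the posterior is re-computed from scratch via eq. (3.68) (an $O(n^3)$ operation), whereas the book's algorithm applies the $O(n^2)$ rank-one update of eq. (3.70) and only re-computes at the end of each sweep. The function and variable names are ours; this is an illustrative sketch, not the book's reference implementation.

```python
import numpy as np
from scipy.stats import norm

def ep_binary_gpc(K, y, n_sweeps=10):
    """Sequential EP for GP binary classification (probit likelihood).
    K: n x n covariance matrix, y: targets in {+1, -1}.
    Returns the natural site parameters and the approximate posterior."""
    n = K.shape[0]
    nu_t = np.zeros(n)          # site parameters nu_tilde = tau_tilde * mu_tilde
    tau_t = np.zeros(n)         # site precisions tau_tilde
    Sigma, mu = K.copy(), np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):
            # cavity parameters, eq. (3.56), in natural form
            tau_cav = 1.0 / Sigma[i, i] - tau_t[i]
            nu_cav = mu[i] / Sigma[i, i] - nu_t[i]
            mu_cav, s2_cav = nu_cav / tau_cav, 1.0 / tau_cav
            # tilted moments for the probit likelihood, eq. (3.58)
            z = y[i] * mu_cav / np.sqrt(1.0 + s2_cav)
            ratio = np.exp(norm.logpdf(z) - norm.logcdf(z))   # N(z)/Phi(z)
            mu_hat = mu_cav + y[i] * s2_cav * ratio / np.sqrt(1.0 + s2_cav)
            s2_hat = s2_cav - s2_cav**2 * ratio * (z + ratio) / (1.0 + s2_cav)
            # site update, eq. (3.59)
            tau_t[i] = 1.0 / s2_hat - tau_cav
            nu_t[i] = mu_hat / s2_hat - nu_cav
            # re-compute the posterior via eq. (3.68); the book uses the
            # cheaper rank-one update of eq. (3.70) at this point instead
            s_sqrt = np.sqrt(tau_t)
            B = np.eye(n) + s_sqrt[:, None] * K * s_sqrt[None, :]
            L = np.linalg.cholesky(B)
            V = np.linalg.solve(L, s_sqrt[:, None] * K)
            Sigma = K - V.T @ V
            mu = Sigma @ nu_t
    return nu_t, tau_t, mu, Sigma
```

A fixed number of sweeps is used above for brevity; in practice one monitors the change in the site parameters (or in the approximate log marginal likelihood) and stops when it falls below a tolerance.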
For the predictive distribution, we get the mean from eq. (3.60) which is evaluated using
\[ \mathbb E_q[f_* \mid X, \mathbf y, \mathbf x_*] \;=\; \mathbf k_*^\top\big(K + \tilde S^{-1}\big)^{-1}\tilde S^{-1}\tilde{\boldsymbol\nu} \;=\; \mathbf k_*^\top\big(I - (K + \tilde S^{-1})^{-1}K\big)\tilde{\boldsymbol\nu} \;=\; \mathbf k_*^\top\big(I - \tilde S^{\frac12}B^{-1}\tilde S^{\frac12}K\big)\tilde{\boldsymbol\nu}, \tag{3.71} \]
and the predictive variance from eq. (3.61) similarly by
\[ \mathbb V_q[f_* \mid X, \mathbf y, \mathbf x_*] \;=\; k(\mathbf x_*, \mathbf x_*) - \mathbf k_*^\top\big(K + \tilde S^{-1}\big)^{-1}\mathbf k_* \;=\; k(\mathbf x_*, \mathbf x_*) - \mathbf k_*^\top \tilde S^{\frac12}B^{-1}\tilde S^{\frac12}\mathbf k_*. \tag{3.72} \]
Pseudocode for making predictions using EP is given in Algorithm 3.6.
Finally, we need to evaluate the approximate log marginal likelihood from eq. (3.65). There are several terms which need careful consideration, principally due to the fact that the $\tilde\tau_i$ values may be arbitrarily small (and cannot safely be inverted). We start with the fourth and first terms of eq. (3.65)
\[ \tfrac{1}{2}\log\big|T^{-1} + \tilde S^{-1}\big| - \tfrac{1}{2}\log\big|K + \tilde S^{-1}\big| \;=\; \tfrac{1}{2}\log\Big(\big|\tilde S^{-1}\big|\,\big|I + \tilde S T^{-1}\big|\Big) - \tfrac{1}{2}\log\Big(\big|\tilde S^{-1}\big|\,|B|\Big) \;=\; \tfrac{1}{2}\sum_i \log\big(1 + \tilde\tau_i\tau_{-i}^{-1}\big) - \sum_i \log L_{ii}, \tag{3.73} \]
where $T$ is a diagonal matrix of cavity precisions $T_{ii} = \tau_{-i} = \sigma_{-i}^{-2}$ and $L$ is the Cholesky factorization of $B$. In eq. (3.73) we have factored out the matrix $\tilde S^{-1}$ from both determinants, and the terms cancel. Continuing with the part of the
¹⁶ It has been conjectured (but not proven) by L. Csató (personal communication) that EP is guaranteed to converge if the likelihood is log-concave.
fifth term from eq. (3.65) which is quadratic in $\tilde{\boldsymbol\mu}$ together with the second term
\[ \tfrac{1}{2}\tilde{\boldsymbol\mu}^\top\big(T^{-1} + \tilde S^{-1}\big)^{-1}\tilde{\boldsymbol\mu} - \tfrac{1}{2}\tilde{\boldsymbol\mu}^\top\big(K + \tilde S^{-1}\big)^{-1}\tilde{\boldsymbol\mu} \;=\; \tfrac{1}{2}\tilde{\boldsymbol\nu}^\top\Big(\tilde S^{-1}\big(T^{-1}+\tilde S^{-1}\big)^{-1}\tilde S^{-1} - \tilde S^{-1}\big(K+\tilde S^{-1}\big)^{-1}\tilde S^{-1}\Big)\tilde{\boldsymbol\nu} \;=\; \tfrac{1}{2}\tilde{\boldsymbol\nu}^\top\Big(K - K\tilde S^{\frac12}B^{-1}\tilde S^{\frac12}K - \big(T + \tilde S\big)^{-1}\Big)\tilde{\boldsymbol\nu}, \tag{3.74} \]
where in eq. (3.74) we apply the matrix inversion lemma eq. (A.9) to both parentheses to be inverted. The remainder of the fifth term in eq. (3.65) is evaluated using the identity
\[ \tfrac{1}{2}\boldsymbol\mu_{-}^\top\big(T^{-1} + \tilde S^{-1}\big)^{-1}\boldsymbol\mu_{-} - \tilde{\boldsymbol\mu}^\top\big(T^{-1} + \tilde S^{-1}\big)^{-1}\boldsymbol\mu_{-} \;=\; \tfrac{1}{2}\boldsymbol\mu_{-}^\top T\big(\tilde S + T\big)^{-1}\big(\tilde S\boldsymbol\mu_{-} - 2\tilde{\boldsymbol\nu}\big), \tag{3.75} \]
where $\boldsymbol\mu_{-}$ is the vector of cavity means $\mu_{-i}$. The third term in eq. (3.65) requires no special treatment and can be evaluated as written.
3.7 Experiments
In this section we present the results of applying the algorithms for GP classification discussed in the previous sections to several data sets. The purpose is firstly to illustrate the behaviour of the methods and secondly to gain some insights into how good the performance is compared to some other commonly used machine learning methods for classification.
Section 3.7.1 illustrates the action of a GP classifier on a toy binary prediction problem with a 2-d input space, and shows the effect of varying the lengthscale $l$ in the SE covariance function. In section 3.7.2 we illustrate and compare the behaviour of the two approximate GP methods on a simple one-dimensional binary task. In section 3.7.3 we present results for a binary GP classifier on a handwritten digit classification task, and study the effect of varying the kernel parameters. In section 3.7.4 we carry out a similar study using a multi-class GP classifier to classify digits from all ten classes 0–9. In section 3.8 we discuss the methods from both experimental and theoretical viewpoints.
3.7.1 A Toy Problem
Figure 3.5 illustrates the operation of a Gaussian process classifier on a binary problem using the squared exponential kernel with a variable lengthscale and the logistic response function. The Laplace approximation was used to make the plots. The data points lie within the square $[0,1]^2$, as shown in panel (a). Notice in particular the lone white point amongst the black points in the NE corner, and the lone black point amongst the white points in the SW corner.
In panel (b) the lengthscale is $l = 0.1$, a relatively short value. In this case the latent function is free to vary relatively quickly and so the classifications provided by thresholding the predictive probability $\mathbb E_q[\pi(\mathbf x)\mid\mathbf y]$ at 0.5 agree with the training labels at all data points.
Figure 3.5: Panel (a) shows the location of the data points in the two-dimensional space $[0,1]^2$. The two classes are labelled as open circles ($+1$) and closed circles ($-1$). Panels (b)–(d) show contour plots of the predictive probability $\mathbb E_q[\pi(\mathbf x_*)\mid\mathbf y]$ for signal variance $\sigma_f^2 = 9$ and lengthscales $l$ of 0.1, 0.2 and 0.3 respectively. The decision boundaries between the two classes are shown by the thicker black lines. The maximum value attained is 0.84, and the minimum is 0.19.
In contrast, in panel (d) the lengthscale is set to $l = 0.3$. Now the latent function must vary more smoothly, and so the two lone points are misclassified. Panel (c) was obtained with $l = 0.2$. As would be expected, the decision boundaries are more complex for shorter lengthscales. Methods for setting the hyperparameters based on the data are discussed in chapter 5.
Figure 3.6: One-dimensional toy classification dataset. Panel (a) shows the dataset, where points from class $+1$ have been plotted at 1 and class $-1$ at 0, together with the predictive probability for Laplace's method and the EP approximation. Also shown is the probability $p(y=+1\mid x)$ of the data generating process. Panel (b) shows the corresponding distribution of the latent function $f(x)$, showing curves for the mean, and $\pm 2$ standard deviations, corresponding to 95% confidence regions.
3.7.2 One-dimensional Example
Although Laplace's method and the EP approximation often give similar results, we here present a simple one-dimensional problem which highlights some of the differences between the methods. The data, shown in Figure 3.6(a), consists of 60 data points in three groups, generated from a mixture of three Gaussians, centered on $-6$ (20 points), $0$ (30 points) and $2$ (10 points), where the middle component has label $-1$ and the two other components label $+1$; all components have standard deviation 0.8; thus the two left-most components are well separated, whereas the two right-most components overlap.
Both approximation methods are shown with the same value of the hyperparameters, $l = 2.6$ and $\sigma_f = 7.0$, chosen to maximize the approximate marginal likelihood for Laplace's method. Notice in Figure 3.6 that there is a considerable difference in the value of the predictive probability for negative inputs. The Laplace approximation seems overly cautious, given the very clear separation of the data. This effect can be explained as a consequence of the intuition that the influence of well-explained data points is effectively reduced, see the discussion around eq. (3.19). Because the points in the left hand cluster are relatively well-explained by the model, they don't contribute as strongly to the posterior, and thus the predictive probability never gets very close to 1. Notice in Figure 3.6(b) the 95% confidence region for the latent function for Laplace's method actually includes functions that are negative at $x = -6$, which does not seem appropriate. For the positive examples centered around $x = 2$ on the right-hand side of Figure 3.6(b), this effect is not visible, because the points around the transition between the classes at $x \simeq 1$ are not so well-explained; this is because the points near the boundary are competing against the points from the other class, attempting to pull the latent function in opposite directions. Consequently, the data points in this region all contribute strongly.
Another sign of this effect is that the uncertainty in the latent function, which is closely related to the effective local density of the data, is very small in the region around $x = 1$; the small uncertainty reveals a high effective density, which is caused by all data points in the region contributing with full weight. It should be emphasized that the example was artificially constructed specifically to highlight this effect.
Finally, Figure 3.6 also shows clearly the effects of uncertainty in the latent function on $\mathbb E_q[\pi_*\mid\mathbf y]$. In the region between $x = 2$ and $x = 4$, the latent mean in panel (b) increases slightly, but the predictive probability decreases in this region in panel (a). This is caused by the increase in uncertainty for the latent function; when the widely varying functions are squashed through the non-linearity it is possible for both classes to get high probability, and the average prediction becomes less extreme.
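This softening effect is easy to reproduce numerically. With a cumulative Gaussian likelihood the averaged prediction of eq. (3.25) has the closed form $\Phi(\mu/\sqrt{1+\sigma^2})$ (via eq. (3.80)), which can be compared with the MAP-style prediction $\Phi(\mu)$ that ignores the latent variance; the numbers below are illustrative values chosen by us, not taken from the figure.

```python
import numpy as np
from scipy.stats import norm

mu, var = 2.0, 25.0                        # latent predictive mean and variance
p_map = norm.cdf(mu)                       # squash the mean only: ~0.977
p_avg = norm.cdf(mu / np.sqrt(1.0 + var))  # average over the latent Gaussian: ~0.65
print(p_map, p_avg)                        # large latent variance softens the prediction
```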
3.7.3 Binary Handwritten Digit Classification Example
Handwritten digit and character recognition are popular real-world tasks for testing and benchmarking classifiers, with obvious application e.g. in postal services. In this section we consider the discrimination of images of the digit 3 from images of the digit 5 as an example of binary classification; the specific choice was guided by the experience that this is probably one of the most difficult binary subtasks. 10-class classification of the digits 0–9 is described in the following section.
We use the US Postal Service (USPS) database of handwritten digits which consists of 9298 segmented 16×16 greyscale images, normalized so that the intensity of the pixels lies in $[-1, 1]$. The data was originally split into a training set of 7291 cases and a test set of the remaining 2007 cases, and has often been used in this configuration. Unfortunately, the data in the two partitions was collected in slightly different ways, such that the data in the two sets did not stem from the same distribution.¹⁷ Since the basic underlying assumption for most machine learning algorithms is that the distribution of the training and test data should be identical, the original data partitions are not really suitable as a test bed for learning algorithms, the interpretation of the results being hampered by the change in distribution. Secondly, the original test set was rather small, sometimes making it difficult to differentiate the performance of different algorithms. To overcome these two problems, we decided to pool the two partitions and randomly split the data into two identically sized partitions of 4649 cases each. A side-effect is that it is not trivial to compare to results obtained using the original partitions. All experiments reported here use the repartitioned data. The binary 3s vs. 5s data has 767 training cases, divided 406/361 on 3s vs. 5s, while the test set has 773 cases split 418/355.
We present results of both Laplace's method and EP using identical experimental setups. The squared exponential covariance function $k(\mathbf x, \mathbf x') = \sigma_f^2 \exp\!\big(-|\mathbf x - \mathbf x'|^2/2l^2\big)$ was used, so there are two free parameters, namely $\sigma_f$ (the process standard deviation, which controls its vertical scaling), and the lengthscale $l$ (which controls the input lengthscale).
¹⁷ It is well known, e.g., that the original test partition had more difficult cases than the training set.
Figure 3.7: Binary Laplace approximation: 3s vs. 5s discrimination using the USPS data. Panel (a) shows a contour plot of the log marginal likelihood as a function of $\log(l)$ and $\log(\sigma_f)$. The marginal likelihood has an optimum at $\log(l) = 2.85$ and $\log(\sigma_f) = 2.35$, with an optimum value of $\log p(\mathbf y\mid X,\theta) = -99$. Panel (b) shows a contour plot of the amount of information (in excess of a simple baseline model, see text) about the test cases in bits as a function of the same variables. The statistical uncertainty (because of the finite number of test cases) is about $\pm 0.03$ bits (95% confidence interval). Panel (c) shows a histogram of the latent means for the training and test sets respectively at the values of the hyperparameters with optimal marginal likelihood from panel (a). Panel (d) shows the number of test errors (out of 773) when predicting using the sign of the latent mean.
Let $\theta = (\log(l), \log(\sigma_f))$ denote the vector of hyperparameters. We first present the results of Laplace's method in Figure 3.7 and discuss these at some length. We then briefly compare these with the results of the EP method in Figure 3.8.
Figure 3.8: The EP algorithm on the 3s vs. 5s digit discrimination task from the USPS data. Panel (a) shows a contour plot of the log marginal likelihood as a function of the hyperparameters $\log(l)$ and $\log(\sigma_f)$. The marginal likelihood has an optimum at $\log(l) = 2.6$ at the maximum value of $\log(\sigma_f)$, but the log marginal likelihood is essentially flat as a function of $\log(\sigma_f)$ in this region, so a good point is at $\log(\sigma_f) = 4.1$, where the log marginal likelihood has a value of $-90$. Panel (b) shows a contour plot of the amount of information (in excess of the baseline model) about the test cases in bits as a function of the same variables. Zero bits corresponds to no information and one bit to perfect binary generalization. The 773 test cases allow the information to be determined within $\pm 0.035$ bits. Panel (c) shows a histogram of the latent means for the training and test sets respectively at the values of the hyperparameters with optimal marginal likelihood from panel (a). Panel (d) shows the number of test errors (out of 773) when predicting using the sign of the latent mean.
In Figure 3.7(a) we show a contour plot of the approximate log marginal likelihood (LML) $\log q(\mathbf y\mid X,\theta)$ as a function of $\log(l)$ and $\log(\sigma_f)$, obtained from runs on a grid of 17 evenly-spaced values of $\log(l)$ and 23 evenly-spaced values of $\log(\sigma_f)$. Notice that there is a maximum of the marginal likelihood
near $\log(l) = 2.85$ and $\log(\sigma_f) = 2.35$. As will be explained in chapter 5, we would expect that hyperparameters that yield a high marginal likelihood would give rise to good predictions. Notice that an increase of 1 unit on the log scale means that the probability is 2.7 times larger, so the marginal likelihood in Figure 3.7(a) is fairly well peaked.
There are at least two ways we can measure the quality of predictions at the test points. The first is the test log predictive probability $\log_2 p(y_*\mid\mathbf x_*, \mathcal D, \theta)$. In Figure 3.7(b) we plot the average over the test set of the test log predictive probability for the same range of hyperparameters. We express this as the amount of information in bits about the targets, by using log to the base 2. Further, we offset the value by subtracting the amount of information that a simple baseline method would achieve. As a baseline model we use the best possible model which does not use the inputs; in this case, this model would just produce a predictive distribution reflecting the frequency of the two classes in the training set, i.e.
\[ -\tfrac{418}{773}\log_2\!\big(\tfrac{406}{767}\big) - \tfrac{355}{773}\log_2\!\big(\tfrac{361}{767}\big) \;=\; 0.9956 \text{ bits}, \tag{3.76} \]
essentially 1 bit. If the classes had been perfectly balanced, and the training and test partitions also exactly balanced, we would arrive at exactly 1 bit. Thus, our scaled information score used in Figure 3.7(b) would be zero for a method that did random guessing and 1 bit for a method which did perfect classification with complete confidence. The information score measures how much information the model was able to extract from the inputs about the identity of the output. Note that this is not the mutual information between the model output and the test targets, but rather the Kullback–Leibler (KL) divergence between them. Figure 3.7 shows that there is a good qualitative agreement between the marginal likelihood and the test information, compare panels (a) and (b).
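The information score is straightforward to compute from a vector of predictive probabilities. Below is a small numpy sketch that implements the average test log predictive probability in bits, offset by the baseline of eq. (3.76); the function name and argument conventions (labels in $\{+1,-1\}$, with class $+1$ playing the role of the 3s) are our own choices.

```python
import numpy as np

def information_score_bits(p_pred, y_test, n_train_pos, n_train_neg):
    """Average log_2 predictive probability minus the baseline that only
    uses the training class frequencies (cf. eq. (3.76))."""
    p_pred, y_test = np.asarray(p_pred), np.asarray(y_test)
    p_target = np.where(y_test == 1, p_pred, 1.0 - p_pred)   # prob. assigned to the true class
    model_bits = np.mean(np.log2(p_target))
    p_base = n_train_pos / (n_train_pos + n_train_neg)
    base_bits = np.mean(np.where(y_test == 1, np.log2(p_base), np.log2(1.0 - p_base)))
    return model_bits - base_bits   # ~0 for random guessing, ~1 bit for perfect confident classification
```

For the 3s vs. 5s split above (406/361 training, 418/355 test), a perfectly confident and correct classifier scores 0.9956 bits, reproducing eq. (3.76).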
The second and perhaps most commonly used method for measuring the quality of the predictions is to compute the number of test errors made when using the predictions. This is done by computing $\mathbb E_q[\pi_*\mid\mathbf y]$ (see eq. (3.25)) for each test point, thresholding at $1/2$ to get hard predictions and counting the number of errors. Figure 3.7(d) shows the number of errors produced for each entry in the 17×23 grid of values for the hyperparameters. The general trend in this table is that the number of errors is lowest in the top left-hand corner and increases as one moves right and downwards. The number of errors rises dramatically in the far bottom right-hand corner. However, note in general that the number of errors is quite small (there are 773 cases in the test set).
The qualitative differences between the two evaluation criteria depicted in Figure 3.7 panels (b) and (d) may at first sight seem alarming. And although panels (a) and (b) show similar trends, one may worry about using (a) to select the hyperparameters, if one is interested in minimizing the test misclassification rate. Indeed a full understanding of all aspects of these plots is quite involved, but as the following discussion suggests, we can explain the major trends.
First, bear in mind that the effect of increasing l is to make the kernel function broader, so we might expect to observe effects like those in Figure 3.5
where large widths give rise to a lack of flexibility. Keeping $l$ constant, the effect of increasing $\sigma_f$ is to increase the magnitude of the values obtained for $\mathbf f$. By itself this would lead to harder predictions (i.e. predictive probabilities closer to 0 or 1), but we have to bear in mind that the associated variances will also increase and this increased uncertainty for the latent variables tends to soften the predictive probabilities, i.e. move them closer to $1/2$.
The most marked difference between Figure 3.7(b) and (d) is the behaviour in the top left corner, where the classification error rate remains small, but the test information and marginal likelihood are both poor. In the left hand side of the plots, the lengthscale $l$ is very short. This causes most points to be deemed far away from most other points. In this regime the prediction is dominated by the class label of the nearest neighbours, and for the task at hand, this happens to give a low misclassification rate. In this parameter region the test latent variables $\mathbf f_*$ are very close to zero, corresponding to probabilities very close to $1/2$. Consequently, the predictive probabilities carry almost no information about the targets. In the top left corner, the predictive probabilities for all 773 test cases lie in the interval $[0.48, 0.53]$. Notice that a large amount of information implies a high degree of correct classification, but not vice versa. At the optimal marginal likelihood values of the hyperparameters, there are 21 misclassifications, which is slightly higher than the minimum number attained, which is 15 errors.
In exercise 3.10.6 readers are encouraged to investigate further the behaviour of $\mathbf f$ and the predictive probabilities etc. as functions of $\log(l)$ and $\log(\sigma_f)$ for themselves.
In Figure 3.8 we show the results of the same experiment, using the EP method. The findings are qualitatively similar, but there are significant differences. In panel (a) the approximate log marginal likelihood has a different shape than for Laplace's method, and the maximum of the log marginal likelihood is about 9 units on a natural log scale larger (i.e. the marginal probability is $\exp(9) \simeq 8000$ times higher). Also note that the marginal likelihood has a ridge for $\log(l) = 2.6$ that extends into large values of $\log(\sigma_f)$. For these very large latent amplitudes (see also panel (c)) the probit likelihood function is well approximated by a step function (since it transitions from low to high values in the domain $[-3, 3]$). Once we are in this regime, it is of course irrelevant exactly how large the magnitude is, thus the ridge. Notice, however, that this does not imply that the prediction will always be hard, since the variance of the latent function also grows.
Figure 3.8 shows a good qualitative agreement between the approximate log marginal likelihood and the test information, compare panels (a) and (b). The best value of the test information is significantly higher for EP than for Laplace's method. The classification error rates in panel (d) show a fairly similar behaviour to that of Laplace's method. In Figure 3.8(c) we show the latent means for training and test cases. These show a clear separation on the training set, and much larger magnitudes than for Laplace's method. The absolute values of the entries in $\mathbf f$ are quite large, often well in excess of 50, which may suggest very hard predictions (probabilities close to zero or one), since the sigmoid saturates for smaller arguments.
Figure 3.9: MAP vs. averaged predictions for the EP algorithm for the 3s vs. 5s digit discrimination using the USPS data. The optimal values of the hyperparameters from Figure 3.8(a), $\log(l) = 2.6$ and $\log(\sigma_f) = 4.1$, are used. The MAP predictions $\Phi(\mathbb E_q[f_*\mid\mathbf y])$ are hard, mostly being very close to zero or one. On the other hand, the averaged predictions $\mathbb E_q[\pi_*\mid\mathbf y]$ from eq. (3.25) are a lot less extreme. In panel (a) the 21 cases that were misclassified are indicated by crosses (correctly classified cases are shown by points). Note that only 4 of the 21 misclassified points have confident predictions (i.e. outside $[0.1, 0.9]$). Notice that all points fall in the triangles below and above the horizontal line, confirming that averaging does not change the most probable class, and that it always makes the probabilities less extreme (i.e. closer to $1/2$). Panel (b) shows histograms of averaged and MAP predictions, where we have truncated values over 30.
However, when taking the uncertainties in the latent variables into account, and computing the predictions using averaging as in eq. (3.25), the predictive probabilities are softened. In Figure 3.9 we can verify that the averaged predictive probabilities are much less extreme than the MAP predictions.
In order to evaluate the performance of the two approximate methods for GP classification, we compared to a linear probit model, a support vector machine, a least-squares classifier and a nearest neighbour approach, all of which are commonly used in the machine learning community. In Figure 3.10 we show error-reject curves for both the misclassification rate and the test information measure. The error-reject curve shows how the performance develops as a function of the fraction of test cases that is being rejected. To compute these, we first modify the methods that do not naturally produce probabilistic predictions to do so, as described below. Based on the predictive probabilities, we reject test cases for which the maximum predictive probability is smaller than a threshold. Varying the threshold produces the error-reject curve.
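The construction of an error-reject curve from a set of binary predictive probabilities is simple enough to state in a few lines. The sketch below sorts the test cases by confidence and sweeps the rejection rate; it is our own utility, written for illustration, and uses a fixed threshold of $1/2$ for the hard class decision.

```python
import numpy as np

def error_reject_curve(p_pred, y_test):
    """Error rate on the accepted cases as a function of the rejection rate.
    p_pred: predicted p(y=+1); y_test: targets in {+1, -1}."""
    p_pred, y_test = np.asarray(p_pred), np.asarray(y_test)
    confidence = np.maximum(p_pred, 1.0 - p_pred)    # maximum predictive probability
    wrong = np.where(p_pred > 0.5, 1, -1) != y_test
    order = np.argsort(confidence)                    # least confident cases first
    n = len(y_test)
    rates = np.arange(n) / n
    errors = np.array([wrong[order[r:]].mean() for r in range(n)])
    return rates, errors
```

Plotting `errors` against `rates` for each classifier gives curves like those in Figure 3.10(a); the test-information variant in panel (b) is obtained analogously by averaging the log predictive probability over the accepted cases.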
The GP classifiers applied in Figure 3.10 used the hyperparameters which optimized the approximate marginal likelihood for each of the two methods. For the GP classifiers there were two free parameters ($\sigma_f$ and $l$). The linear probit model (linear logistic models are probably more common, but we chose the probit here, since the other likelihood based methods all used probit) can be implemented as a GP model using Laplace's method, which is equivalent to (although not computationally as efficient as) iteratively reweighted least squares (IRLS).
Figure 3.10: Panel (a) shows the error-reject curve and panel (b) the amount of information about the test cases as a function of the rejection rate. The probabilistic one nearest neighbour (P1NN) method has much worse performance than the other methods. Gaussian processes with EP behave similarly to SVMs, although the classification rate of the SVM for low rejection rates seems to be a little better. Laplace's method is worse than EP and the SVM. The GP least-squares classifier (LSC) described in section 6.5 performs the best.
The covariance function $k(\mathbf x, \mathbf x') = \theta^2\,\mathbf x^\top\mathbf x'$ has a single hyperparameter, $\theta$, which was set by maximizing the log marginal likelihood. This gives $\log p(\mathbf y\mid X,\theta) = -105$ at $\theta = 2.0$; thus the marginal likelihood for the linear covariance function is about 6 units on a natural log scale lower than the maximum log marginal likelihood for the Laplace approximation using the squared exponential covariance function.
The support vector machine (SVM) classifier (see section 6.4 for further details on the SVM) used the same SE kernel as the GP classifiers. For the SVM the role of $l$ is identical, and the trade-off parameter $C$ in the SVM formulation (see eq. (6.37)) plays a similar role to $\sigma_f^2$. We carried out 5-fold cross validation on a grid in parameter space to identify the best combination of parameters w.r.t. the error rate; this turned out to be at $C = 1$, $l = 10$. Our experiments were conducted using the SVMTorch software [Collobert and Bengio, 2001]. In order to compute probabilistic predictions, we squashed the test activities through a cumulative Gaussian, using the method proposed by Platt [2000]: we made a parameterized linear transformation of the test activities and fed this through the cumulative Gaussian.¹⁸ The parameters of the linear transformation were chosen to maximize the log predictive probability, evaluated on the hold-out sets of the 5-fold cross validation.
The probabilistic one nearest neighbour (P1NN) method is a simple natural extension to the classical one nearest neighbour method which provides probabilistic predictions. It computes the leave-one-out (LOO) one nearest neighbour prediction on the training set, and records the fraction $\pi$ of cases where the LOO predictions were correct. On test cases, the method then predicts the one nearest neighbour class with probability $\pi$, and the other class with probability $1 - \pi$.
¹⁸ Platt [2000] used a logistic whereas we use a cumulative Gaussian.
Rejections are based on thresholding on the distance to the nearest neighbour.
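As a concrete reference, here is a small numpy sketch of the binary P1NN scheme just described; it uses squared Euclidean distances, labels in $\{+1,-1\}$, and function names of our own choosing.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between the rows of A and B."""
    return (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T

def p1nn(X_train, y_train, X_test):
    """Probabilistic one nearest neighbour (binary version)."""
    D_tr = sq_dists(X_train, X_train)
    np.fill_diagonal(D_tr, np.inf)                          # leave-one-out on the training set
    pi = np.mean(y_train[np.argmin(D_tr, axis=1)] == y_train)
    D_te = sq_dists(X_test, X_train)
    nn = np.argmin(D_te, axis=1)
    p_class1 = np.where(y_train[nn] == 1, pi, 1.0 - pi)     # p(y_* = +1) for each test case
    nn_dist = D_te[np.arange(len(X_test)), nn]               # used for rejection
    return p_class1, nn_dist
```

Rejection for the error-reject curve is then done by thresholding `nn_dist` rather than the (constant-confidence) predictive probabilities.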
The least-squares classifier (LSC) is described in section 6.5. In order to produce probabilistic predictions, the method of Platt [2000] was used as described above for the SVM (using the predictive means only; the predictive variances were ignored¹⁹), except that instead of the 5-fold cross validation, leave-one-out cross-validation (LOO-CV) was used, and the kernel parameters were also set using LOO-CV.
Figure 3.10 shows that the three best methods are the EP approximation for GPC, the SVM and the least-squares classifier (LSC). Presenting both the error rates and the test information helps to highlight differences which may not be apparent from a single plot alone. For example, Laplace's method and EP seem very similar on error rates, but quite different in test information. Notice also that the error-reject curve itself reveals interesting differences; e.g. notice that although the P1NN method has an error rate comparable to the other methods at zero rejections, things don't improve very much when rejections are allowed. Refer to section 3.8 for more discussion of the results.
3.7.4 10-class Handwritten Digit Classification Example
We apply the multi-class Laplace approximation developed in section 3.5 to the 10-class handwritten digit classification problem from the (repartitioned) USPS dataset, having $n = 4649$ training cases and $n = 4649$ cases for testing, see page 63. We used a squared exponential covariance function with two hyperparameters: a single signal amplitude $\sigma_f$, common to all 10 latent functions, and a single lengthscale parameter $l$, common to all 10 latent functions and common to all 256 input dimensions.
The behaviour of the method was investigated on a grid of values for the hyperparameters, see Figure 3.11. Note that the correspondence between the log marginal likelihood and the test information is not as close as for Laplace's method for binary classification in Figure 3.7 on page 64. The maximum value of the log marginal likelihood attained is $-1018$, and for the hyperparameters corresponding to this point the error rate is 3.1% and the test information 2.67 bits. As with the binary classification problem, the test information is standardized by subtracting off the negative entropy of the targets, which is $-3.27$ bits. The classification error rate in Figure 3.11(c) shows a clear minimum, and this is also attained at a shorter lengthscale than where the marginal likelihood and test information have their maxima. This effect was also seen in the experiments on binary classification.
To gain some insight into the level of performance we compared these results with those obtained with the probabilistic one nearest neighbour method (P1NN), a multiple logistic regression model and an SVM. The P1NN first uses an internal leave-one-out assessment on the training set to estimate its probability of being correct, $\pi$.
¹⁹ Of course, one could also have tried a variant where the full latent predictive distribution was averaged over, but we did not do that here.
Figure 3.11: 10-way digit classification using the Laplace approximation. Panel (a) shows the approximate log marginal likelihood, reaching a maximum value of $\log p(\mathbf y\mid X,\theta) = -1018$ at $\log(l) = 2.35$ and $\log(\sigma_f) = 2.6$. In panel (b) information about the test cases is shown. The maximum possible amount of information about the test targets, corresponding to perfect classification, would be 3.27 bits (the entropy of the targets). At the point of maximum marginal likelihood, the test information is 2.67 bits. In panel (c) the test set misclassification rate is shown in percent. At the point of maximum marginal likelihood the test error rate is 3.1%.
For the test set it then predicts the nearest neighbour with probability $\pi$ and all other classes with equal probability $(1-\pi)/9$. We obtained $\pi = 0.967$, a test information of 2.98 bits and a test set classification error rate of 3.0%.
We also compare to multiple linear logistic regression. One way to implement this method is to view it as a Gaussian process with a linear covariance
function, although it is equivalent and computationally more efficient to do the Laplace approximation over the weights of the linear model. In our case there are $10 \times 257$ weights (256 inputs and one bias), whereas there are $10 \times 4649$ latent function values in the GP. The linear covariance function $k(\mathbf x, \mathbf x') = \theta^2\,\mathbf x^\top\mathbf x'$ has a single hyperparameter $\theta$, used for all 10 latent functions. Optimizing the log marginal likelihood w.r.t. $\theta$ gives $\log p(\mathbf y\mid X,\theta) = -1339$ at $\theta = 1.45$. Using this value for the hyperparameter, the test information is 2.95 bits and the test set error rate is 5.7%.
Finally, a support vector machine (SVM) classifier was trained using the same SE kernel as the Gaussian process classifiers. (See section 6.4 for further details on the SVM.) As in the binary SVM case there were two free parameters: $l$ (the lengthscale of the kernel), and the trade-off parameter $C$ (see eq. (6.37)), which plays a similar role to $\sigma_f^2$. We carried out 5-fold cross-validation on a grid in parameter space to identify the best combination of parameters w.r.t. the error rate; this turned out to be at $C = 1$, $l = 5$. Our experiments were conducted using the SVMTorch software [Collobert and Bengio, 2001], which implements multi-class SVM classification using the one-versus-rest method described in section 6.5. The test set error rate for the SVM is 2.2%; we did not attempt to evaluate the test information for the multi-class SVM.
3.8 Discussion
In the previous section we presented several sets of experiments comparing the two approximate methods for inference in GPC models, and comparing them to other commonly-used supervised learning methods. In this section we discuss the results and attempt to relate them to the properties of the models.
For the binary examples from Figures 3.7 and 3.8, we saw that the two approximations showed quite different qualitative behaviour of the approximated log marginal likelihood, although the exact marginal likelihood is of course identical. The EP approximation gave a higher maximum value of the log marginal likelihood (by about 9 units on the log scale) and the test information was somewhat better than for Laplace's method, although the test set error rates were comparable. However, although this experiment seems to favour the EP approximation, it is interesting to know how close these approximations are to the exact (analytically intractable) solutions. In Figure 3.12 we show the results of running a sophisticated Markov chain Monte Carlo method called Annealed Importance Sampling [Neal, 2001] carried out by Kuss and Rasmussen [2005]. The USPS dataset for these experiments was identical to the one used in Figures 3.7 and 3.8, so the results are directly comparable. It is seen that the MCMC results indicate that the EP method achieves a very high level of accuracy, i.e. that the difference between EP and Laplace's method is caused almost exclusively by approximation errors in Laplace's method.
The main reason for the inaccuracy of Laplace's method is that the high-dimensional posterior is skew, and that the symmetric approximation centered on the mode is not characterizing the posterior volume very well.
Figure 3.12: The log marginal likelihood, panel (a), and test information, panel (b), for the USPS 3s vs. 5s binary classification task computed using Markov chain Monte Carlo (MCMC). Comparing this to the Laplace approximation in Figure 3.7 and to Figure 3.8 shows that the EP approximation is surprisingly accurate. The slight wiggliness of the contour lines is caused by finite sample effects in the MCMC runs.
The posterior is a combination of the correlated Gaussian prior centered on the origin and the likelihood terms which (softly) cut off half-spaces which do not agree with the training set labels. Therefore the posterior looks like a correlated Gaussian restricted to the orthant which agrees with the labels. Its mode will be located close to the origin in that orthant, and it will decrease rapidly in the direction towards the origin due to conflicts from the likelihood terms, and decrease only slowly in the opposite direction (because of the prior). Seen in this light it is not surprising that the Laplace approximation is somewhat inaccurate. This explanation is corroborated further by Kuss and Rasmussen [2005].
It should be noted that all the methods compared on the binary digits classification task (except for the linear probit model) are using the squared distance between the digitized digit images, measured directly in the image space, as the sole input to the algorithm. This distance measure is not very well suited for the digit discrimination task: for example, two similar images that are slight translations of each other may have a huge squared distance, although of course identical labels. One of the strengths of the GP formalism is that one can use prior distributions over latent (in this case) functions, and do inference based on these. If however, the prior over functions depends only on one particular aspect of the data (the squared distance in image space) which is not so well suited for discrimination, then the prior used is also not very appropriate. It would be more interesting to design covariance functions (parameterized by hyperparameters) which are more appropriate for the digit discrimination task, e.g. reflecting the known invariances in the images, such as the tangent-distance ideas from Simard et al. [1992]; see also Schölkopf and Smola [2002, ch. 11] and section 9.10. The results shown here follow the common approach of using a generic
covariance function with a minimum of hyperparameters, but this doesn't allow us to incorporate much prior information about the problem. For an example in the GP framework of doing inference about multiple hyperparameters with more complex covariance functions which provide clearly interpretable information about the data, see the carbon dioxide modelling problem discussed on page 118.
3.9 Appendix: Moment Derivations
Consider the integral of a cumulative Gaussian, $\Phi$, with respect to a Gaussian
\[ Z \;=\; \int_{-\infty}^{\infty} \Phi\Big(\frac{x-m}{v}\Big)\, \mathcal N(x \mid \mu, \sigma^2)\, dx, \qquad\text{where}\quad \Phi(z) = \int_{-\infty}^{z} \mathcal N(y)\, dy, \tag{3.77} \]
initially for the special case $v > 0$. Writing out $\Phi$ in full, substituting $z = y - x$ and $w = x - \mu$ and interchanging the order of the integrals
\[ Z_{v>0} \;=\; \frac{1}{2\pi v\sigma}\int_{-\infty}^{\infty}\int_{-\infty}^{x} \exp\Big(-\frac{(y-m)^2}{2v^2} - \frac{(x-\mu)^2}{2\sigma^2}\Big)\, dy\, dx, \tag{3.78} \]
or in matrix notation
\[ Z_{v>0} \;=\; \frac{1}{2\pi v\sigma}\int_{-\infty}^{0}\int_{-\infty}^{\infty} \exp\Big(-\tfrac{1}{2} \begin{bmatrix} z-(m-\mu) \\ w \end{bmatrix}^{\!\top} \begin{bmatrix} \tfrac{1}{v^2} & \tfrac{1}{v^2} \\[2pt] \tfrac{1}{v^2} & \tfrac{1}{v^2}+\tfrac{1}{\sigma^2} \end{bmatrix} \begin{bmatrix} z-(m-\mu) \\ w \end{bmatrix}\Big)\, dw\, dz \;=\; \int_{-\infty}^{0}\int_{-\infty}^{\infty} \mathcal N\!\left( \begin{bmatrix} z \\ w \end{bmatrix} \,\middle|\, \begin{bmatrix} m-\mu \\ 0 \end{bmatrix}, \begin{bmatrix} v^2+\sigma^2 & -\sigma^2 \\ -\sigma^2 & \sigma^2 \end{bmatrix} \right) dw\, dz, \tag{3.79} \]
i.e. an incomplete integral over a joint Gaussian. The inner integral corresponds to marginalizing over $w$ (see eq. (A.6)), yielding
\[ Z_{v>0} \;=\; \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi(v^2+\sigma^2)}}\exp\Big(-\frac{\big(z-(m-\mu)\big)^2}{2(v^2+\sigma^2)}\Big)\, dz \;=\; \Phi\Big(\frac{\mu-m}{\sqrt{v^2+\sigma^2}}\Big), \tag{3.80} \]
which assumed $v > 0$. If $v$ is negative, we can substitute the symmetry $\Phi(-z) = 1 - \Phi(z)$ into eq. (3.77) to get
\[ Z_{v<0} \;=\; 1 - \Phi\Big(\frac{\mu-m}{\sqrt{v^2+\sigma^2}}\Big) \;=\; \Phi\Big(\frac{m-\mu}{\sqrt{v^2+\sigma^2}}\Big). \tag{3.81} \]
Collecting the two cases, eq. (3.80) and eq. (3.81), we arrive at
\[ Z \;=\; \int_{-\infty}^{\infty}\Phi\Big(\frac{x-m}{v}\Big)\,\mathcal N(x \mid \mu,\sigma^2)\, dx \;=\; \Phi(z), \qquad\text{where}\quad z = \frac{\mu-m}{v\sqrt{1+\sigma^2/v^2}}, \tag{3.82} \]
for general $v \neq 0$. We wish to compute the moments of
\[ q(x) \;\triangleq\; Z^{-1}\,\Phi\Big(\frac{x-m}{v}\Big)\,\mathcal N(x \mid \mu, \sigma^2), \tag{3.83} \]
where $Z$ is given in eq. (3.82). Perhaps the easiest way to do this is to differentiate w.r.t. $\mu$ on both sides of eq. (3.82)
\[ \frac{\partial Z}{\partial\mu} \;=\; \int_{-\infty}^{\infty} \frac{x-\mu}{\sigma^2}\,\Phi\Big(\frac{x-m}{v}\Big)\,\mathcal N(x \mid \mu,\sigma^2)\, dx \;=\; \frac{\partial\Phi(z)}{\partial\mu} \]
\[ \Longleftrightarrow\qquad \frac{1}{\sigma^2}\int_{-\infty}^{\infty} x\,\Phi\Big(\frac{x-m}{v}\Big)\,\mathcal N(x \mid \mu,\sigma^2)\, dx - \frac{\mu}{\sigma^2}Z \;=\; \frac{\mathcal N(z)}{v\sqrt{1+\sigma^2/v^2}}, \tag{3.84} \]
where we have used $\partial\Phi(z)/\partial z = \mathcal N(z)$. We recognize the first term in the integral in the top line of eq. (3.84) as $Z/\sigma^2$ times the first moment of $q$ which we are seeking. Multiplying through by $\sigma^2/Z$ and rearranging we obtain
\[ \mathbb E_q[x] \;=\; \mu + \frac{\sigma^2\,\mathcal N(z)}{\Phi(z)\,v\sqrt{1+\sigma^2/v^2}}. \tag{3.85} \]
Similarly, the second moment can be obtained by differentiating eq. (3.82) twice
\[ \frac{\partial^2 Z}{\partial\mu^2} \;=\; \int_{-\infty}^{\infty} \Big(\frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\Big)\,\Phi\Big(\frac{x-m}{v}\Big)\,\mathcal N(x \mid \mu,\sigma^2)\, dx \;=\; -\,\frac{z\,\mathcal N(z)}{v^2\big(1+\sigma^2/v^2\big)} \]
\[ \Longleftrightarrow\qquad \mathbb E_q[x^2] \;=\; 2\mu\,\mathbb E_q[x] - \mu^2 + \sigma^2 - \frac{\sigma^4\, z\,\mathcal N(z)}{\Phi(z)\,(v^2+\sigma^2)}, \tag{3.86} \]
where the first and second terms of the integral in the top line of eq. (3.86) are multiples of the first and second moments. The second central moment, after reintroducing eq. (3.85) into eq. (3.86) and simplifying, is given by
\[ \mathbb E_q\big[(x - \mathbb E_q[x])^2\big] \;=\; \mathbb E_q[x^2] - \big(\mathbb E_q[x]\big)^2 \;=\; \sigma^2 - \frac{\sigma^4\,\mathcal N(z)}{(v^2+\sigma^2)\,\Phi(z)}\Big(z + \frac{\mathcal N(z)}{\Phi(z)}\Big). \tag{3.87} \]
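The closed forms (3.82), (3.85) and (3.87) are easy to check against numerical integration, which is a useful sanity test when implementing the EP moment computations. The following sketch (our own, using scipy quadrature) compares the two for one arbitrary setting of $m$, $v$, $\mu$ and $\sigma$ with $v > 0$.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

m, v, mu, sigma = 0.3, 1.5, -0.7, 2.0
unnorm = lambda x: norm.cdf((x - m) / v) * norm.pdf(x, mu, sigma)

# moments by numerical integration
Z, _ = quad(unnorm, -np.inf, np.inf)
m1, _ = quad(lambda x: x * unnorm(x), -np.inf, np.inf)
m2, _ = quad(lambda x: x**2 * unnorm(x), -np.inf, np.inf)
mean_num, var_num = m1 / Z, m2 / Z - (m1 / Z)**2

# closed forms, eqs. (3.82), (3.85) and (3.87)
z = (mu - m) / (v * np.sqrt(1.0 + sigma**2 / v**2))
ratio = norm.pdf(z) / norm.cdf(z)
Z_cf = norm.cdf(z)
mean_cf = mu + sigma**2 * ratio / (v * np.sqrt(1.0 + sigma**2 / v**2))
var_cf = sigma**2 - sigma**4 * ratio * (z + ratio) / (v**2 + sigma**2)

print(np.allclose([Z, mean_num, var_num], [Z_cf, mean_cf, var_cf]))   # True
```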
3.10 Exercises
1. For binary GPC, show the equivalence of using a noise-free latent process combined with a probit likelihood and a latent process with Gaussian noise combined with a step-function likelihood. Hint: introduce explicitly additional noisy latent variables $\tilde f_i$, which differ from $f_i$ by Gaussian noise. Write down the step-function likelihood for a single case as a function of $\tilde f_i$, integrate out the noisy variable, to arrive at the probit likelihood as a function of the noise-free process.
2. Consider a multinomial random variable $\mathbf y$ having $C$ states, with $y_c = 1$ if the variable is in state $c$, and 0 otherwise. State $c$ occurs with probability $\pi_c$. Show that $\mathrm{cov}(\mathbf y) = \mathbb E\big[(\mathbf y - \boldsymbol\pi)(\mathbf y - \boldsymbol\pi)^\top\big] = \mathrm{diag}(\boldsymbol\pi) - \boldsymbol\pi\boldsymbol\pi^\top$. Observe that $\mathrm{cov}(\mathbf y)$, being a covariance matrix, must necessarily be positive semidefinite. Using this fact show that the matrix $W = \mathrm{diag}(\boldsymbol\pi) - \Pi\Pi^\top$ from eq. (3.38) is positive semidefinite. By showing that the vector of all ones is an eigenvector of $\mathrm{cov}(\mathbf y)$ with eigenvalue zero, verify that the matrix is indeed positive semidefinite, and not positive definite. (See section 4.1 for definitions of positive semidefinite and positive definite matrices.)
Figure 3.13: The decision regions for the three-class softmax function in $z_2$–$z_3$ space.
3. Consider the 3-class softmax function
\[ p(\mathcal C_c) \;=\; \frac{\exp(f_c)}{\exp(f_1)+\exp(f_2)+\exp(f_3)}, \]
where $c = 1, 2, 3$ and $f_1, f_2, f_3$ are the corresponding activations. To more easily visualize the decision boundaries, let $z_2 = f_2 - f_1$ and $z_3 = f_3 - f_1$. Thus
\[ p(\mathcal C_1) \;=\; \frac{1}{1 + \exp(z_2) + \exp(z_3)}, \tag{3.88} \]
and similarly for the other classes. The decision boundary relating to $p(\mathcal C_1) = 1/3$ is the curve $\exp(z_2) + \exp(z_3) = 2$. The decision regions for the three classes are illustrated in Figure 3.13. Let $\mathbf f = (f_1, f_2, f_3)^\top$ have a Gaussian distribution centered on the origin, and let $\boldsymbol\pi(\mathbf f) = \mathrm{softmax}(\mathbf f)$. We now consider the effect of this distribution on $\bar{\boldsymbol\pi} = \int \boldsymbol\pi(\mathbf f)\, p(\mathbf f)\, d\mathbf f$. For a Gaussian with given covariance structure this integral is easily approximated by drawing samples from $p(\mathbf f)$. Show that the classification can be made to fall into any of the three categories depending on the covariance matrix. Thus, by considering displacements of the mean of the Gaussian by some amount from the origin into each of the three regions, we have shown that overall classification depends not only on the mean of the Gaussian but also on its covariance. Show that this conclusion is still valid when it is recalled that $\mathbf z$ is derived from $\mathbf f$ as $\mathbf z = T\mathbf f$, where
\[ T \;=\; \begin{pmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix}, \]
so that $\mathrm{cov}(\mathbf z) = T\,\mathrm{cov}(\mathbf f)\,T^\top$.
4. Consider the update equation for $\mathbf f^{\mathrm{new}}$ given by eq. (3.18) when some of the training points are well-explained under $\mathbf f$, so that $t_i \simeq \pi_i$ and $W_{ii} \simeq 0$
for these points. Break $\mathbf f$ into two subvectors, $\mathbf f_1$ that corresponds to points that are not well-explained, and $\mathbf f_2$ to those that are. Re-write $(K^{-1}+W)^{-1}$ from eq. (3.18) as $K(I+WK)^{-1}$ and let $K$ be partitioned as $K_{11}$, $K_{12}$, $K_{21}$, $K_{22}$ and similarly for the other matrices. Using the partitioned matrix inverse equations (see section A.3) show that
\[ \mathbf f_1^{\mathrm{new}} \;=\; K_{11}\big(I + W_{11}K_{11}\big)^{-1}\big(W_{11}\mathbf f_1 + \nabla\log p(\mathbf y_1 \mid \mathbf f_1)\big), \qquad \mathbf f_2^{\mathrm{new}} \;=\; K_{21}K_{11}^{-1}\,\mathbf f_1^{\mathrm{new}}. \tag{3.89} \]
See section 3.4.1 for the consequences of this result.
5. Show that the expressions in eq. (3.56) for the cavity mean $\mu_{-i}$ and variance $\sigma_{-i}^2$ do not depend on the approximate likelihood terms $\tilde\mu_i$ and $\tilde\sigma_i^2$ for the corresponding case, despite the appearance of eq. (3.56).
6. Consider the USPS 3s vs. 5s prediction problem discussed in section 3.7.3. Use the implementation of the Laplace binary GPC provided to investigate how $\mathbf f$ and the predictive probabilities etc. vary as functions of $\log(l)$ and $\log(\sigma_f)$.