
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design
Niranjan Srinivas (niranjan@caltech.edu)
Andreas Krause (krausea@caltech.edu)
California Institute of Technology, Pasadena, CA, USA
Sham Kakade (skakade@wharton.upenn.edu)
University of Pennsylvania, Philadelphia, PA, USA
Matthias Seeger (mseeger@mmci.uni-saarland.de)
Saarland University, Saarbrücken, Germany

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).
Abstract

Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches.
1. Introduction

In most stochastic optimization settings, evaluating the unknown function is expensive, and sampling is to be minimized. Examples include choosing advertisements in sponsored search to maximize profit in a click-through model (Pandey & Olston, 2007) or learning optimal control strategies for robots (Lizotte et al., 2007). Predominant approaches to this problem include the multi-armed bandit paradigm (Robbins, 1952), where the goal is to maximize cumulative reward by optimally balancing exploration and exploitation, and experimental design (Chaloner & Verdinelli, 1995), where the function is to be explored globally with as few evaluations
as possible, for example by maximizing information gain. The challenge in both approaches is twofold: we have to estimate an unknown function $f$ from noisy samples, and we must optimize our estimate over some high-dimensional input space. For the former, much progress has been made in machine learning through kernel methods and Gaussian process (GP) models (Rasmussen & Williams, 2006), where smoothness assumptions about $f$ are encoded through the choice of kernel in a flexible nonparametric fashion. Beyond Euclidean spaces, kernels can be defined on diverse domains such as spaces of graphs, sets, or lists.
We are concerned with GP optimization in the multi-armed bandit setting, where $f$ is sampled from a GP distribution or has low "complexity", measured in terms of its RKHS norm under some kernel. We provide the first sublinear regret bounds in this nonparametric setting, which imply convergence rates for GP optimization. In particular, we analyze the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm, a simple and intuitive Bayesian method (Auer et al., 2002; Auer, 2002; Dani et al., 2008). While objectives are different in the multi-armed bandit and experimental design paradigms, our results draw a close technical connection between them: our regret bounds come in terms of an information gain quantity, measuring how fast $f$ can be learned in an information theoretic sense. The submodularity of this function allows us to prove sharp regret bounds for particular covariance functions, which we demonstrate for commonly used Squared Exponential and Matérn kernels.
Related Work. Our work generalizes stochastic linear optimization in a bandit setting, where the unknown function comes from a finite-dimensional linear space. GPs are nonlinear random functions, which can be represented in an infinite-dimensional linear space. For the standard linear setting, Dani et al. (2008) provide a near-complete characterization explicitly dependent on the dimensionality. In the GP setting, the challenge is to characterize complexity in a different manner, through properties of the kernel function. Our technical contributions are twofold: first, we show how to analyze the nonlinear setting by focusing on the concept of information gain, and second, we explicitly bound this information gain measure using the concept of submodularity (Nemhauser et al., 1978) and knowledge about kernel operator spectra.
Kleinberg et al. (2008) provide regret bounds under weaker and less configurable assumptions (only Lipschitz-continuity w.r.t. a metric is assumed); Bubeck et al. (2008) consider arbitrary topological spaces. These bounds, however, degrade rapidly with the dimensionality of the problem, scaling as $T^{\frac{d+1}{d+2}}$. In practice, linearity w.r.t. a fixed basis is often too stringent an assumption, while Lipschitz-continuity can be too coarse-grained, leading to poor rate bounds. Adopting GP assumptions, we can model levels of smoothness in a fine-grained way. For example, our rates for the frequently used Squared Exponential kernel, enforcing a high degree of smoothness, have weak dependence on the dimensionality: $O^*(\sqrt{T(\log T)^{d+1}})$ (see Figure 1).
There is a large literature on GP (response surface) optimization. Several heuristics for trading off exploration and exploitation in GP optimization have been proposed, such as Expected Improvement (Mockus et al., 1978) and Most Probable Improvement (Mockus, 1989), and successfully applied in practice (cf. Lizotte et al., 2007). Brochu et al. (2009) provide a comprehensive review of and motivation for Bayesian optimization using GPs. The Efficient Global Optimization (EGO) algorithm for optimizing expensive black-box functions is proposed by Jones et al. (1998) and extended to GPs by Huang et al. (2006). Little is known about the theoretical performance of GP optimization. While convergence of EGO is established by Vazquez & Bect (2007), convergence rates have remained elusive. Grünewälder et al. (2010) consider the pure exploration problem for GPs, where the goal is to find the optimal decision over $T$ rounds, rather than maximize cumulative reward (there is no exploration-exploitation dilemma). They provide sharp bounds for this exploration problem; note that this methodology would not lead to bounds for minimizing the cumulative regret. Our cumulative regret bounds translate to the first performance guarantees (rates) for GP optimization.
Summary. Our main contributions are:

• We analyze GP-UCB, an intuitive algorithm for GP optimization, when the function is either sampled from a known GP, or has low RKHS norm.

• We bound the cumulative regret for GP-UCB in terms of the information gain due to sampling, establishing a novel connection between experimental design and GP optimization.

• By bounding the information gain for popular classes of kernels, we establish sublinear regret bounds for GP optimization for the first time. Our bounds depend on kernel choice and parameters in a fine-grained fashion.

• We evaluate GP-UCB on sensor network data, demonstrating that it compares favorably to existing algorithms for GP optimization.

Figure 1. Our regret bounds (up to polylog factors) for linear, radial basis, and Matérn kernels; $d$ is the dimension, $T$ is the time horizon, and $\nu$ is a Matérn parameter.

Kernel            Regret $R_T$
Linear            $d\sqrt{T}$
RBF               $\sqrt{T}(\log T)^{d+1}$
Matérn ($\nu>1$)  $T^{\frac{\nu + d(d+1)}{2\nu + d(d+1)}}$
2. Problem Statement and Background
Consider the problem of sequentially optimizing an unknown reward function $f : D \to \mathbb{R}$: in each round $t$, we choose a point $x_t \in D$ and get to see the function value there, perturbed by noise: $y_t = f(x_t) + \epsilon_t$. Our goal is to maximize the sum of rewards $\sum_{t=1}^{T} f(x_t)$, thus to perform essentially as well as $x^* = \operatorname{argmax}_{x \in D} f(x)$, as rapidly as possible. For example, we might want to find locations of highest temperature in a building by sequentially activating sensors in a spatial network and regressing on their measurements. Here $D$ consists of all sensor locations, $f(x)$ is the temperature at $x$, and sensor accuracy is quantified by the noise variance. Each activation draws battery power, so we want to sample from as few sensors as possible.
Regret. A natural performance metric in this context is cumulative regret, the loss in reward due to not knowing $f$'s maximum points beforehand. Suppose the unknown function is $f$, its maximum point¹ $x^* = \operatorname{argmax}_{x \in D} f(x)$. For our choice $x_t$ in round $t$, we incur instantaneous regret $r_t = f(x^*) - f(x_t)$. The cumulative regret $R_T$ after $T$ rounds is the sum of instantaneous regrets: $R_T = \sum_{t=1}^{T} r_t$. A desirable asymptotic property of an algorithm is to be no-regret: $\lim_{T \to \infty} R_T / T = 0$. Note that neither $r_t$ nor $R_T$ are ever revealed to the algorithm. Bounds on the average regret $R_T / T$ translate to convergence rates for GP optimization: the maximum $\max_{t \le T} f(x_t)$ in the first $T$ rounds is no further from $f(x^*)$ than the average.

¹ $x^*$ need not be unique; only $f(x^*)$ enters the regret.
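To make the bookkeeping concrete, here is a minimal NumPy sketch (ours, not part of the paper) that computes instantaneous regret, cumulative regret, and the average regret $R_T/T$ for an arbitrary sequence of choices over a finite decision set; the names `f_values` and `chosen_idx` are hypothetical placeholders.

```python
import numpy as np

def regret_curve(f_values, chosen_idx):
    """Instantaneous and cumulative regret for a sequence of choices.

    f_values:   array of f(x) over a finite decision set D
    chosen_idx: indices x_1, ..., x_T selected by some algorithm
    """
    f_max = np.max(f_values)               # f(x*), best attainable value
    r = f_max - f_values[chosen_idx]       # r_t = f(x*) - f(x_t)
    R = np.cumsum(r)                       # R_T = sum_{t<=T} r_t
    avg = R / np.arange(1, len(r) + 1)     # R_T / T; no-regret iff this -> 0
    return r, R, avg
```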
2.1. Gaussian Processes and RKHSs
Gaussian Processes. Some assumptions on $f$ are required to guarantee no-regret. While rigid parametric assumptions such as linearity may not hold in practice, a certain degree of smoothness is often warranted. In our sensor network, temperature readings at close-by locations are highly correlated (see Figure 2(a)). We can enforce implicit properties like smoothness without relying on any parametric assumptions, modeling $f$ as a sample from a Gaussian process (GP): a collection of dependent random variables, one for each $x \in D$, every finite subset of which is multivariate Gaussian distributed in an overall consistent way (Rasmussen & Williams, 2006). A $GP(\mu(x), k(x,x'))$ is specified by its mean function $\mu(x) = \mathbb{E}[f(x)]$ and covariance (or kernel) function $k(x,x') = \mathbb{E}[(f(x) - \mu(x))(f(x') - \mu(x'))]$. For GPs not conditioned on data, we assume² that $\mu \equiv 0$. Moreover, we restrict $k(x,x) \le 1$ for all $x \in D$, i.e., we assume bounded variance. By fixing the correlation behavior, the covariance function $k$ encodes smoothness properties of sample functions $f$ drawn from the GP. A range of commonly used kernel functions is given in Section 5.2.
In this work, GPs play multiple roles. First, some of our results hold when the unknown target function is a sample from a known GP distribution $GP(0, k(x,x'))$. Second, the Bayesian algorithm we analyze generally uses $GP(0, k(x,x'))$ as prior distribution over $f$. A major advantage of working with GPs is the existence of simple analytic formulae for mean and covariance of the posterior distribution, which allows easy implementation of algorithms. For a noisy sample $\mathbf{y}_T = [y_1 \ldots y_T]^T$ at points $A_T = \{x_1, \ldots, x_T\}$, $y_t = f(x_t) + \epsilon_t$ with $\epsilon_t \sim N(0, \sigma^2)$ i.i.d. Gaussian noise, the posterior over $f$ is a GP distribution again, with mean $\mu_T(x)$, covariance $k_T(x,x')$ and variance $\sigma_T^2(x)$:

$\mu_T(x) = \mathbf{k}_T(x)^T (\mathbf{K}_T + \sigma^2 I)^{-1} \mathbf{y}_T,$   (1)

$k_T(x,x') = k(x,x') - \mathbf{k}_T(x)^T (\mathbf{K}_T + \sigma^2 I)^{-1} \mathbf{k}_T(x'),$

$\sigma_T^2(x) = k_T(x,x),$   (2)

where $\mathbf{k}_T(x) = [k(x_1,x), \ldots, k(x_T,x)]^T$ and $\mathbf{K}_T$ is the positive definite kernel matrix $[k(x,x')]_{x,x' \in A_T}$.
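Equations (1) and (2) can be implemented directly. The following is a minimal NumPy sketch (ours, not from the paper), assuming a generic kernel function `k(a, b)`; for numerical stability it solves a linear system rather than forming the matrix inverse.

```python
import numpy as np

def gp_posterior(k, X, y, x_star, noise_var):
    """Posterior mean and variance of a GP(0, k) at x_star, Eqs. (1)-(2).

    k: kernel function k(a, b); X: sampled points x_1..x_T;
    y: observed values y_1..y_T; noise_var: sigma^2.
    """
    T = len(X)
    K = np.array([[k(a, b) for b in X] for a in X])          # K_T
    k_vec = np.array([k(a, x_star) for a in X])              # k_T(x)
    A = K + noise_var * np.eye(T)                            # K_T + sigma^2 I
    mu = k_vec @ np.linalg.solve(A, y)                       # Eq. (1)
    var = k(x_star, x_star) - k_vec @ np.linalg.solve(A, k_vec)  # Eq. (2)
    return mu, var
```

In practice one would factor $\mathbf{K}_T + \sigma^2 I$ once (e.g., by Cholesky decomposition) and reuse it across test points (Rasmussen & Williams, 2006).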
RKHS. Instead of the Bayesian case, where $f$ is sampled from a GP prior, we also consider the more agnostic case where $f$ has low complexity as measured under an RKHS norm, with distribution-free assumptions on the noise process. The notion of reproducing kernel Hilbert spaces (RKHS, Wahba 1990) is intimately related to GPs and their covariance functions $k(x,x')$. The RKHS $\mathcal{H}_k(D)$ is a complete subspace of $L_2(D)$ of nicely behaved functions, with an inner product $\langle \cdot, \cdot \rangle_k$ obeying the reproducing property: $\langle f, k(x, \cdot) \rangle_k = f(x)$ for all $f \in \mathcal{H}_k(D)$. The induced RKHS norm $\|f\|_k = \sqrt{\langle f, f \rangle_k}$ measures smoothness of $f$ w.r.t. $k$: in much the same way as $k_1$ would generate smoother samples than $k_2$ as GP covariance functions, $\|\cdot\|_{k_1}$ assigns larger penalties than $\|\cdot\|_{k_2}$. $\langle \cdot, \cdot \rangle_k$ can be extended to all of $L_2(D)$, in which case $\|f\|_k < \infty$ iff $f \in \mathcal{H}_k(D)$. For most kernels discussed in Section 5.2, members of $\mathcal{H}_k(D)$ can uniformly approximate any continuous function on any compact subset of $D$.

² This is w.l.o.g. (Rasmussen & Williams, 2006).
2.2. Information Gain and Experimental Design
One approach to maximizing $f$ is to first choose points $x_t$ so as to estimate the function globally well, then play the maximum point of our estimate. How can we learn about $f$ as rapidly as possible? This question comes down to Bayesian Experimental Design (henceforth ED; see Chaloner & Verdinelli 1995), where the informativeness of a set of sampling points $A \subset D$ about $f$ is measured by the information gain (cf. Cover & Thomas 1991), which is the mutual information between $f$ and observations $\mathbf{y}_A = f_A + \epsilon_A$ at these points:

$I(\mathbf{y}_A; f) = H(\mathbf{y}_A) - H(\mathbf{y}_A \mid f),$   (3)

quantifying the reduction in uncertainty about $f$ from revealing $\mathbf{y}_A$. Here, $f_A = [f(x)]_{x \in A}$ and $\epsilon_A \sim N(0, \sigma^2 I)$. For a Gaussian, $H(N(\mu, \Sigma)) = \frac{1}{2} \log |2\pi e \Sigma|$, so that in our setting $I(\mathbf{y}_A; f) = I(\mathbf{y}_A; f_A) = \frac{1}{2} \log |I + \sigma^{-2} K_A|$, where $K_A = [k(x,x')]_{x,x' \in A}$.
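In the Gaussian case the information gain is just a log-determinant, so it is cheap to evaluate for moderate $|A|$. A small sketch (ours, not from the paper), using `slogdet` for numerical robustness:

```python
import numpy as np

def information_gain(K_A, noise_var):
    """I(y_A; f_A) = 0.5 * log det(I + sigma^-2 K_A), the Gaussian form of Eq. (3).

    K_A: covariance matrix [k(x, x')] over the sampled set A.
    """
    T = K_A.shape[0]
    # slogdet is numerically safer than log(det(...)) for larger T
    sign, logdet = np.linalg.slogdet(np.eye(T) + K_A / noise_var)
    return 0.5 * logdet
```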
While finding the information gain maximizer among $A \subset D$, $|A| \le T$ is NP-hard (Ko et al., 1995), it can be approximated by an efficient greedy algorithm. If $F(A) = I(\mathbf{y}_A; f)$, this algorithm picks $x_t = \operatorname{argmax}_{x \in D} F(A_{t-1} \cup \{x\})$ in round $t$, which can be shown to be equivalent to

$x_t = \operatorname{argmax}_{x \in D} \sigma_{t-1}(x),$   (4)

where $A_{t-1} = \{x_1, \ldots, x_{t-1}\}$. Importantly, this simple algorithm is guaranteed to find a near-optimal solution: for the set $A_T$ obtained after $T$ rounds, we have that

$F(A_T) \ge (1 - 1/e) \max_{|A| \le T} F(A),$   (5)

at least a constant fraction of the optimal information gain value. This is because $F(A)$ satisfies a diminishing returns property called submodularity (Krause & Guestrin, 2005), and the greedy approximation guarantee (5) holds for any submodular function (Nemhauser et al., 1978).

While sequentially optimizing Eq. (4) is a provably good way to explore $f$ globally, it is not well suited for function optimization. For the latter, we only need to identify points $x$ where $f(x)$ is large, in order to concentrate sampling there as rapidly as possible, thus exploit our knowledge about maxima. In fact, the ED rule (4) does not even depend on observations $y_t$ obtained along the way. Nevertheless, the maximum information gain after $T$ rounds will play a prominent role in our regret bounds, forging an important connection between GP optimization and experimental design.
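Since rule (4) is independent of the observations, the entire design can be computed offline from the kernel alone. A minimal sketch (ours) of the greedy ED rule on a finite decision set, using the standard rank-one posterior covariance update after each noisy observation:

```python
import numpy as np

def greedy_ed_design(K_D, noise_var, T):
    """Greedy experimental design rule (4): pick argmax of posterior std.

    K_D: kernel matrix over a finite decision set D. Since rule (4) never
    looks at observations y_t, the whole design can be computed offline.
    """
    chosen = []
    K_post = K_D.copy()                         # posterior covariance over D
    for _ in range(T):
        i = int(np.argmax(np.diag(K_post)))     # x_t = argmax sigma_{t-1}(x)
        chosen.append(i)
        k_i = K_post[:, i]
        # rank-one posterior update after observing at point i with noise sigma^2
        K_post = K_post - np.outer(k_i, k_i) / (K_post[i, i] + noise_var)
    return chosen
```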
3. GP-UCB Algorithm
For sequential optimization, the ED rule (4) can be wasteful: it aims at decreasing uncertainty globally, not just where maxima might be. Another idea is to pick points as $x_t = \operatorname{argmax}_{x \in D} \mu_{t-1}(x)$, maximizing the expected reward based on the posterior so far. However, this rule is too greedy too soon and tends to get stuck in shallow local optima. A combined strategy is to choose

$x_t = \operatorname{argmax}_{x \in D} \; \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x),$   (6)

where $\beta_t$ are appropriate constants. This latter objective prefers both points $x$ where $f$ is uncertain (large $\sigma_{t-1}$) and such where we expect to achieve high rewards (large $\mu_{t-1}$): it implicitly negotiates the exploration-exploitation tradeoff. A natural interpretation of this sampling rule is that it greedily selects points $x$ such that $f(x)$ should be a reasonable upper bound on $f(x^*)$, since the argument in (6) is an upper quantile of the marginal posterior $P(f(x) \mid \mathbf{y}_{t-1})$. We call this choice the Gaussian process upper confidence bound rule (GP-UCB), where $\beta_t$ is specified depending on the context (see Section 4). Pseudocode for the GP-UCB algorithm is provided in Algorithm 1. Figure 2 illustrates two subsequent iterations, where GP-UCB both explores (Figure 2(b)), by sampling an input $x$ with large $\sigma^2_{t-1}(x)$, and exploits (Figure 2(c)), by sampling $x$ with large $\mu_{t-1}(x)$.
The GP-UCB selection rule (6) is motivated by the UCB algorithm for the classical multi-armed bandit problem (Auer et al., 2002; Kocsis & Szepesvári, 2006). Among competing criteria for GP optimization (see Section 1), a variant of the GP-UCB rule has been demonstrated to be effective for this application (Dorard et al., 2009). To our knowledge, strong theoretical results of the kind provided for GP-UCB in this paper have not been given for any of these search heuristics. In Section 6, we show that in practice GP-UCB compares favorably with these alternatives.
If $D$ is infinite, finding $x_t$ in (6) may be hard: the upper confidence index is multimodal in general. However, global search heuristics are very effective in practice (Brochu et al., 2009). It is generally assumed that evaluating $f$ is more costly than maximizing the UCB index.
Algorithm 1 The GP-UCB algorithm.
Input: Input space $D$; GP prior $\mu_0 = 0$, $\sigma_0$, $k$
for $t = 1, 2, \ldots$ do
  Choose $x_t = \operatorname{argmax}_{x \in D} \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)$
  Sample $y_t = f(x_t) + \epsilon_t$
  Perform Bayesian update to obtain $\mu_t$ and $\sigma_t$
end for
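A minimal NumPy sketch (ours, not the authors' code) of Algorithm 1 on a finite decision set, maintaining the posterior by rank-one updates; `f_noisy` and `beta` are hypothetical callables supplied by the user:

```python
import numpy as np

def gp_ucb(K_D, f_noisy, noise_var, T, beta):
    """GP-UCB (Algorithm 1) on a finite decision set D with prior GP(0, K_D).

    K_D:     kernel matrix over D
    f_noisy: callable returning y_t = f(x_t) + eps_t for an index x_t
    beta:    callable t -> beta_t (e.g., the Theorem 1 schedule)
    """
    n = K_D.shape[0]
    mu = np.zeros(n)                                  # mu_0 = 0
    K_post = K_D.copy()                               # prior covariance over D
    chosen = []
    for t in range(1, T + 1):
        sigma = np.sqrt(np.clip(np.diag(K_post), 0.0, None))
        x_t = int(np.argmax(mu + np.sqrt(beta(t)) * sigma))   # rule (6)
        y_t = f_noisy(x_t)
        k_x = K_post[:, x_t]
        denom = K_post[x_t, x_t] + noise_var
        mu = mu + k_x * (y_t - mu[x_t]) / denom       # Bayesian update of mu_t
        K_post = K_post - np.outer(k_x, k_x) / denom  # and of the covariance
        chosen.append(x_t)
    return chosen, mu
```

For the Theorem 1 schedule on a set of `n` points one would pass, e.g., `beta = lambda t: 2 * np.log(n * t**2 * np.pi**2 / (6 * delta))`.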
UCB algorithms and GP optimization techniques in general have been applied to a large number of problems in practice (Kocsis & Szepesvári, 2006; Pandey & Olston, 2007; Lizotte et al., 2007). Their performance is well characterized in both the finite arm setting and the linear optimization setting, but no convergence rates for GP optimization are known.
4. Regret Bounds
We now establish cumulative regret bounds for GP optimization, treating a number of different settings: $f \sim GP(0, k(x,x'))$ for finite $D$, $f \sim GP(0, k(x,x'))$ for general compact $D$, and the agnostic case of arbitrary $f$ with bounded RKHS norm.
GP optimization generalizes stochastic linear optimization, where a function $f$ from a finite-dimensional linear space is optimized over. For the linear case, Dani et al. (2008) provide regret bounds that explicitly depend on the dimensionality³ $d$. GPs can be seen as random functions in some infinite-dimensional linear space, so their results do not apply in this case. This problem is circumvented in our regret bounds. The quantity governing them is the maximum information gain $\gamma_T$ after $T$ rounds, defined as:

$\gamma_T := \max_{A \subset D : |A| \le T} I(\mathbf{y}_A; f_A),$   (7)
where $I(\mathbf{y}_A; f_A) = I(\mathbf{y}_A; f)$ is defined in (3). Recall that $I(\mathbf{y}_A; f_A) = \frac{1}{2} \log |I + \sigma^{-2} K_A|$, where $K_A = [k(x,x')]_{x,x' \in A}$ is the covariance matrix of $f_A = [f(x)]_{x \in A}$ associated with the samples $A$. Our regret bounds are of the form $O^*(\sqrt{T \beta_T \gamma_T})$, where $\beta_T$ is the confidence parameter in Algorithm 1, while the bounds of Dani et al. (2008) are of the form $O^*(\sqrt{T \beta_T} d)$ ($d$ the dimensionality of the linear function space). Here and below, the $O^*$ notation is a variant of $O$, where log factors are suppressed. While our proofs (all provided in the longer version, Srinivas et al. 2009) use techniques similar to those of Dani et al. (2008), we face a number of additional significant technical challenges. Besides avoiding the finite-dimensional analysis, we must handle confidence issues, which are more delicate for nonlinear random functions.

³ In general, $d$ is the dimensionality of the input space $D$, which in the finite-dimensional linear case coincides with the feature space.

Temperature C
25
20
15 0
5 4 3 2 1 0
1 2 3
10 305
400
5 4 3 2 1 0
1 2 3 4 5
30
10 204
20
a Temperature data
6 4 2 0 2 4 6
b Iteration t
6 4 2 0 2 4 6
c Iteration t1
Gaussian Process Optimization in the Bandit Setting
40
Figure 2. a Example of temperature data collected by a network of 46 sensors at Intel Research Berkeley. b,c Two iterations of the GPUCB algorithm. It samples points that are either uncertain b or have high posterior mean c.
delicate for nonlinear random functions.
Importantly, note that the information gain is a problem-dependent quantity: properties of both the kernel and the input space determine the growth of regret. In Section 5, we provide general methods for bounding $\gamma_T$, either by efficient auxiliary computations or by direct expressions for specific kernels of interest. Our results match known lower bounds (up to log factors) in both the $K$-armed bandit and the $d$-dimensional linear optimization case.
Bounds for a GP Prior. For finite $D$, we obtain the following bound.

Theorem 1 Let $\delta \in (0,1)$ and $\beta_t = 2 \log(|D| t^2 \pi^2 / (6\delta))$. Running GP-UCB with $\beta_t$ for a sample $f$ of a GP with mean function zero and covariance function $k(x,x')$, we obtain a regret bound of $O^*(\sqrt{T \gamma_T \log |D|})$ with high probability. Precisely,

$\Pr\{R_T \le \sqrt{C_1 T \beta_T \gamma_T} \;\; \forall T \ge 1\} \ge 1 - \delta,$

where $C_1 = 8 / \log(1 + \sigma^{-2})$.

The proof essentially relates the regret to the growth of the log volume of the confidence ellipsoid, and, in a novel manner, shows how this growth is characterized by the information gain.

This theorem shows that, with high probability over samples from the GP, the cumulative regret is bounded in terms of the maximum information gain, forging a novel connection between GP optimization and experimental design. This link is of fundamental technical importance, allowing us to generalize Theorem 1 to infinite decision spaces. Moreover, the submodularity of $I(\mathbf{y}_A; f_A)$ allows us to derive sharp a priori bounds, depending on choice and parameterization of $k$ (see Section 5). In the following theorem, we generalize our result to any compact and convex $D \subseteq \mathbb{R}^d$ under mild assumptions on the kernel function $k$.

Theorem 2 Let $D \subset [0,r]^d$ be compact and convex, $d \in \mathbb{N}$, $r > 0$. Suppose that the kernel $k(x,x')$ satisfies the following high probability bound on the derivatives of GP sample paths $f$: for some constants $a, b > 0$,

$\Pr\{\sup_{x \in D} |\partial f / \partial x_j| > L\} \le a e^{-(L/b)^2}, \quad j = 1, \ldots, d.$

Pick $\delta \in (0,1)$, and define

$\beta_t = 2 \log(t^2 2\pi^2 / (3\delta)) + 2d \log\big(t^2 d b r \sqrt{\log(4da/\delta)}\big).$

Running GP-UCB with $\beta_t$ for a sample $f$ of a GP with mean function zero and covariance function $k(x,x')$, we obtain a regret bound of $O^*(\sqrt{d T \gamma_T})$ with high probability. Precisely, with $C_1 = 8 / \log(1 + \sigma^{-2})$ we have

$\Pr\{R_T \le \sqrt{C_1 T \beta_T \gamma_T} + 2 \;\; \forall T \ge 1\} \ge 1 - \delta.$

The main challenge in our proof is to lift the regret bound in terms of the confidence ellipsoid to general $D$. The smoothness assumption on $k(x,x')$ disqualifies GPs with highly erratic sample paths. It holds for stationary kernels $k(x,x') = k(x - x')$ which are four times differentiable (Theorem 5 of Ghosal & Roy 2006), such as the Squared Exponential and Matérn kernels with $\nu > 2$ (see Section 5.2), while it is violated for the Ornstein-Uhlenbeck kernel (Matérn with $\nu = 1/2$; a stationary variant of the Wiener process). For the latter, sample paths $f$ are nondifferentiable almost everywhere with probability one and come with independent increments. We conjecture that a result of the form of Theorem 2 does not hold in this case.

Bounds for Arbitrary f in the RKHS. Thus far, we have assumed that the target function $f$ is sampled from a GP prior and that the noise is $N(0, \sigma^2)$ with known variance $\sigma^2$. We now analyze GP-UCB in an agnostic setting, where $f$ is an arbitrary function from the RKHS corresponding to kernel $k(x,x')$. Moreover, we allow the noise variables $\epsilon_t$ to be an arbitrary martingale difference sequence (meaning that $\mathbb{E}[\epsilon_t \mid \epsilon_{<t}] = 0$ for all $t \in \mathbb{N}$), uniformly bounded by $\sigma$. Note that we still run the same GP-UCB algorithm, whose prior and noise model are misspecified in this case. Our following result shows that GP-UCB attains sublinear regret even in the agnostic setting.
Theorem 3 Let $\delta \in (0,1)$. Assume that the true underlying $f$ lies in the RKHS $\mathcal{H}_k(D)$ corresponding to the kernel $k(x,x')$, and that the noise $\epsilon_t$ has zero mean conditioned on the history and is bounded by $\sigma$ almost surely. In particular, assume $\|f\|_k^2 \le B$ and let $\beta_t = 2B + 300 \gamma_t \log^3(t/\delta)$. Running GP-UCB with $\beta_t$, prior $GP(0, k(x,x'))$ and noise model $N(0, \sigma^2)$, we obtain a regret bound of $O^*(\sqrt{T}(B\sqrt{\gamma_T} + \gamma_T))$ with high probability over the noise. Precisely,

$\Pr\{R_T \le \sqrt{C_1 T \beta_T \gamma_T} \;\; \forall T \ge 1\} \ge 1 - \delta,$

where $C_1 = 8 / \log(1 + \sigma^{-2})$.
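For reference, a small sketch (ours) collecting the confidence schedules $\beta_t$ stated in Theorems 1-3; the constants are taken verbatim from the theorem statements above, and `gamma_t` in the Theorem 3 schedule must be supplied from an information gain bound such as those developed in Section 5.

```python
import numpy as np

def beta_thm1(t, n_points, delta):
    """Theorem 1: finite D with |D| = n_points, f sampled from the GP."""
    return 2.0 * np.log(n_points * t**2 * np.pi**2 / (6.0 * delta))

def beta_thm2(t, d, delta, a, b, r):
    """Theorem 2: compact convex D in [0, r]^d, derivative tail constants a, b."""
    return (2.0 * np.log(t**2 * 2.0 * np.pi**2 / (3.0 * delta))
            + 2.0 * d * np.log(t**2 * d * b * r
                               * np.sqrt(np.log(4.0 * d * a / delta))))

def beta_thm3(t, B, gamma_t, delta):
    """Theorem 3: agnostic setting with ||f||_k^2 <= B."""
    return 2.0 * B + 300.0 * gamma_t * np.log(t / delta) ** 3
```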
Note that while our theorem implicitly assumes that GP-UCB has knowledge of an upper bound on $\|f\|_k$, standard guess-and-doubling approaches suffice if no such bound is known a priori. Comparing Theorem 2 and Theorem 3, the latter holds uniformly over all functions $f$ with $\|f\|_k \le B$, while the former is a probabilistic statement requiring knowledge of the GP that $f$ is sampled from. In contrast, if $f \sim GP(0, k(x,x'))$, then $\|f\|_k = \infty$ almost surely (Wahba, 1990): sample paths are rougher than RKHS functions. Neither Theorem 2 nor Theorem 3 encompasses the other.
5. Bounding the Information Gain
Since the bounds developed in Section 4 depend on the information gain, the key remaining question is how to bound the quantity $\gamma_T$ for practical classes of kernels.
5.1. Submodularity and Greedy Maximization

In order to bound $\gamma_T$, we have to maximize the information gain $F(A) = I(\mathbf{y}_A; f)$ over all subsets $A \subset D$ of size $T$: a combinatorial problem in general. However, as noted in Section 2, $F(A)$ is a submodular function, which implies the performance guarantee (5) for maximizing $F$ sequentially by the greedy ED rule (4). Dividing both sides of (5) by $(1 - 1/e)$, we can upper-bound $\gamma_T$ by $(1 - 1/e)^{-1} I(\mathbf{y}_{A_T}; f)$, where $A_T$ is constructed by the greedy procedure. Thus, somewhat counterintuitively, instead of using submodularity to prove that $F(A_T)$ is near-optimal, we use it in order to show that $\gamma_T$ is "near-greedy". As noted in Section 2, the ED rule does not depend on observations $y_t$ and can be run without evaluating $f$.

The importance of this greedy bound is twofold. First, it allows us to numerically compute highly problem-specific bounds on $\gamma_T$, which can be plugged into our results in Section 4 to obtain high-probability bounds on $R_T$. This being a laborious procedure, one would prefer a priori bounds for $\gamma_T$ in practice which are simple analytical expressions of $T$ and parameters of $k$. In this section, we sketch a general procedure for obtaining such expressions, instantiating them for a number of commonly used covariance functions, once more relying crucially on the greedy ED rule upper bound. Suppose that $D$ is finite for now, and let $\mathbf{f} = [f(x)]_{x \in D}$, $\mathbf{K}_D = [k(x,x')]_{x,x' \in D}$. Sampling $f$ at $x_t$, we obtain $y_t \sim N(\mathbf{v}_t^T \mathbf{f}, \sigma^2)$, where $\mathbf{v}_t \in \mathbb{R}^{|D|}$ is the indicator vector associated with $x_t$. We can upper-bound the greedy maximum once more, by relaxing this constraint to $\|\mathbf{v}_t\| = 1$ in round $t$ of the sequential method. For this relaxed greedy procedure, all $\mathbf{v}_t$ are leading eigenvectors of $\mathbf{K}_D$, since successive covariance matrices of $P(\mathbf{f} \mid \mathbf{y}_{<t})$ share their eigenbasis with $\mathbf{K}_D$, while eigenvalues are damped according to how many times the corresponding eigenvector is selected. We can upper-bound the information gain by considering the worst-case allocation of $T$ samples to the $\min\{T, |D|\}$ leading eigenvectors of $\mathbf{K}_D$:

$\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{(m_t)} \sum_{t=1}^{|D|} \log(1 + \sigma^{-2} m_t \hat{\lambda}_t),$   (8)

subject to $\sum_t m_t = T$, where $\operatorname{spec}(\mathbf{K}_D) = \{\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \ldots\}$. We can split the sum into two parts in order to obtain a bound to leading order. The following theorem captures this intuition:

Theorem 4 For any $T \in \mathbb{N}$ and any $T_* = 1, \ldots, T$:

$\gamma_T \le O\big(\sigma^{-2} [B(T_*) T + T_* \log(n_T T)]\big),$

where $n_T = \sum_{t=1}^{|D|} \hat{\lambda}_t$ and $B(T_*) = \sum_{t=T_*+1}^{|D|} \hat{\lambda}_t$.

Therefore, if for some $T_* \ll T$ the first $T_*$ eigenvalues carry most of the total mass $n_T$, the information gain will be small. The more rapidly the spectrum of $\mathbf{K}_D$ decays, the slower the growth of $\gamma_T$. Figure 3 illustrates this intuition.

Figure 3. Spectral decay (left) and information gain bound (right) for independent (diagonal), linear, squared exponential and Matérn ($\nu = 2.5$) kernels with equal trace.
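The greedy upper bound on $\gamma_T$ is easy to compute numerically. A sketch (ours), reusing the `greedy_ed_design` and `information_gain` sketches given earlier and rescaling by $(1 - 1/e)^{-1}$ as described above:

```python
import numpy as np

def gamma_T_upper_bound(K_D, noise_var, T):
    """Numeric upper bound on gamma_T via the greedy ED rule and Eq. (5).

    Runs uncertainty sampling (rule (4)) for T rounds and rescales the
    achieved information gain by (1 - 1/e)^{-1}, as in Section 5.1.
    """
    idx = greedy_ed_design(K_D, noise_var, T)    # greedy design A_T
    K_A = K_D[np.ix_(idx, idx)]                  # kernel matrix over A_T
    gain = information_gain(K_A, noise_var)      # 0.5 log|I + sigma^-2 K_A|
    return gain / (1.0 - 1.0 / np.e)             # max gain <= greedy/(1 - 1/e)
```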
5.2. Bounds for Common Kernels

In this section we bound $\gamma_T$ for a range of commonly used covariance functions: finite dimensional linear, Squared Exponential and Matérn kernels. Together with our results in Section 4, these imply sublinear regret bounds for GP-UCB in all cases.

Finite dimensional linear kernels have the form $k(x,x') = x^T x'$. GPs with this kernel correspond to random linear functions $f(x) = \mathbf{w}^T x$, $\mathbf{w} \sim N(0, I)$.

The Squared Exponential kernel is $k(x,x') = \exp(-\frac{1}{2l^2}\|x - x'\|^2)$, $l$ a lengthscale parameter. Sample functions are differentiable to any order almost surely (Rasmussen & Williams, 2006).

The Matérn kernel is given by $k(x,x') = (2^{1-\nu}/\Gamma(\nu)) r^\nu B_\nu(r)$, $r = (\sqrt{2\nu}/l)\|x - x'\|$, where $\nu$ controls the smoothness of sample paths (the smaller, the rougher) and $B_\nu$ is a modified Bessel function.

Theorem 5 Let $D \subset \mathbb{R}^d$ be compact and convex, $d \in \mathbb{N}$. Assume the kernel function satisfies $k(x,x') \le 1$.

1. Finite spectrum. For the $d$-dimensional Bayesian linear regression case: $\gamma_T = O(d \log T)$.

2. Exponential spectral decay. For the Squared Exponential kernel: $\gamma_T = O((\log T)^{d+1})$.

3. Power law spectral decay. For Matérn kernels with $\nu > 1$: $\gamma_T = O(T^{d(d+1)/(2\nu + d(d+1))} \log T)$.

We now provide a sketch of the proof. $\gamma_T$ is bounded by Theorem 4 in terms of the eigendecay of the kernel matrix $\mathbf{K}_D$. If $D$ is infinite or very large, we can use the operator spectrum of $k(x,x')$, which likewise decays rapidly. For the kernels of interest here, asymptotic expressions for the operator eigenvalues are given in Seeger et al. (2008), who derived bounds on the information gain for fixed and random designs (in contrast to the worst-case information gain considered here, which is substantially more challenging to bound). The main challenge in the proof is to ensure the existence of discretizations $D_T \subset D$, dense in the limit, for which tail sums $B(T_*)/n_T$ in Theorem 4 are close to corresponding operator spectra tail sums.

Together with Theorems 2 and 3, this result guarantees sublinear regret of GP-UCB for any dimension (see Figure 1). For the Squared Exponential kernel, the dimension $d$ appears as exponent of $\log T$ only, so that the regret grows at most as $O^*(\sqrt{T (\log T)^{d+1}})$: the high degree of smoothness of the sample paths effectively combats the curse of dimensionality.

6. Experiments

We compare GP-UCB with heuristics such as Expected Improvement (EI) and Most Probable Improvement (MPI), and with naive methods which choose points of maximum mean or variance only, both on synthetic and real sensor network data.

For synthetic data, we sample random functions from a squared exponential kernel with lengthscale parameter 0.2. The sampling noise variance $\sigma^2$ was set to 0.025, or 5% of the signal variance. Our decision set $D = [0,1]$ is uniformly discretized into 1000 points. We run each algorithm for $T = 1000$ iterations with $\delta = 0.1$, averaging over 30 trials (samples from the kernel). While the choice of $\beta_t$ as recommended by Theorem 1 leads to competitive performance of GP-UCB, we find (using cross-validation) that the algorithm is improved by scaling $\beta_t$ down by a factor 5. Note that we did not optimize constants in our regret bounds.

Next, we use temperature data collected from 46 sensors deployed at Intel Research Berkeley over 5 days at 1 minute intervals, pertaining to the example in Section 2. We take the first two-thirds of the data set to compute the empirical covariance of the sensor readings, and use it as the kernel matrix. The functions $f$ for optimization consist of one set of observations from all the sensors taken from the remaining third of the data set, and the results for $T = 46$, $\sigma^2 = 0.5$ (or 5% noise), $\delta = 0.1$ were averaged over 2000 possible choices of the objective function.

Lastly, we take data from traffic sensors deployed along the highway I-880 South in California. The goal was to find the point of minimum speed in order to identify the most congested portion of the highway; we used traffic speed data for all working days from 6 AM to 11 AM for one month, from 357 sensors. We again use the covariance matrix from two-thirds of the data set as kernel matrix, and test on the other third. The results for $T = 357$, $\sigma^2 = 4.78$ (or 5% noise), $\delta = 0.1$ were averaged over 900 runs.

Figure 4 compares the mean average regret incurred by the different heuristics and the GP-UCB algorithm on synthetic and real data. For temperature data, the GP-UCB algorithm and the EI heuristic clearly outperform the others, and do not exhibit significant difference between each other. On synthetic and traffic data MPI does equally well. In summary, GP-UCB performs at least on par with the existing approaches, which are not equipped with regret bounds.

Figure 4. Comparison of performance (mean average regret): GP-UCB and various heuristics (EI, MPI, mean only, variance only) on synthetic squared exponential data (a), and sensor network data: temperature (b) and traffic (c).
7. Conclusions

We prove the first sublinear regret bounds for GP optimization with commonly used kernels (see Figure 1), both for $f$ sampled from a known GP and $f$ of low RKHS norm. We analyze GP-UCB, an intuitive, Bayesian upper confidence bound based sampling rule. Our regret bounds crucially depend on the information gain due to sampling, establishing a novel connection between bandit optimization and experimental design.
We bound the information gain in terms of the kernel spectrum, providing a general methodology for obtaining regret bounds with kernels of interest. Our experiments on real sensor network data indicate that GP-UCB performs at least on par with competing criteria for GP optimization, for which no regret bounds are known at present. Our results provide an interesting step towards understanding exploration-exploitation tradeoffs with complex utility functions.
Acknowledgements. We thank Marcus Hutter for insightful comments on an earlier version. Research partially supported by ONR grant N000140911044, NSF grant CNS0932392, a gift from Microsoft Corporation, and the Excellence Initiative of the German Research Foundation (DFG).
References
Auer, P. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397-422, 2002.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235-256, 2002.

Brochu, E., Cora, M., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, UBC, 2009.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online optimization in X-armed bandits. In NIPS, 2008.

Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Stat. Sci., 10(3):273-304, 1995.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley Interscience, 1991.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In COLT, 2008.

Dorard, L., Glowacka, D., and Shawe-Taylor, J. Gaussian process modelling of dependencies in multi-armed bandit problems. In Int. Symp. Op. Res., 2009.

Ghosal, S. and Roy, A. Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann. Stat., 34(5):2413-2429, 2006.

Grünewälder, S., Audibert, J-Y., Opper, M., and Shawe-Taylor, J. Regret bounds for Gaussian process bandit problems. In AISTATS, 2010.

Huang, D., Allen, T. T., Notz, W. I., and Zeng, N. Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Glob. Opt., 34:441-466, 2006.

Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. J. Glob. Opt., 13:455-492, 1998.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In STOC, pp. 681-690, 2008.

Ko, C., Lee, J., and Queyranne, M. An exact algorithm for maximum entropy sampling. Ops. Res., 43(4):684-691, 1995.

Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In ECML, 2006.

Krause, A. and Guestrin, C. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005.

Lizotte, D., Wang, T., Bowling, M., and Schuurmans, D. Automatic gait optimization with Gaussian process regression. In IJCAI, pp. 944-949, 2007.

Mockus, J. Bayesian Approach to Global Optimization. Kluwer Academic Publishers, 1989.

Mockus, J., Tiesis, V., and Zilinskas, A. Toward Global Optimization, volume 2, chapter Bayesian Methods for Seeking the Extremum, pp. 117-128. 1978.

Nemhauser, G., Wolsey, L., and Fisher, M. An analysis of the approximations for maximizing submodular set functions. Math. Prog., 14:265-294, 1978.

Pandey, S. and Olston, C. Handling advertisements of unknown quality in search advertising. In NIPS, 2007.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc., 58:527-535, 1952.

Seeger, M. W., Kakade, S. M., and Foster, D. P. Information consistency of nonparametric Gaussian process methods. IEEE Tr. Inf. Theo., 54(5):2376-2382, 2008.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv:0912.3995, 2009.

Vazquez, E. and Bect, J. Convergence properties of the expected improvement algorithm, 2007.

Wahba, G. Spline Models for Observational Data. SIAM, 1990.