[SOLVED] R algorithm deep learning html parallel graph statistic software network Bayesian cuda Practical Bayesian Optimization of Machine Learning Algorithms

$25

File Name: R_algorithm_deep_learning_html_parallel_graph_statistic_software_network_Bayesian_cuda_Practical_Bayesian_Optimization_of_Machine_Learning_Algorithms.zip
File Size: 1403.58 KB

5/5 - (1 vote)

Practical Bayesian Optimization of Machine Learning Algorithms
Jasper Snoek
Department of Computer Science University of Toronto jaspercs.toronto.edu
Hugo Larochelle
Department of Computer Science University of Sherbrooke hugo.larochelleusherbrooke.edu
Ryan P. Adams
School of Engineering and Applied Sciences Harvard University rpaseas.harvard.edu
Abstract
The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is of ten a black art requiring expert experience, rules of thumb, or sometimes brute force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian opti mization, in which a learning algorithms generalization performance is modeled as a sample from a Gaussian process GP. We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparame ters, can play a crucial role in obtaining a good optimizer that can achieve expert level performance. We describe new algorithms that take into account the variable cost duration of learning algorithm experiments and that can leverage the pres ence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expertlevel optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.
1 Introduction
Machine learning algorithms are rarely parameterfree: parameters controlling the rate of learning or the capacity of the underlying model must often be specified. These parameters are often con sidered nuisances, making it appealing to develop machine learning algorithms with fewer of them. Another, more flexible take on this issue is to view the optimization of such parameters as a proce dure to be automated. Specifically, we could view such tuning as the optimization of an unknown blackbox function and invoke algorithms developed for such problems. A good choice is Bayesian optimization 1, which has been shown to outperform other state of the art global optimization algorithms on a number of challenging optimization benchmark functions 2. For continuous func tions, Bayesian optimization typically works by assuming the unknown function was sampled from a Gaussian process and maintains a posterior distribution for this function as observations are made or, in our case, as the results of running learning algorithm experiments with different hyperpa rameters are observed. To pick the hyperparameters of the next experiment, one can optimize the expected improvement EI 1 over the current best result or the Gaussian process upper confidence bound UCB3. EI and UCB have been shown to be efficient in the number of function evaluations required to find the global optimum of many multimodal blackbox functions 4, 3.
1

Machine learning algorithms, however, have certain characteristics that distinguish them from other blackbox optimization problems. First, each function evaluation can require a variable amount of time: training a small neural network with 10 hidden units will take less time than a bigger net work with 1000 hidden units. Even without considering duration, the advent of cloud computing makes it possible to quantify economically the cost of requiring largememory machines for learn ing, changing the actual cost in dollars of an experiment with a different number of hidden units. Second, machine learning experiments are often run in parallel, on multiple cores or machines. In both situations, the standard sequential approach of GP optimization can be suboptimal.
In this work, we identify good practices for Bayesian optimization of machine learning algorithms. We argue that a fully Bayesian treatment of the underlying GP kernel is preferred to the approach based on optimization of the GP hyperparameters, as previously proposed 5. Our second contri bution is the description of new algorithms for taking into account the variable and unknown cost of experiments or the availability of multiple cores to run experiments in parallel.
Gaussian processes have proven to be useful surrogate models for computer experiments and good practices have been established in this context for sensitivity analysis, calibration and prediction 6. While these strategies are not considered in the context of optimization, they can be useful to re searchers in machine learning who wish to understand better the sensitivity of their models to various hyperparameters. Hutter et al. 7 have developed sequential modelbased optimization strategies for the configuration of satisfiability and mixed integer programming solvers using random forests. The machine learning algorithms we consider, however, warrant a fully Bayesian treatment as their ex pensive nature necessitates minimizing the number of evaluations. Bayesian optimization strategies have also been used to tune the parameters of Markov chain Monte Carlo algorithms 8. Recently, Bergstra et al. 5 have explored various strategies for optimizing the hyperparameters of machine learning algorithms. They demonstrated that grid search strategies are inferior to random search 9, and suggested the use of Gaussian process Bayesian optimization, optimizing the hyperparameters of a squaredexponential covariance, and proposed the Tree Parzen Algorithm.
2 Bayesian Optimization with Gaussian Process Priors
As in other kinds of optimization, in Bayesian optimization we are interested in finding the mini mum of a function fx on some bounded set X, which we will take to be a subset of RD. What makes Bayesian optimization different from other procedures is that it constructs a probabilistic model for fx and then exploits this model to make decisions about where in X to next evaluate the function, while integrating out uncertainty. The essential philosophy is to use all of the informa tion available from previous evaluations of fx and not simply rely on local gradient and Hessian approximations. This results in a procedure that can find the minimum of difficult nonconvex func tions with relatively few evaluations, at the cost of performing more computation to determine the next point to try. When evaluations of fx are expensive to performas is the case when it requires training a machine learning algorithmthen it is easy to justify some extra computation to make better decisions. For an overview of the Bayesian optimization formalism and a review of previous work, see, e.g., Brochu et al. 10. In this section we briefly review the general Bayesian optimization approach, before discussing our novel contributions in Section 3.
There are two major choices that must be made when performing Bayesian optimization. First, one must select a prior over functions that will express assumptions about the function being optimized. For this we choose the Gaussian process prior, due to its flexibility and tractability. Second, we must choose an acquisition function, which is used to construct a utility function from the model posterior, allowing us to determine the next point to evaluate.
2.1 Gaussian Processes
The Gaussian process GP is a convenient and powerful prior distribution on functions, which we willtakeheretobeoftheformf : XR. TheGPisdefinedbythepropertythatanyfinitesetofN points xnX Nn1 induces a multivariate Gaussian distribution on RN . The nth of these points is taken to be the function value fxn, and the elegant marginalization properties of the Gaussian distribution allow us to compute marginals and conditionals in closed form. The support and prop erties of the resulting distribution on functions are determined by a mean function m : XR and a positive definite covariance function K : XXR. We will discuss the impact of covariance functions in Section 3.1. For an overview of Gaussian processes, see Rasmussen and Williams 11.
2

2.2 Acquisition Functions for Bayesian Optimization
We assume that the function fx is drawn from a Gaussian process prior and that our observa tions are of the form xn,ynNn1, where ynNfxn, andis the variance of noise intro duced into the function observations. This prior and these data induce a posterior over functions; the acquisition function, which we denote by a : XR, determines what point in X should be evaluated next via a proxy optimization xnextargmaxx ax, where several different functions have been proposed. In general, these acquisition functions depend on the previous observations, as well as the GP hyperparameters; we denote this dependence as ax ; xn, yn, . There are several popular choices of acquisition function. Under the Gaussian process prior, these functions depend on the model solely through its predictive mean function x ; xn, yn,and predictive variance function 2x ; xn, yn, . In the proceeding, we will denote the best current value as xbestargminxn fxn and the cumulative distribution function of the standard normal as .
Probability of Improvement One intuitive strategy is to maximize the probability of improving over the best current value 12. Under the GP this can be computed analytically as
aPIx;xn,yn,x, x fxbestx;xn,yn,. 1 x ; xn, yn,
Expected Improvement Alternatively, one could choose to maximize the expected improvement EI over the current best. This also has closed form under the Gaussian process:
aEIx ; xn, yn, x ; xn, yn,x xN x ; 0, 1 2
GP Upper Confidence Bound A more recent development is the idea of exploiting lower confi dence bounds upper, when considering maximization to construct acquisition functions that mini mize regret over the course of their optimization 3. These acquisition functions have the form
aLCBx ; xn, yn, x ; xn, yn,x ; xn, yn, , 3 with a tunableto balance exploitation against exploration.
In this work we will focus on the EI criterion, as it has been shown to be betterbehaved than probability of improvement, but unlike the method of GP upper confidence bounds GPUCB, it does not require its own tuning parameter. Although the EI algorithm performs well in minimization problems, we wish to note that the regret formalization may be more appropriate in some settings. We perform a direct comparison between our EIbased approach and GPUCB in Section 4.1.
3 Practical Considerations for Bayesian Optimization of Hyperparameters
Although an elegant framework for optimizing expensive functions, there are several limitations that have prevented it from becoming a widelyused technique for optimizing hyperparameters in machine learning problems. First, it is unclear for practical problems what an appropriate choice is for the covariance function and its associated hyperparameters. Second, as the function evaluation itself may involve a timeconsuming optimization procedure, problems may vary significantly in duration and this should be taken into account. Third, optimization algorithms should take advantage of multicore parallelism in order to map well onto modern computational environments. In this section, we propose solutions to each of these issues.
3.1 Covariance Functions and Treatment of Covariance Hyperparameters
The power of the Gaussian process to express a rich distribution on functions rests solely on the shoulders of the covariance function. While nondegenerate covariance functions correspond to infinite bases, they nevertheless can correspond to strong assumptions regarding likely functions. In particular, the automatic relevance determination ARD squared exponential kernel
1 D
KSEx,x0exp 2r2x,x r2x,x
xd xd2d2. 4
d1
is often a default choice for Gaussian process regression. However, sample functions with this co
variance function are unrealistically smooth for practical optimization problems. We instead propose the use of the ARD Mate rn 52 kernel:
2522 KM52x,x0 1 5r x,x3r x,x exp5r x,x . 5
3

a Posterior samples under varying hyperparameters
b Expected improvement under varying hyperparameters
c Integrated expected improvement
Figure 1: Illustration of integrated expected improve ment. a Three posterior samples are shown, each with different length scales, after the same five obser vations. b Three expected improvement acquisition functions, with the same data and hyperparameters. The maximum of each is shown. c The integrated expected improvement, with its maximum shown.
a Posterior samples after three data
b Expected improvement under three fantasies
c Expected improvement across fantasies
Figure 2: Illustration of the acquisition with pend ing evaluations. a Three data have been observed and three posterior functions are shown, with fan tasies for three pending evaluations. b Expected im provement, conditioned on the each joint fantasy of the pending outcome. c Expected improvement after in tegrating over the fantasy outcomes.
This covariance function results in sample functions which are twicedifferentiable, an assumption that corresponds to those made by, e.g., quasiNewton methods, but without requiring the smooth ness of the squared exponential.
After choosing the form of the covariance, we must also manage the hyperparameters that govern its behavior Note that these hyperparameters are distinct from those being subjected to the overall Bayesian optimization., as well as that of the mean function. For our problems of interest, typically we would have D3 Gaussian process hyperparameters: D length scales 1:D , the covariance amplitude 0, the observation noise , and a constant mean m. The most commonly advocated ap proach is to use a point estimate of these parameters by optimizing the marginal likelihood under the Gaussian process, pyxnNn1,,,mNym1, I, where yy1,y2,,yNT, andis the covariance matrix resulting from the N input points under the hyperparameters .
However, for a fullyBayesian treatment of hyperparameters summarized here byalone, it is desirable to marginalize over hyperparameters and compute the integrated acquisition function:

ax ; xn, yn
ax ; xn, yn,pxn, ynNn1 d, 6
where ax depends onand all of the observations. For probability of improvement and EI, this expectation is the correct generalization to account for uncertainty in hyperparameters. We can therefore blend acquisition functions arising from samples from the posterior over GP hyperparam eters and have a Monte Carlo estimate of the integrated expected improvement. These samples can be acquired efficiently using slice sampling, as described in Murray and Adams 13. As both opti mization and Markov chain Monte Carlo are computationally dominated by the cubic cost of solving an N dimensional linear system and our function evaluations are assumed to be much more expen sive anyway, the fullyBayesian treatment is sensible and our empirical evaluations bear this out. Figure 1 shows how the integrated expected improvement changes the acquistion function.
3.2 Modeling Costs
Ultimately, the objective of Bayesian optimization is to find a good setting of our hyperparameters as quickly as possible. Greedy acquisition procedures such as expected improvement try to make
4

the best progress possible in the next function evaluation. From a practial point of view, however, we are not so concerned with function evaluations as with wallclock time. Different regions of the parameter space may result in vastly different execution times, due to varying regularization, learning rates, etc. To improve our performance in terms of wallclock time, we propose optimizing with the expected improvement per second, which prefers to acquire points that are not only likely to be good, but that are also likely to be evaluated quickly. This notion of cost can be naturally generalized to other budgeted resources, such as reagents or money.
Just as we do not know the true objective function fx, we also do not know the duration func tion cx : XR . We can nevertheless employ our Gaussian process machinery to model ln cx alongside f x. In this work, we assume that these functions are independent of each other, although their coupling may be usefully captured using GP variants of multitask learning e.g., 14, 15. Under the independence assumption, we can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x.
3.3 Monte Carlo Acquisition for Parallelizing Bayesian Optimization
With the advent of multicore computing, it is natural to ask how we can parallelize our Bayesian optimization procedures. More generally than simply batch parallelism, however, we would like to be able to decide what x should be evaluated next, even while a set of points are being evaluated. Clearly, we cannot use the same acquisition function again, or we will repeat one of the pending experiments. Ideally, we could perform a rollout of our acquisition policy, to choose a point that appropriately balanced information gain and exploitation. However, such rollouts are generally intractable. Instead we propose a sequential strategy that takes advantage of the tractable inference properties of the Gaussian process to compute Monte Carlo estimates of the acquisiton function under different possible results from pending function evaluations.
Consider the situation in which N evaluations have completed, yielding data xn, ynNn1, and in which J evaluations are pending at locations xjJj1. Ideally, we would choose a new point based on the expected acquisition function under all possible outcomes of these pending evaluations:
ax; xn,yn,,xj

ax; xn,yn,,xj,yjpyjJj1 xjJj1,xn,ynNn1dy1 dyJ. 7 RJ
This is simply the expectation of ax under a J dimensional Gaussian distribution, whose mean and covariance can easily be computed. As in the covariance hyperparameter case, it is straightforward to use samples from this distribution to compute the expected acquisition and use this to select the next point. Figure 2 shows how this procedure would operate with queued evaluations. We note that a similar approach is touched upon briefly by Ginsbourger and Riche 16, but they view it as too intractable to warrant attention. We have found our Monte Carlo estimation procedure to be highly effective in practice, however, as will be discussed in Section 4.
4 Empirical Analyses
In this section, we empirically analyse1 the algorithms introduced in this paper and compare to ex isting strategies and human performance on a number of challenging machine learning problems. We refer to our method of expected improvement while marginalizing GP hyperparameters as GP EI MCMC, optimizing hyperparameters as GP EI Opt, EI per second as GP EI per Second, and N times parallelized GP EI MCMC as Nx GP EI MCMC. Each results figure plots the progres sion of minxn fxn over the number of function evaluations or time, averaged over multiple runs of each algorithm. If not specified otherwise, xnextargmaxx ax is computed using gradient based search with multiple restarts see supplementary material for details. The code used is made publicly available at http:www.cs.toronto.edu jaspersoftware.html.
4.1 BraninHoo and Logistic Regression
We first compare to standard approaches and the recent Tree Parzen Algorithm2 TPA of Bergstra et al. 5 on two standard problems. The BraninHoo function is a common benchmark for Bayesian
1All experiments were conducted on identical machines using the Amazon EC2 service.
2Using the publicly available code from https:github.comjaberghyperoptwiki
5

35 30 25 20 15 10
5 0
GP EI Opt GP EI MCMC GPUCB TPA
40
0.24 0.22 0.2 0.18 0.16 0.14 0.12 0.1 0.08
GP EI MCMC
GP EI Opt
GP EI per Sec
Tree Parzen Algorithm
0.2 0.18 0.16 0.14 0.12 0.1 0.08
GP EI MCMC
GP EI per Second
5 10 15 20 25 30 35 40 45 Minutes
c
0
10
20
Function evaluations
a
50
1350 1350 GP EI MCMC
1350 GP EI per second 1340
1340 GP EI per second 1340 1330 GP EI Opt 1330
GP EI MCMC
GP EI Opt 1330
Random Grid Search
1320 3x GP EI MCMC 1320 1310 5x GP EI MCMC 1310
3x GP EI MCMC
5x GP EI MCMC 1320 10x GP EI MCMC 1310
1300 1290 1280 1270
1300 1290 1280 1270 1260
1300 1290 1280 1270 1260
30
0
20
40
Function Evaluations
b
80
100
Figure 3: Comparisons on the BraninHoo function 3a and training logistic regression on MNIST 3b. 3c shows GP EI MCMC and GP EI per Second from 3b, but in terms of time elapsed.
3x GP EI MCMC On grid 5x GP EI MCMC On grid 3x GP EI MCMC Off grid 5x GP EI MCMC Off grid
10x GP EI MCMC
1260
0 10 20 30 40 50 0 2 4 6 8 10 12 0 10 20 30 40 50
Function evaluations Time Days Function evaluations a b c
Figure 4: Different strategies of optimization on the Online LDA problem compared in terms of function evaluations 4a, walltime 4b and constrained to a grid or not 4c.
optimization techniques 2 that is defined over xR2 where 0x115 and 5×215. We also compare to TPA on a logistic regression classification task on the popular MNIST data. The algorithm requires choosing four hyperparameters, the learning rate for stochastic gradient descent, on a log scale from 0 to 1, the l2 regularization parameter, between 0 and 1, the mini batch size, from 20 to 2000 and the number of learning epochs, from 5 to 2000. Each algorithm was run on the BraninHoo and logistic regression problems 100 and 10 times respectively and mean and standard error are reported. The results of these analyses are presented in Figures 3a and 3b in terms of the number of times the function is evaluated. On BraninHoo, integrating over hyperparameters is superior to using a point estimate and the GP EI significantly outperforms TPA, finding the minimum in less than half as many evaluations, in both cases. For logistic regression, 3b and 3c show that although EI per second is less efficient in function evaluations it outperforms standard EI in time.
4.2 Online LDA
Latent Dirichlet Allocation LDA is a directed graphical model for documents in which words are generated from a mixture of multinomial topic distributions. Variational Bayes is a popular paradigm for learning and, recently, Hoffman et al. 17 proposed an online learning approach in that context. Online LDA requires 2 learning parameters, 0 and , that control the learning rate t0t used to update the variational parameters of LDA based on the tth minibatch of document word count vectors. The size of the minibatch is also a third parameter that must be chosen. Hoffman et al. 17 relied on an exhaustive grid search of size 668, for a total of 288 hyperparameter configurations.
We used the code made publically available by Hoffman et al. 17 to run experiments with online LDA on a collection of Wikipedia articles. We downloaded a random set of 249 560 articles, split into training, validation and test sets of size 200 000, 24 560 and 25 000 respectively. The documents are represented as vectors of word counts from a vocabulary of 7702 words. As reported in Hoffman et al. 17, we used a lower bound on the per word perplexity of the validation set documents as the performance measure. One must also specify the number of topics and the hyperparametersfor the symmetric Dirichlet prior over the topic distributions andfor the symmetric Dirichlet prior over the per document topic mixing weights. We followed Hoffman et al. 17 and used 100 topics and 0.01 in our experiments in order to emulate their analysis and repeated exactly the grid search reported in the paper3. Each online LDA evaluation generally took between five to ten hours to converge, thus the grid search requires approximately 60 to 120 processor days to complete.
3i.e. the only difference was the randomly sampled collection of articles in the data set and the choice of the vocabulary. We ran each evaluation for 10 hours or until convergence.
6
60
Min Function Value Min Function Value
Min function value
Min Function Value
Min Function Value Min Function Value

0.26
0.255
0.25
0.245
0.24
GP EI MCMC
0.26
0.28
GP EI MCMC
GP EI per Second 0.275 3x GP EI MCMC
3x GP EI per Second 0.27
0.265 0.26 0.255 0.25 0.245
GP EI per Second
3x GP EI MCMC
3x GP EI per Second 0.255 Random Grid Search
0.25
0.245
Matern 52 ARD SqExp
SqExp ARD Matern 32 ARD
Min function value
Min Function Value
Min Function Value
0.24
Time hours Function evaluations Function evaluations
a b c
0.24
0 5 10 15 20 25 0 20 40 60 80 100 0 20 40 60 80 100
Figure 5: A comparison of various strategies for optimizing the hyperparameters of M3E models on the protein motif finding task in terms of walltime 5a, function evaluations 5b and different covariance functions5c.
In Figures 4a and 4b we compare our various strategies of optimization over the same grid on this expensive problem. That is, the algorithms were restricted to only the exact parameter settings as evaluated by the grid search. Each optimization was then repeated 100 times each time picking two different random experiments to initialize the optimization with and the mean and standard error are reported4. Figure 4c also presents a 5 run average of optimization with 3 and 5 times parallelized GP EI MCMC, but without restricting the new parameter setting to be on the prespecified grid see supplementary material for details. A comparison with their on grid versions is illustrated.
Clearly integrating over hyperparameters is superior to using a point estimate in this case. While GP EI MCMC is the most efficient in terms of function evaluations, we see that parallelized GP EI MCMC finds the best parameters in significantly less time. Finally, in Figure 4c we see that the parallelized GP EI MCMC algorithms find a significantly better minimum value than was found in the grid search used by Hoffman et al. 17 while running a fraction of the number of experiments.
4.3 Motif Finding with Structured Support Vector Machines
In this example, we consider optimizing the learning parameters of MaxMargin MinEntropy M3E Models 18, which include Latent Structured Support Vector Machines 19 as a special case. Latent structured SVMs outperform SVMs on problems where they can explicitly model problemdependent hidden variables. A popular example task is the binary classification of pro tein DNA sequences 18, 20, 19. The hidden variable to be modeled is the unknown location of particular subsequences, or motifs, that are indicators of positive sequences.
Setting the hyperparameters, such as the regularisation term, C, of structured SVMs remains a chal lenge and these are typically set through a time consuming grid search procedure as is done in 18, 19. Indeed, Kumar et al. 20 avoided hyperparameter selection for this task as it was too computationally expensive. However, Miller et al. 18 demonstrate that results depend highly on the setting of the parameters, which differ for each protein. M3E models introduce an entropy term, parameterized by , which enables the model to outperform latent structured SVMs. This additional performance, however, comes at the expense of an additional problemdependent hyperparameter. We emulate the experiments of Miller et al. 18 for one protein with approximately 40 000 se quences. We explore 25 settings of the parameter C, on a log scale from 101 to 106, 14 settings of , on a log scale from 0.1 to 5 and the model convergence tolerance, 104,103,102,101. We ran a grid search over the 1400 possible combinations of these parameters, evaluating each over 5 random 5050 training and test splits.
In Figures 5a and 5b, we compare the randomized grid search to GP EI MCMC, GP EI per Second and their 3x parallelized versions, all constrained to the same points on the grid. Each algorithm was repeated 100 times and the mean and standard error are shown. We observe that the Bayesian optimization strategies are considerably more efficient than grid search which is the status quo. In this case, GP EI MCMC is superior to GP EI per Second in terms of function evaluations but GP EI per Second finds better parameters faster than GP EI MCMC as it learns to use a less strict convergence tolerance early on while exploring the other parameters. Indeed, 3x GP EI per second, is the least efficient in terms of function evaluations but finds better parameters faster than all the other algorithms. Figure 5c compares the use of various covariance functions in GP EI MCMC optimization on this problem, again repeating the optimization 100 times. It is clear that the selection
4The restriction of the search to the same grid was chosen for efficiency reasons: it allowed us to repeat the experiments several times efficiently, by first computing all function evaluations over the whole grid and reusing these values within each repeated experiment.
7

0.4
0.35
0.3
0.25
0.2 0
0.4
0.35 GP EI per Second
10
20
Function evaluations
40 50
GP EI MCMC
GP EI Opt
GP EI per Second
GP EI MCMC 3x Parallel Human Expert
GP EI MCMC
Min Function Value
Min function value
30
Figure 6: Validation error on the CIFAR10 data for different optimization strategies.
of an appropriate covariance significantly affects performance and the estimation of length scale parameters is critical. The assumption of the infinite differentiability as imposed by the commonly used squared exponential is too restrictive for this problem.
4.4 Convolutional Networks on CIFAR10
Neural networks and deep learning methods notoriously require careful tuning of numerous hyper parameters. Multilayer convolutional neural networks are an example of such a model for which a thorough exploration of architechtures and hyperparameters is beneficial, as demonstrated in Saxe et al. 21, but often computationally prohibitive. While Saxe et al. 21 demonstrate a methodology for efficiently exploring model architechtures, numerous hyperparameters, such as regularisation parameters, remain. In this empirical analysis, we tune nine hyperparameters of a threelayer con volutional network 22 on the CIFAR10 benchmark dataset using the code provided 5. This model has been carefully tuned by a human expert 22 to achieve a highly competitive result of 18 test error on the unaugmented data, which matches the published state of the art result 23 on CIFAR 10. The parameters we explore include the number of epochs to run the model, the learning rate, four weight costs one for each layer and the softmax output weights, and the width, scale and power of the response normalization on the pooling layers of the network.
We optimize over the nine parameters for each strategy on a withheld validation set and report the mean validation error and standard error over five separate randomly initialized runs. Results are presented in Figure 6 and contrasted with the average results achieved using the best parameters found by the expert. The best hyperparameters found by the GP EI MCMC approach achieve an error on the test set of 14.98, which is over 3 better than the expert and the state of the art on CIFAR10. The same procedure was repeated on the CIFAR10 data augmented with horizontal reflections and translations, similarly improving on the expert from 11 to 9.5 test error. To our knowledge this is the lowest error reported, compared to the 11 state of the art and a recently published 11.21 24 using similar methods, on the competitive CIFAR10 benchmark.
5 Conclusion
We presented methods for performing Bayesian optimization for hyperparameter selection of gen eral machine learning algorithms. We introduced a fully Bayesian treatment for EI, and algorithms for dealing with variable time regimes and running experiments in parallel. The effectiveness of our approaches were demonstrated on three challenging recently published problems spanning different areas of machine learning. The resulting Bayesian optimization finds better hyperparameters sig nificantly faster than the approaches used by the authors and surpasses a human expert at selecting hyperparameters on the competitive CIFAR10 dataset, beating the state of the art by over 3.
Acknowledgements
The authors thank Alex Krizhevsky, Hoffman et al. 17 and Miller et al. 18 for making their code and data available, and George Dahl for valuable feedback. This work was funded by DARPA Young Faculty Award N660011214219, NSERC and an Amazon AWS in Research grant.
References
1 Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117129, 1978.
5Available at: http:code.google.compcudaconvnet 8
0.3
0.25
0.2
GP EI Opt
GP EI MCMC 3x Parallel
0 10 20 30 40 50 60 70
Time Hours

2 D.R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 214:345383, 2001.
3 Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
4 Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 34:28792904, 2011.
5 James S. Bergstra, Re mi Bardenet, Yoshua Bengio, and Ba la zs Ke gl. Algorithms for hyper parameter optimization. In Advances in Neural Information Processing Systems 25. 2011.
6 Marc C. Kennedy and Anthony OHagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B Statistical Methodology, 633, 2001.
7 FrankHutter,HolgerH.Hoos,andKevinLeytonBrown.Sequentialmodelbasedoptimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011.
8 Nimalan Mahendran, Ziyu Wang, Firas Hamze, and Nando de Freitas. Adaptive mcmc with bayesian optimization. In AISTATS, 2012.
9 JamesBergstraandYoshuaBengio.Randomsearchforhyperparameteroptimization.Journal of Machine Learning Research, 13:281305, 2012.
10 Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of ex pensive cost functions, with application to active user modeling and hierarchical reinforcement learning. preprint, 2010. arXiv:1012.2599.
11 CarlE.RasmussenandChristopherWilliams.GaussianProcessesforMachineLearning.MIT Press, 2006.
12 H. J. Kushner. A new method for locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964.
13 IainMurrayandRyanP.Adams.SlicesamplingcovariancehyperparametersoflatentGaussian models. In Advances in Neural Information Processing Systems 24, pages 17231731. 2010.
14 Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS, 2005.
15 Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multitask Gaussian process prediction. In Advances in Neural Information Processing Systems 22, 2008.
16 David Ginsbourger and Rodolphe Le Riche. Dealing with asynchronicity in parallel Gaus sian process based global optimization. http:hal.archivesouvertes.fr hal00507632, 2010.
17 Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 24, 2010.
18 KevinMiller,M.PawanKumar,BenjaminPacker,DannyGoodman,andDaphneKoller.Max margin minentropy models. In AISTATS, 2012.
19 ChunNam John Yu and Thorsten Joachims. Learning structural SVMs with latent variables. In Proceedings of the 26th International Conference on Machine Learning, 2009.
20 M.PawanKumar,BenjaminPacker,andDaphneKoller.Selfpacedlearningforlatentvariable models. In Advances in Neural Information Processing Systems 25. 2010.
21 AndrewSaxe,PangWeiKoh,ZhenghaoChen,ManeeshBhand,BipinSuresh,andAndrewNg. On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.
22 Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009.
23 Adam Coates and Andrew Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems 25. 2011.
24 Dan Claudiu Ciresan, Ueli Meier, and Ju rgen Schmidhuber. Multicolumn deep neural net works for image classification. In Computer Vision and Pattern Recognition, 2012.
9

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] R algorithm deep learning html parallel graph statistic software network Bayesian cuda Practical Bayesian Optimization of Machine Learning Algorithms
$25