Bayesian model selection
Consider the regression problem, where we want to predict the values of an unknown function $y\colon \mathbb{R}^d \to \mathbb{R}$ given examples $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$ to serve as training data. In Bayesian linear regression, we made the following assumption about $y(\mathbf{x})$:
$$y(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w} + \varepsilon(\mathbf{x}), \qquad (1)$$
where $\phi(\mathbf{x})$ is a (now explicitly written) feature expansion of $\mathbf{x}$. We proceed in the normal Bayesian way: we place Gaussian priors on our unknowns, the parameters $\mathbf{w}$ and the residuals $\varepsilon$, then derive the posterior distribution over $\mathbf{w}$ given $\mathcal{D}$, which we use to make predictions.
One question left unanswered is how to choose a good feature expansion function $\phi(\mathbf{x})$. For example, a purely linear model could use $\phi(x) = [1, x]^\top$, whereas a quadratic model could use $\phi(x) = [1, x, x^2]^\top$, etc. In general, arbitrary feature expansions are allowed. How can I select between them? Even more generally, how do I select whether I should use linear regression or a completely different probabilistic model to explain my data? These are questions of model selection, and naturally there is a Bayesian approach to them.
Before we continue our discussion of model selection, we will first define the word model, which is often used loosely without explicit definition. A model is a parametric family of probability distributions, each of which can explain the observed data. Another way to explain the concept of a model is that if we have chosen a likelihood $p(\mathcal{D} \mid \theta)$ for our data, which depends on a parameter $\theta$, then the model is the set of all such likelihoods, one distribution over $\mathcal{D}$ for every possible value of the parameter $\theta$.
In the case of linear regression, the weight vector $\mathbf{w}$ defines the parametric family, and the model is the set of distributions
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \mathcal{N}(\mathbf{y};\, \mathbf{X}\mathbf{w},\, \sigma^2 \mathbf{I}),$$
indexed by all possible $\mathbf{w}$. Each one of these is a potential explanation of the observed values $\mathbf{y}$ given $\mathbf{X}$. In the case of flipping a coin $n$ times with an unknown bias $\theta$ and observing the number of heads $x$, the model is
$$p(x \mid n, \theta) = \text{Binomial}(x; n, \theta),$$
where there is one binomial distribution for every possible $\theta \in (0, 1)$. In the Bayesian method, we maintain a belief over which elements of the model we consider plausible by reasoning about $p(\theta \mid \mathcal{D})$ via Bayes' theorem.
Suppose now that I have at my disposal a finite set of models $\{\mathcal{M}_i\}_i$ that I may use to explain my observed data $\mathcal{D}$, and let us write $\theta_i$ for the parameters of model $\mathcal{M}_i$. How do we know which model to prefer? We work out the posterior probability over the models via Bayes' theorem! We have:
$$\Pr(\mathcal{M}_i \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_i)\,\Pr(\mathcal{M}_i)}{\sum_j p(\mathcal{D} \mid \mathcal{M}_j)\,\Pr(\mathcal{M}_j)}.$$
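As a small numerical illustration (the evidence values here are hypothetical placeholders, not computed from any particular dataset), the posterior over a finite set of models is just this normalization of evidence times prior:

```python
import numpy as np

# Hypothetical evidence values p(D | M_i) for three candidate models,
# paired with a uniform model prior Pr(M_i) = 1/3.
evidence = np.array([2.1e-5, 8.7e-5, 3.0e-6])
prior = np.full(3, 1 / 3)

# Bayes' theorem over models: posterior proportional to evidence times prior.
posterior = evidence * prior
posterior /= posterior.sum()
print(posterior)   # here the second model receives most of the posterior mass
```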
Here $\Pr(\mathcal{M}_i)$ is a prior distribution over models that we have selected; a common practice is to set this to a uniform distribution over the models. The value $p(\mathcal{D} \mid \mathcal{M}_i)$ may also be written in a more familiar form:
$$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid \theta_i, \mathcal{M}_i)\, p(\theta_i \mid \mathcal{M}_i)\, \mathrm{d}\theta_i.$$
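Because the evidence is just the expectation of the likelihood under the parameter prior, a generic (if often high-variance) way to approximate it is simple Monte Carlo over prior samples. The sketch below assumes user-supplied functions for sampling the prior and evaluating the likelihood; both names are placeholders, not part of any particular library:

```python
import numpy as np

def monte_carlo_evidence(sample_prior, likelihood, num_samples=100_000):
    """Estimate p(D | M) = E_{theta ~ p(theta | M)}[p(D | theta, M)].

    sample_prior(n): returns n parameter samples drawn from p(theta | M).
    likelihood(thetas): returns p(D | theta, M) for each sample (vectorized).
    """
    thetas = sample_prior(num_samples)
    return np.mean(likelihood(thetas))
```

For the coin model $\mathcal{M}_2$ in the example below, sample_prior would draw $\theta$ uniformly from $(0, 1)$ and likelihood would evaluate the binomial pmf at the observed number of heads.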
This is exactly the denominator when applying Bayes' theorem to find the posterior $p(\theta_i \mid \mathcal{D}, \mathcal{M}_i)$!
$$p(\theta_i \mid \mathcal{D}, \mathcal{M}_i) = \frac{p(\mathcal{D} \mid \theta_i, \mathcal{M}_i)\, p(\theta_i \mid \mathcal{M}_i)}{\int p(\mathcal{D} \mid \theta_i, \mathcal{M}_i)\, p(\theta_i \mid \mathcal{M}_i)\, \mathrm{d}\theta_i} = \frac{p(\mathcal{D} \mid \theta_i, \mathcal{M}_i)\, p(\theta_i \mid \mathcal{M}_i)}{p(\mathcal{D} \mid \mathcal{M}_i)},$$
where we have simply conditioned on $\mathcal{M}_i$ to be explicit. In the context of model selection, the term $p(\mathcal{D} \mid \mathcal{M}_i)$ is known as the model evidence or simply evidence. One interpretation of the model evidence is the probability that your model could have generated the observed data, under the chosen prior belief over its parameters $\theta_i$.
Suppose now that we have exactly two models for the observed data that we wish to compare: $\mathcal{M}_1$ and $\mathcal{M}_2$, with corresponding parameter vectors $\theta_1$ and $\theta_2$ and prior probabilities $\Pr(\mathcal{M}_1)$ and $\Pr(\mathcal{M}_2)$. In this case it is easiest to compute the posterior odds, the ratio of the models' probabilities given the data:
$$\frac{\Pr(\mathcal{M}_1 \mid \mathcal{D})}{\Pr(\mathcal{M}_2 \mid \mathcal{D})} = \frac{\Pr(\mathcal{M}_1)\, p(\mathcal{D} \mid \mathcal{M}_1)}{\Pr(\mathcal{M}_2)\, p(\mathcal{D} \mid \mathcal{M}_2)} = \frac{\Pr(\mathcal{M}_1) \int p(\mathcal{D} \mid \theta_1, \mathcal{M}_1)\, p(\theta_1 \mid \mathcal{M}_1)\, \mathrm{d}\theta_1}{\Pr(\mathcal{M}_2) \int p(\mathcal{D} \mid \theta_2, \mathcal{M}_2)\, p(\theta_2 \mid \mathcal{M}_2)\, \mathrm{d}\theta_2},$$
which is simply the prior odds multiplied by the ratio of the evidence for each model. The latter quantity is also called the Bayes factor in favor of $\mathcal{M}_1$. Publishing Bayes factors allows another practitioner to easily substitute their own model priors and derive their own conclusions about the models being considered.
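As a trivial sketch of that workflow, a reader of a published Bayes factor only needs to multiply it by their own prior odds:

```python
def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds Pr(M1 | D) / Pr(M2 | D) = prior odds times Bayes factor."""
    return prior_odds * bayes_factor

# With equal model priors (prior odds of 1) and a Bayes factor of 1.2 in favor of M1:
print(posterior_odds(1.0, 1.2))   # 1.2
```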
Example
Wikipedia gives a truly excellent example of Bayesian model selection in practice.¹ Suppose I am presented with a coin and want to compare two models for explaining its behavior. The first model, $\mathcal{M}_1$, assumes that the heads probability is fixed to $1/2$. Notice that this model does not have any parameters. The second model, $\mathcal{M}_2$, assumes that the heads probability is an unknown value $\theta \in (0, 1)$, with a uniform prior on $\theta$: $p(\theta \mid \mathcal{M}_2) = 1$ (this is equivalent to a beta prior on $\theta$ with $\alpha = \beta = 1$). For simplicity, we choose a uniform model prior: $\Pr(\mathcal{M}_1) = \Pr(\mathcal{M}_2) = 1/2$.
Suppose we flip the coin $n = 200$ times and observe $x = 115$ heads. Which model should we prefer in light of this data? We compute the model evidence for each model. The model evidence for $\mathcal{M}_1$ is quite straightforward, as it has no parameters:
$$\Pr(x \mid n, \mathcal{M}_1) = \text{Binomial}(x; n, \tfrac{1}{2}) = \binom{200}{115} \frac{1}{2^{200}} \approx 0.005956.$$
The model evidence for $\mathcal{M}_2$ requires integrating over the parameter $\theta$:
$$\Pr(x \mid n, \mathcal{M}_2) = \int_0^1 \Pr(x \mid n, \theta, \mathcal{M}_2)\, p(\theta \mid \mathcal{M}_2)\, \mathrm{d}\theta = \int_0^1 \binom{200}{115} \theta^{115} (1 - \theta)^{85}\, \mathrm{d}\theta = \frac{1}{201} \approx 0.004975.$$
The Bayes factor in favor of $\mathcal{M}_1$ is approximately 1.2, so the data give very weak evidence in favor of the simpler model $\mathcal{M}_1$.
¹ http://en.wikipedia.org/wiki/Bayes_factor#Example
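These evidence values and the Bayes factor are easy to verify numerically; a quick sketch, assuming scipy is available:

```python
from scipy.integrate import quad
from scipy.stats import binom

n, x = 200, 115

# Evidence for M1: theta is fixed at 1/2, so no integration is required.
evidence_m1 = binom.pmf(x, n, 0.5)                                  # ~0.005956

# Evidence for M2: integrate the binomial likelihood against the uniform prior on theta;
# analytically this is 1/(n + 1) = 1/201.
evidence_m2, _ = quad(lambda theta: binom.pmf(x, n, theta), 0, 1)   # ~0.004975

print(evidence_m1, evidence_m2, evidence_m1 / evidence_m2)          # Bayes factor ~1.2
```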
An interesting aside here is that a frequentist hypothesis test would reject the null hypothesis $\theta = 1/2$ at the 0.05 level. The probability of generating at least 115 heads under model $\mathcal{M}_1$ is approximately 0.02; similarly, the probability of generating at least 115 tails is also 0.02, so a two-sided test would give a $p$-value of approximately 0.04.
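The tail probabilities in this aside can be reproduced with a short computation (again assuming scipy):

```python
from scipy.stats import binom

n, x = 200, 115

# P(X >= 115) under Binomial(200, 1/2); sf(k) gives P(X > k), hence the x - 1.
upper_tail = binom.sf(x - 1, n, 0.5)    # ~0.020
print(upper_tail, 2 * upper_tail)       # two-sided p-value ~0.04
```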
Occam's razor
One spin on Bayesian model selection is that it automatically gives a preference towards simpler models, in line with Occam's razor. One way to see this is to consider the model evidence $p(\mathcal{D} \mid \mathcal{M})$ as a probability distribution over datasets $\mathcal{D}$. More complex models can explain more datasets, so the support of this distribution is wider in the sample space. But the distribution must also normalize over the sample space, so we pay a price for this generality. When moving from a simpler model to a more complex one, the probability of some datasets that are well explained by the simpler model must inevitably decrease to give up probability mass for the newly explained datasets in the widened support of the more complex model. The model selection process then drives us to select the model that is just complex enough to explain the data at hand: a built-in Occam's razor.
In the coin-flipping example above, model $\mathcal{M}_1$ can only explain datasets with empirical heads probability reasonably near $1/2$. An observation of 200 heads, for example, would have astronomically small probability under this model. The second model $\mathcal{M}_2$ can explain any set of observations by selecting an appropriate $\theta$. The price for this generality, though, is that datasets with a roughly equal number of heads and tails have a smaller prior probability under the model than before.
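This trade-off can be made concrete by evaluating each model's evidence for every possible dataset $x = 0, \dotsc, 200$; a short sketch, assuming scipy (under $\mathcal{M}_2$ every outcome has probability exactly $1/201$, so the numerical integral is only a sanity check):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom

n = 200
xs = np.arange(n + 1)

# p(x | M1): theta fixed at 1/2 -- sharply peaked around x = 100.
p_m1 = binom.pmf(xs, n, 0.5)

# p(x | M2): uniform prior on theta -- a flat 1/(n + 1) over all outcomes.
p_m2 = np.array([quad(lambda t, x=x: binom.pmf(x, n, t), 0, 1)[0] for x in xs])

print(p_m1.sum(), p_m2.sum())    # both distributions over datasets sum to ~1
print(p_m1[100], p_m2[100])      # near-balanced data: M1 assigns much higher probability
print(p_m1[200], p_m2[200])      # 200 heads: astronomically small under M1, 1/201 under M2
```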
Model selection for Bayesian linear regression
A common application of model selection is selecting between feature expansion functions $\phi(\mathbf{x})$ in Bayesian linear regression. Here the model $\mathcal{M}_i$ could, for example, correspond to order-$i$ polynomial regression with
$$\phi_i(x) = [1, x, x^2, \dotsc, x^i]^\top.$$
After selecting a set of these models to compare, as well as a prior probability for each, the only remaining task is to compute the evidence for each model given the observed data $(\mathbf{X}, \mathbf{y})$. In our discussion of Bayesian linear regression, we have actually already computed the desired quantity:
$$p(\mathbf{y} \mid \mathbf{X}, \sigma^2, \mathcal{M}_i) = \mathcal{N}\bigl(\mathbf{y};\, \Phi_i(\mathbf{X})\,\boldsymbol{\mu},\, \Phi_i(\mathbf{X})\,\Sigma\,\Phi_i(\mathbf{X})^\top + \sigma^2 \mathbf{I}\bigr),$$
where I have explicitly written the basis expansion $\Phi_i$, and $(\boldsymbol{\mu}, \Sigma)$ are the mean and covariance of the Gaussian prior on $\mathbf{w}$.
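A short sketch of computing this evidence for several polynomial orders is below. It assumes one-dimensional inputs, a standard normal prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and a known noise variance $\sigma^2$; these choices, the synthetic data, and the function names are all assumptions made for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def poly_features(x, order):
    """Phi_i(X): order-i polynomial expansion [1, x, x^2, ..., x^i] of 1-d inputs."""
    return np.vander(x, N=order + 1, increasing=True)

def log_evidence(x, y, order, sigma2=0.01):
    """log p(y | X, sigma^2, M_i) with prior w ~ N(0, I): N(y; 0, Phi Phi^T + sigma^2 I)."""
    Phi = poly_features(x, order)
    cov = Phi @ Phi.T + sigma2 * np.eye(len(y))
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).logpdf(y)

# Synthetic quadratic data; the order-2 model should typically receive the highest evidence.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 0.5 * x - 0.3 * x**2 + rng.normal(scale=0.1, size=x.shape)
print([round(log_evidence(x, y, k), 2) for k in range(5)])
```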
Note that the model $\mathcal{M}_i$ can also easily explain all datasets well explained by the models $\mathcal{M}_j$ for $j < i$, simply by setting the weights on the higher-order terms to zero. Again, however, the simpler model will be preferred due to the Occam's razor effect described above.
Bayesian model averaging
Note that a full Bayesian treatment of a problem would eschew model selection entirely. Instead, when making predictions, we should theoretically use the sum rule to marginalize the unknown model, e.g.:
$$p(y \mid \mathbf{x}, \mathcal{D}) = \sum_i p(y \mid \mathbf{x}, \mathcal{D}, \mathcal{M}_i)\,\Pr(\mathcal{M}_i \mid \mathcal{D}).$$
Such an approach is called Bayesian model averaging. Although this is sometimes seen, model selection is still used widely in practice. The reason is that the computational overhead of using a single model is much lower than having to continually retrain multiple models, and that Bayesian model averaging uses a mixture distribution for predictions, which can have annoying analytic properties; for example, the predictive distribution could be multimodal.
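A minimal sketch of the averaging step, assuming per-model predictive densities (evaluated on a common grid of test outputs) and model posteriors are already available, e.g. from the polynomial models above:

```python
import numpy as np

def model_averaged_predictive(predictive_densities, model_posteriors):
    """Mixture predictive p(y* | x*, D) = sum_i p(y* | x*, D, M_i) Pr(M_i | D).

    predictive_densities: shape (num_models, num_grid_points), row i evaluating
        p(y* | x*, D, M_i) on a grid of y* values.
    model_posteriors: shape (num_models,), nonnegative and summing to one.
    """
    return np.asarray(model_posteriors) @ np.asarray(predictive_densities)
```

When the posterior mass is split across models whose predictive means disagree, this mixture is exactly where the multimodality mentioned above appears.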