Bayesian non-parametrics and the probabilistic approach to modelling

Zoubin Ghahramani

Philos Trans A Math Phys Eng Sci. 2012 Dec 31;371(1984):20110553. doi: 10.1098/rsta.2011.0553. Print 2013 Feb 13.
Abstract

Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian non-parametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian non-parametrics. The survey covers the use of Bayesian non-parametrics for modelling unknown functions, density estimation, clustering, time-series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically, it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman's coalescent, Dirichlet diffusion trees and Wishart processes.
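
For concreteness, the rules of probability theory referred to above amount to a small set of equations. The following is a minimal summary in standard notation (a gloss of ours, not reproduced from the article): for a model m with parameters θ and observed data D,

    % Parameter learning: Bayes' rule gives the posterior over the parameters.
    P(\theta \mid \mathcal{D}, m) = \frac{P(\mathcal{D} \mid \theta, m)\, P(\theta \mid m)}{P(\mathcal{D} \mid m)}

    % Prediction: average the likelihood of new data x over the parameter posterior.
    P(x \mid \mathcal{D}, m) = \int P(x \mid \theta, \mathcal{D}, m)\, P(\theta \mid \mathcal{D}, m)\, \mathrm{d}\theta

    % Model comparison: models are scored by the marginal likelihood (evidence),
    % obtained by integrating the likelihood over the prior.
    P(\mathcal{D} \mid m) = \int P(\mathcal{D} \mid \theta, m)\, P(\theta \mid m)\, \mathrm{d}\theta,
    \qquad
    P(m \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid m)\, P(m)}{P(\mathcal{D})}

The marginal likelihood P(D|m) in the last line is the quantity plotted in figure 1b and sketched schematically in figure 2.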


Figures

Figure 1.
Marginal likelihoods, Occam’s razor and overfitting: consider modelling a function y=f(x)+ϵ describing the relationship between some input variable x and some output or response variable y. (a) The red dots in the plots on the left-hand side are a dataset of eight (x,y) pairs of points. There are many possible functions f that could model the given data. Let us consider polynomials of different order, ranging from constant (M=0), linear (M=1), quadratic (M=2), etc., to seventh order (M=7). The blue curves depict maximum-likelihood polynomials fit to the data under Gaussian noise assumptions (i.e. least-squares fits). Clearly, the M=7 polynomial can fit the data perfectly, but it seems to be overfitting wildly, predicting that the function will shoot off up or down between neighbouring observed data points. By contrast, the constant polynomial may be underfitting, in the sense that it might not pick up some of the structure in the data. The green curves indicate 20 random samples from the Bayesian posterior of polynomials of different order given this data. A Gaussian prior was used for the coefficients, and an inverse gamma prior on the noise variance (these conjugate choices mean that the posterior can be analytically integrated). The samples show that there is considerable posterior uncertainty given the data, and also that the maximum-likelihood estimate can be very different from a typical sample from the posterior. (b) The normalized model evidence or marginal likelihood for this model, P(Y|M), is plotted as a function of the model order, where the dataset Y consists of the eight observed output y values. Note that model orders ranging from M=0 to M=3 have considerably higher marginal likelihood than other model orders, which seems plausible given the data. Higher-order models, M>3, have relatively much smaller marginal likelihood, which is not visible on this scale. The decrease in marginal likelihood as a function of model order is a reflection of the automatic Occam’s razor that results from Bayesian marginalization.
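
The construction behind this figure can be reproduced approximately in a few lines. The sketch below is not the paper's code or data: it uses made-up (x, y) points and, for simplicity, a fixed noise variance sigma2 in place of the inverse-gamma prior mentioned in the caption, together with a zero-mean Gaussian prior of precision alpha on the polynomial coefficients.

    # Approximate reconstruction of the figure 1 experiment (assumptions noted above):
    # least-squares polynomial fits of order M = 0..7, plus the Bayesian log evidence
    # log P(y | M) obtained by integrating the coefficients out analytically.
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 8)                          # placeholder inputs (8 points)
    y = np.sin(3 * x) + 0.1 * rng.standard_normal(8)   # placeholder responses

    sigma2 = 0.1 ** 2   # fixed noise variance (an assumption, not the caption's prior)
    alpha = 1.0         # precision of the Gaussian prior on the coefficients

    for M in range(8):
        Phi = np.vander(x, M + 1, increasing=True)     # design matrix [1, x, ..., x^M]

        # Maximum-likelihood (least-squares) fit, as for the blue curves in panel (a).
        w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

        # With w ~ N(0, alpha^-1 I) and Gaussian noise, the marginal distribution of y
        # is N(0, sigma2 I + alpha^-1 Phi Phi^T); its density at the observed y is the
        # evidence P(y | M) plotted (normalized) in panel (b).
        cov = sigma2 * np.eye(len(x)) + Phi @ Phi.T / alpha
        log_evidence = multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(y)

        print(f"M={M}  log evidence = {log_evidence:7.2f}")

With data of this kind, the log evidence typically peaks at a low-to-moderate order and falls off towards M=7, mirroring the automatic Occam’s razor of panel (b), even though the least-squares fit error keeps decreasing with M.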
Figure 2.
An illustration of Occam’s razor. Consider all possible datasets of some fixed size n. Competing probabilistic models correspond to alternative distributions over the datasets. Here, we have illustrated three possible models that spread their probability mass in different ways over these possible datasets. A complex model (shown in blue) spreads its mass over many more possible datasets, whereas a simple model (shown in green) concentrates its mass on a smaller fraction of possible data. Because probabilities have to sum to one, the complex model spreads its mass at the cost of not being able to model simple datasets as well as a simple model; this normalization is what results in an automatic Occam’s razor. Given any particular dataset, here indicated by the dotted line, we can use the marginal likelihood to reject both overly simple and overly complex models. This figure is inspired by a figure from MacKay [10], and an actual realization of this figure on a toy classification problem is discussed in Murray & Ghahramani [11].
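
A toy numerical analogue of this picture (a construction of ours, not the classification example from Murray & Ghahramani [11]) is to enumerate every binary dataset of length n and compare how two models spread probability over them: a simple model S, a fair coin with no free parameters, and a more flexible model C, a coin whose unknown bias is integrated out under a uniform Beta(1,1) prior.

    # Each model assigns a probability to every possible dataset; the masses must sum to 1.
    from itertools import product
    from math import comb

    n = 5
    p_simple, p_complex = {}, {}
    for bits in product([0, 1], repeat=n):
        k = sum(bits)                              # number of ones in this dataset
        p_simple[bits] = 0.5 ** n                  # model S: same mass on every dataset
        # model C: integral of theta^k (1-theta)^(n-k) over theta in [0,1]
        #          = k! (n-k)! / (n+1)!  =  1 / ((n+1) * C(n, k))
        p_complex[bits] = 1.0 / ((n + 1) * comb(n, k))

    print(sum(p_simple.values()), sum(p_complex.values()))        # both sum to 1.0
    print(p_simple[(1,) * n], p_complex[(1,) * n])                # extreme dataset: C wins
    print(p_simple[(1, 0, 1, 0, 1)], p_complex[(1, 0, 1, 0, 1)])  # balanced dataset: S wins

Because C must spread its unit of probability over a wider range of datasets, it assigns less mass than S to the unremarkable balanced datasets and more to the extreme ones, which is exactly the trade-off the figure illustrates.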
Figure 3.
A sample from an IBP (Indian buffet process) matrix, with columns reordered. Each row has, on average, 10 ones. Note the logarithmic growth of the number of non-zero columns with the number of rows. In the ‘restaurant’ analogy, rows correspond to customers entering a buffet with infinitely many dishes and columns to the dishes they sample; see the original IBP papers for details.
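
A minimal sketch of the standard IBP generative scheme is given below (this is not the code used to produce the figure, and it omits the reordering of columns into ‘left-ordered’ form): customer i takes each previously sampled dish k with probability m_k/i, where m_k is the number of earlier customers who took dish k, and then tries a Poisson(alpha/i) number of new dishes; alpha = 10 matches the caption's ten ones per row on average.

    # Sample a binary matrix Z from the Indian buffet process: rows are customers,
    # columns are dishes.  The expected number of ones per row is alpha, and the
    # number of non-zero columns grows roughly as alpha * log(number of rows).
    import numpy as np

    def sample_ibp(num_customers, alpha=10.0, seed=0):
        rng = np.random.default_rng(seed)
        dish_counts = []                   # m_k: how many customers have taken dish k
        rows = []
        for i in range(1, num_customers + 1):
            row = [rng.random() < m / i for m in dish_counts]   # revisit popular dishes
            for k, taken in enumerate(row):
                dish_counts[k] += taken
            new_dishes = int(rng.poisson(alpha / i))            # try some new dishes
            row.extend([True] * new_dishes)
            dish_counts.extend([1] * new_dishes)
            rows.append(row)
        Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
        for i, row in enumerate(rows):
            Z[i, :len(row)] = row
        return Z

    Z = sample_ibp(100)
    print(Z.shape, Z.sum(axis=1).mean())   # average number of ones per row is close to 10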
Figure 4.
A diagram representing how some models relate to each other. We start from finite mixture models and consider three different ways of extending them. Orange arrows correspond to time-series versions of static (iid) models. Blue arrows correspond to Bayesian non-parametric versions of finite parametric models. Green arrows correspond to factorial (overlapping subset) versions of clustering (non-overlapping) models. ifHMM, infinite factorial hidden Markov model.

References

    1. Wolpert DM, Ghahramani Z, Jordan MI. 1995. An internal model for sensorimotor integration. Science 269, 1880–1882. doi:10.1126/science.7569931
    2. Knill D, Richards W. 1996. Perception as Bayesian inference. Cambridge, UK: Cambridge University Press.
    3. Griffiths TL, Tenenbaum JB. 2006. Optimal predictions in everyday cognition. Psychol. Sci. 17, 767–773. doi:10.1111/j.1467-9280.2006.01780.x
    4. Doob JL. 1949. Application of the theory of martingales. Coll. Int. Centre Nat. Res. Sci. 13, 23–27.
    5. Le Cam L. 1986. Asymptotic methods in statistical decision theory. Berlin, Germany: Springer.