1 Introduction

An indifferent belief about the bias of a coin can be expressed in various ways. One way is to use elementary sentences like

(1)

Another way is to give a probability measure on the space [0, 1] of biases:

(2)

This sentence (2) is more conceptually challenging than (1), but we may deduce (1) from (2) using Lebesgue integration:

figure a

where is the binomial coefficient. Here we have considered indifferent beliefs as a first illustration, but more generally de Finetti’s theorem gives a passage from sentences about beliefs like (1) to sentences like (2).

Theorem 1

(De Finetti, Informal). If we assign a probability to all finite experiment outcomes involving a coin (such as (1)), and these probabilities are all suitably consistent, then by doing so we have given a probability measure on [0, 1] (such as the uniform distribution, (2)).

In this paper we give two categorical/coalgebraic formulations of this theorem.

  • In Theorem 8, we state the result as: the unit interval [0, 1] is an inverse limit over a chain of finite spaces in a category of probability channels. In particular, a cone over this diagram is an assignment of finite probabilities such as (1) that is suitably consistent.

  • In Theorem 12, we state the result as: [0, 1] carries a final object in a category of (exchangeable) coalgebras for the functor \(2\times (-)\) in the category of probability channels. A coalgebra can be thought of as a process that coinductively generates probabilities of the form (1). A coalgebra is a little more data than strictly necessary, but this extra data corresponds to the concept of sufficient statistics which is useful in Bayesian inference.

Multisets and Probability Distributions. A multiset is a ‘set’ in which elements may occur multiple times. We can write such a multiset over a set \(\{R,G,B\}\), representing red, green and blue balls, as:

$$\begin{aligned} 3|{}R{}\rangle + 5|{}G{}\rangle + 2|{}B{}\rangle . \end{aligned}$$

This multiset can be seen as an urn containing three red, five green and two blue balls.

A probability distribution is a convex combination of elements, where the frequencies (or probabilities) with which elements occur are numbers in the unit interval [0, 1] that add up to one, as in:

$$\begin{aligned} \textstyle \frac{3}{10}|{}R{}\rangle + \frac{1}{2}|{}G{}\rangle + \frac{1}{5}|{}B{}\rangle . \end{aligned}$$

Obviously, distributions, whether discrete (as above) or continuous, play a central role in probability theory. Multisets also play a crucial role: for instance, an urn with coloured balls is very naturally represented as a multiset. Similarly, multiple data items in learning scenarios are naturally captured by multisets (see e.g.   [12]). The two concepts are related, since any multiset gives rise to a probability distribution, informally by thinking of it as an urn and then drawing from it (sampling with replacement).
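This passage from a multiset to a distribution can be made concrete in a few lines of code. The following minimal sketch (the helper name `normalize` is ours) turns the urn \(3|{}R{}\rangle + 5|{}G{}\rangle + 2|{}B{}\rangle\) into the distribution \(\frac{3}{10}|{}R{}\rangle + \frac{1}{2}|{}G{}\rangle + \frac{1}{5}|{}B{}\rangle\) by dividing each multiplicity by the size of the urn:

```python
from collections import Counter
from fractions import Fraction

def normalize(urn):
    """Turn a multiset (urn) into the distribution obtained by a single
    draw with replacement: divide each multiplicity by the urn's size."""
    size = sum(urn.values())
    return {x: Fraction(n, size) for x, n in urn.items()}

# the urn 3|R> + 5|G> + 2|B> from the text
urn = Counter({"R": 3, "G": 5, "B": 2})
dist = normalize(urn)
# yields the distribution 3/10|R> + 1/2|G> + 1/5|B>
assert dist == {"R": Fraction(3, 10), "G": Fraction(1, 2), "B": Fraction(1, 5)}
```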

This paper develops another example of how the interplay between multisets and distributions involves interesting structure, in an emerging area of categorical probability theory.

Beliefs as Cones. In brief, following an experiment involving K tosses of a coin, we can set up an urn that approximately simulates the coin according to the frequencies that we have seen. So if we see 5 heads out of 7 tosses, we put 5 black balls into an urn together with 2 white balls, so that the chance of black (=heads) is \(\frac{5}{7}\). We can describe a belief about a coin by describing the probabilities of ending up with certain urns (multisets) after certain experiments. Clearly, if we draw and discard a ball from an urn, then this is the same as stopping the experiment one toss sooner. So a consistent belief about our coin determines a cone over the chain

(3)

in the category of finite sets and channels (aka probability kernels). Here \(\mathcal {M}[K](2)\) is the set of multisets over 2 of size K, i.e.  urns containing K balls coloured black and white, and the leftward arrows describe randomly drawing and discarding a ball. We use the notation for channels, a.k.a. Kleisli map, see below for details. Our formulation of de Finetti’s theorem says that the limit of this chain is [0, 1]. So to give a belief, i.e.  a cone over (3), is to give a probability measure on [0, 1]. We set up the theory of multisets and cones in Sect. 3, proving the theorem in Sect. 6.

Exchangeable Coalgebras. In many situations, it is helpful to describe a belief about a coin in terms of a stateful generative process, i.e.  a channel (aka probability kernel) , for some space X, which gives coinductively a probability for a sequence of outcomes. This is a coalgebra for the functor \(2\times (-)\). An important example is Pólya’s urn, which is an urn-based process, but one where the urn changes over time. Another important example puts the carrier set \(X=[0,1]\) as the set of possible biases, and the process merely tosses a coin with the given bias. We show that [0, 1] carries the final object among the ‘exchangeable’ coalgebras. Thus there is a canonical channel , taking the belief described by a generative process such as Pólya’s urn to a probability measure on [0, 1] such as the uniform distribution. We set up the theory of exchangeable coalgebras in Sect. 4, proving the final coalgebra theorem in Sect. 7 by using the limit reformulation of de Finetti’s theorem.

2 Background on Multisets and Distributions

Multisets. A multiset (or bag) is a ‘set’ in which elements may occur multiple times. We shall write \(\mathcal {M}(X)\) for the set of (finite) multisets with elements from a set X. Elements \(\varphi \in \mathcal {M}(X)\) may be described as functions \(\varphi :X \rightarrow \mathbb {N}\) with finite support, that is, for which the set \(\{x\in X~|~\varphi (x)\ne 0\}\) is finite. Alternatively, multisets can be described as formal sums \(\sum _{i} n_{i}|{}x_i{}\rangle \), where \(n_{i}\in \mathbb {N}\) describes the multiplicity of \(x_{i}\in X\), that is, the number of times that the element \(x_i\) occurs in the multiset. The mapping \(X \mapsto \mathcal {M}(X)\) is a monad on the category \(\mathbf {Sets} \) of sets and functions. In fact, \(\mathcal {M}(X)\) with pointwise addition and empty multiset \(\mathbf {0}\), is the free commutative monoid on X.

Frequently we need to have a grip on the total number of elements occurring in a multiset. Therefore we write, for \(K\in \mathbb {N}\),

figure b

Clearly, \(\mathcal {M}[0](X)\) is a singleton, containing only the empty multiset \(\mathbf {0}\). Each \(\mathcal {M}[K](X)\) is a quotient of the set of sequences \(X^K\) via an accumulator map:

(4)

where \(\mathbf {1}_{x_i}\) is the indicator function sending \(x\in X\) to 1 if \(x=x_{i}\) and to 0 otherwise. In particular, for \(X=2\), . The map is permutation-invariant, in the sense that:

figure c

Such permutation-invariance, or exchangeability, plays an important role in de Finetti’s work  [5], see the end of Sect. 7.
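The accumulator map and its permutation-invariance can be sketched directly (the helper name `acc` follows the text; representing a multiset as a sorted tuple of (element, multiplicity) pairs is our choice, so that multisets can be compared for equality):

```python
from collections import Counter
from itertools import permutations

def acc(seq):
    """Accumulator map X^K -> M[K](X): forget the order of a sequence,
    keeping only multiplicities (as a sorted tuple, so it is comparable)."""
    return tuple(sorted(Counter(seq).items()))

seq = (1, 0, 1, 1)                      # a sequence in 2^4
assert acc(seq) == ((0, 1), (1, 3))     # the multiset 1|0> + 3|1>
# permutation-invariance: acc(pi . seq) = acc(seq) for every permutation pi
assert all(acc(p) == acc(seq) for p in permutations(seq))
```

For \(X = 2\) the value of `acc` is determined by the number of ones, matching the isomorphism \(\mathcal {M}[K](2) \cong \{0,1,\ldots ,K\}\) used later.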

Discrete Probability. A discrete probability distribution on a set X is a finite formal sum \(\sum _{i}r_{i}|{}x_i{}\rangle \) of elements \(x_{i}\in X\) with multiplicity \(r_{i}\in [0,1]\) such that \(\sum _{i}r_{i} = 1\). It can also be described as a function \(X \rightarrow [0,1]\) with finite support. We write \(\mathcal {D}(X)\) for the set of distributions on X. This gives a monad \(\mathcal {D}\) on \(\mathbf {Sets} \); it satisfies \(\mathcal {D}(1) \cong 1\) and \(\mathcal {D}(2) \cong [0,1]\), where \(1 = \{0\}\) and \(2 = \{0,1\}\).

Maps in the associated Kleisli category \(\mathcal {K}{}\ell (\mathcal {D})\) will be called channels and are written with a special arrow . Thus denotes a function \({f:X \rightarrow \mathcal {D}(Y)}\). For a distribution \(\omega \in \mathcal {D}(X)\) and a channel we define a new distribution and a new channel by Kleisli composition:

figure d

Discrete Probability over Multisets. For each number \(K\in \mathbb {N}\) and probability \(r\in [0,1]\) there is the familiar binomial distribution . It captures probabilities for iterated flips of a known coin, and is given by the convex sum:

figure e

The multiplicity probability before \(|{}k{}\rangle \) in this expression is the chance of getting k heads out of K coin flips, where each flip has fixed bias \(r\in [0,1]\). In this way we obtain a channel .
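As a sketch, the binomial channel can be tabulated exactly; under the isomorphism \(\mathcal {M}[K](2)\cong \{0,\ldots ,K\}\) it sends a bias r to a distribution on the number of heads (the function name `bn` is ours):

```python
from fractions import Fraction
from math import comb

def bn(K, r):
    """Binomial channel [0,1] -> D(M[K](2)): probability of k heads
    in K flips of a coin with fixed bias r."""
    return {k: comb(K, k) * r**k * (1 - r)**(K - k) for k in range(K + 1)}

d = bn(3, Fraction(1, 2))
assert d == {0: Fraction(1, 8), 1: Fraction(3, 8),
             2: Fraction(3, 8), 3: Fraction(1, 8)}
assert sum(d.values()) == 1   # the probabilities form a distribution
```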

More generally, one can define a multinomial channel:

It is defined as:

figure f

The binomial channel is a special case, since \([0,1]\cong \mathcal {D}(2)\) and \(\{0,1,\ldots ,K\} \cong \mathcal {M}[K](2)\), via \(k \mapsto k|{}0{}\rangle + (K-k)|{}1{}\rangle \). The binary case will play an important role in the sequel.

3 Drawing from an Urn

Recall our motivation. We have an unknown coin, and we are willing to give probabilities to the outcomes of different finite experiments with it. From this we ultimately want to say something about the coin. The only experiments we consider comprise tossing a coin a fixed number K times and recording the number of heads. The coin has no memory, and so the order of heads and tails doesn’t matter. So we can imagine an urn containing black balls and white balls, and in an experiment we start with an empty urn and put a black ball in the urn for each head and a white ball for each tail. The urn forgets the order in which the heads and tails appear. We can then describe our belief about the coin in terms of our beliefs about the probability of certain urns occurring in the different experiments. These probabilities should be consistent in the following informal sense, that connects the different experiments.

  • If we run the experiment with \(K+1\) tosses, and then discard a random ball from the urn, it should be the same as running the experiment with just K tosses.

In more detail: because the coin tosses are exchangeable, that is, the coin is memoryless, running the experiment with \(K+1\) tosses is the same as running the experiment with \(K+1\) tosses and then reporting some chosen permutation of the results. This justifies representing the outcome of the experiment with an unordered urn. It follows that the statistics of the experiment are the same if we choose a permutation of the results at random. To stop the experiment after K tosses, i.e. to discard the ball corresponding to the last toss, is thus to discard a ball uniformly at random.

As already mentioned: an urn filled with certain objects of the kind \(x_{1}, \ldots , x_{n}\) is very naturally described as a multiset \(\sum _{i}n_{i}|{}x_i{}\rangle \), where \(n_{i}\) is the number of objects \(x_i\) in the urn. Thus an urn containing objects from a set X is a multiset in \(\mathcal {M}(X)\). For instance, an urn with three black and two white balls can be described as a multiset \(3|{}B{}\rangle + 2|{}W{}\rangle \in \mathcal {M}(2)\), where B stands for black and W for white. We can thus formalise a belief about the K-th experiment, where we toss our coin K times, as a distribution \(\omega _K\in \mathcal {D}(\mathcal {M}[K](2))\), on urns with K black and white balls.

For the consistency property, we need to talk about drawing and discarding from an urn. We define a channel that draws a single item from an urn:

(5)

It is defined as:

figure g

Concretely, in the above example:

figure h

Notice that on the right-hand-side we use brackets \(|{}-{}\rangle \) inside brackets \(|{}-{}\rangle \). The outer \(|{}-{}\rangle \) are for \(\mathcal {D}\) and the inner ones for \(\mathcal {M}\).

We can now describe the ‘draw-and-delete’ channel, by post-composing with the second marginal:

Explicitly, .

Now the consistency property requires that the beliefs about the coin comprise a sequence of distributions \(\omega _K\in \mathcal {D}(\mathcal {M}[K](2))\) such that . Thus they should form a cone over the sequence

(6)

This means that for all K,

$$\begin{aligned} \omega _K(\varphi )\ =\ \textstyle \frac{\varphi (B)+1}{K+1}\cdot \omega _{K+1}(\varphi +|{}B{}\rangle ) + \frac{\varphi (W)+1}{K+1}\cdot \omega _{K+1}(\varphi +|{}W{}\rangle )\text{. } \end{aligned}$$
(7)
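The consistency condition (7) can be checked mechanically. The sketch below (helper names ours) encodes an urn \(k|{}B{}\rangle + (K+1-k)|{}W{}\rangle\) by its number k of black balls, pushes a belief about \(K+1\) tosses forward along draw-and-delete, and confirms that a binomial belief with fixed bias is consistent:

```python
from fractions import Fraction
from math import comb

def draw_delete(omega, K):
    """Push omega in D(M[K+1](2)) forward along draw-and-delete, encoding
    the urn k|B> + (K+1-k)|W> as the number k of black balls."""
    out = {k: Fraction(0) for k in range(K + 1)}
    for l, p in omega.items():
        if l >= 1:                                   # draw a black ball, discard it
            out[l - 1] += Fraction(l, K + 1) * p
        if l <= K:                                   # draw a white ball, discard it
            out[l] += Fraction(K + 1 - l, K + 1) * p
    return out

def bn(K, r):                                        # binomial belief with bias r
    return {k: comb(K, k) * r**k * (1 - r)**(K - k) for k in range(K + 1)}

r, K = Fraction(2, 5), 4
assert draw_delete(bn(K + 1, r), K) == bn(K, r)      # consistency of a known bias
```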

Known Bias is a Consistent Belief. If we know for sure that the bias of a coin is \(r\in [0,1]\), then we would put . This is a consistent belief, which is generalised in the following result to the multinomial case. This cone forms the basis for this paper.

Proposition 2

The multinomial channels form a cone for the chain of draw-and-delete channels, as in:

figure i

Our reformulation of the de Finetti theorem explains the sense in which the above cone is a limit when \(X=2\), see Theorem 8.

Proof

For \(\omega \in \mathcal {D}(X)\) and \(\varphi \in \mathcal {M}[K](X)\) we have:

figure j

Unknown Bias (Uniform) is a Consistent Belief. If we do not know the bias of our coin, it may be reasonable to say that all the \((K+1)\) outcomes in \(\mathcal {M}[K](2)\) are equiprobable. This determines the discrete uniform distribution \(\upsilon _K\in \mathcal {D}(\mathcal {M}[K](2))\),

figure k

Proposition 3

The discrete uniform distributions \(\upsilon _K\) form a cone for the chain of draw-and-delete channels, as in:

figure l

Proof

For \(\varphi \in \mathcal {M}[K](2)\) we have:

figure m

4 Coalgebras and Pólya’s Urn

In the previous section we have seen two examples of cones over the chain (6): one for a coin with known bias (Proposition 2), and one where the bias of the coin is uniformly distributed (Proposition 3).

In practice, it is helpful to describe our belief about the coin in terms of a data generating process. This is formalised as a channel , i.e.  as a coalgebra for the functor \(2\times (-)\) on the category of channels. The idea is that our belief about the coin is captured by some parameters, which are the states of the coalgebra \(x\in X\). The distribution h(x) describes our belief about the outcome of the next coin toss, but also provides a new belief state according to the outcome. For example, if we have a uniform belief about the bias of the coin, we predict heads and tails with equal probability, but after the toss our belief is refined: if we saw heads, we believe the coin to be more biased towards heads.

Pólya’s urn is an important example of such a coalgebra. We describe it as a multiset over \(2 = \{0,1\}\) with a certain extraction and re-fill operation. The number 0 corresponds to black balls, and 1 to white ones, so that a multiset \(\varphi \in \mathcal {M}(2)\) captures an urn with \(\varphi (0)\) black and \(\varphi (1)\) white ones. When a ball of color \(x\in 2\) is extracted, it is put back together with an extra ball of the same color. We choose to describe it as a coalgebra on the set \(\mathcal {M}_{*}(2)\) of non-empty multisets over 2. Explicitly, is given by:

figure n

So the urn keeps track of the outcomes of the coin tosses so far, and our belief about the bias of the coin is updated according to successive experiments.

4.1 Iterating Draws from Coalgebras

By iterating a coalgebra , we can simulate the experiment with K coin tosses by repeatedly applying h. Formally, we build a sequence of channels by and:

figure o

where the rightmost map is .

(Another way to see this is to regard a coalgebra in particular as coalgebra . Now \(\mathcal {M}(2)\times -\) is the writer monad for the commutative monoid \(\mathcal {M}(2)\), i.e. the monad for \(\mathcal {M}(2)\)-actions. Iterated Kleisli composition for this monad gives a sequence of maps .)

For example, starting with Pólya’s urn , we look at a few iterations:

(10)

It is not hard to see that we get the following general formula, for \(K\in \mathbb {N}\):

figure p

We are especially interested in the first component of \(h^K(x)\), which is in \(\mathcal {M}[K](2)\). Thus we define the channels:

figure q

For the Pólya urn, the really remarkable fact is that if we start with the urn containing one white ball and one black ball, the distributions on \(\mathcal {M}[K](2)\) that come from iteratively drawing are uniform distributions:

figure r
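This uniformity can be verified by exact enumeration. The sketch below (helper names ours) iterates the Pólya step, tracking only the number of black draws, with 0 = black as above:

```python
from fractions import Fraction

def polya_step(state):
    """One Polya draw on an urn (b, w): return {(colour, new_state): prob},
    where colour 0 = black, 1 = white; the ball is returned with a copy."""
    b, w = state
    return {(0, (b + 1, w)): Fraction(b, b + w),
            (1, (b, w + 1)): Fraction(w, b + w)}

def polya_counts(state, K):
    """Exact distribution of the number of black draws in K iterated draws."""
    dist = {(0, state): Fraction(1)}
    for _ in range(K):
        step = {}
        for (blacks, s), p in dist.items():
            for (colour, s2), q in polya_step(s).items():
                key = (blacks + (1 if colour == 0 else 0), s2)
                step[key] = step.get(key, Fraction(0)) + p * q
        dist = step
    counts = {}
    for (blacks, _), p in dist.items():
        counts[blacks] = counts.get(blacks, Fraction(0)) + p
    return counts

# starting from one black and one white ball, the counts are uniform:
assert polya_counts((1, 1), 5) == {k: Fraction(1, 6) for k in range(6)}
```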

More generally:

Proposition 4

(11)

Proof

We start from the K-th iteration formula for , given just before Proposition 4, and derive:

figure s

From this characterization, we can deduce that the channels form a cone over the draw-and-delete chain (6), i.e.  that , by a routine calculation.

In what follows we provide two other ways to see that forms a cone. First we deduce it using the fact that the coalgebra is exchangeable (Lemma 6). Second we deduce it by noticing that \( pol _K(b|{}0{}\rangle +w|{}1{}\rangle )(k|{}0{}\rangle +(K-k)|{}1{}\rangle )\) is distributed as , the beta-binomial distribution, and hence factors through the binomial cone (Proposition 2) via the beta channel (Lemma 7).

4.2 Exchangeable Coalgebras and Cones

We would like to focus on those coalgebras that describe consistent beliefs, i.e.  for which is a cone over the chain (6):

(12)

For example, Pólya’s urn forms a cone. But the deterministic alternating coalgebra

(13)

does not form a cone, since for instance , while .

The following condition is inspired by de Finetti’s notion of exchangeability, and gives a simple coinductive criterion for a coalgebra to yield a cone (12). It is also related to ideas from probabilistic programming (e.g.   [1]).

Definition 5

A coalgebra will be called exchangeable if its outputs can be reordered without changing the statistics, as expressed by commutation of the following diagram.

figure t

Returning to Pólya’s urn coalgebra , we see that it is exchangeable, for, by a similar argument to (10),

figure u

Reordering the results does not change the statistics, because \(b\cdot w=w\cdot b\). On the other hand, the deterministic alternating coalgebra from (13) is not exchangeable, since:

figure v

Before we state and prove our theorem about exchangeable coalgebras, we also consider a different sequence, which keeps track of the order of draws. We define a sequence , with \(h^{\sharp 0}(x)=((),x)\) and . Then we define channels by

(15)

These constructions are related to the earlier constructions \(h^K\) and \(h_K\) by

figure w

where is the accumulator map from (4). The difference is that the \(h^{\sharp }\) variants keep track of the order of draws, which will be useful in our arguments about exchangeability, since h is exchangeable (Definition 5) if and only if .

Lemma 6

Let be an exchangeable coalgebra. Then

  1. 1.

    the map \(h_{K}\) can be expressed as:

    (16)
  2. 2.

    the collection of maps \((h_{K})\) forms a cone for the chain of draw-and-delete maps (6).

Proof

  1. 1.

    Among \(b_{1}, \ldots , b_{K} \in 2\) there are -many possibilities of having k-many ones. By exchangeability of h there is a normal form with all these ones upfront

    $$ \begin{array}{rcl} h^{\sharp }_{K}(x)\big (b_{1},\ldots , b_{K}\big )= & {} h^{\sharp }_{K}(x)\big (\,\underbrace{1,\ldots ,1}_{k\text { times}}, \, \underbrace{0,\ldots ,0}_{K-k\text { times}}\big ) \end{array} $$
  2. 2.

    Using (7) we get, for any \(\varphi = (k|{}1{}\rangle +(K-k)|{}0{}\rangle ) \in \mathcal {M}[K](2)\),

    figure x

Since the Pólya’s urn coalgebra is exchangeable, the lemma provides another way to deduce that the channels form a cone (12).

5 Background on Continuous Probability

De Finetti’s theorem is about the passage from discrete probability to continuous probability. Consider the uniform distribution \(\mu \) on [0, 1]. The probability of any particular point is 0. So rather than give a probability to individual points, we give probabilities to sets; for example, the probability of a point in the half interval \([0,\frac{1}{2}]\) is \(\frac{1}{2}\).

In general, a \(\sigma \)-algebra on a set X is a collection \(\varSigma \) of subsets of X that contains X and is closed under countable unions and complements. A measurable space \((X,\varSigma )\) is a set together with a \(\sigma \)-algebra; the sets in \(\varSigma \) are called measurable. In this paper we mainly consider two kinds of \(\sigma \)-algebra, and so they will often be left implicit:

  • on a countable set (e.g.  2, \(\mathcal {M}(2)\), \(\mathcal {M}[K](2)\)) we consider the \(\sigma \)-algebra where every set is measurable.

  • on the unit interval [0, 1], the Borel \(\sigma \)-algebra, which is the least \(\sigma \)-algebra containing the intervals.

On a measurable space \((X,\varSigma )\), a probability measure is a function \(\omega :\varSigma \rightarrow [0,1]\) with \(\omega (X) = 1\) and with \(\omega (\bigcup _{i}U_{i}) = \sum _{i}\omega (U_{i})\) for every countable family of pairwise disjoint measurable subsets \(U_{i}\in \varSigma \). On a finite set X the probability measures coincide with the discrete distributions.

The morphisms of measurable spaces are the functions \(f:X\rightarrow Y\) for which the inverse image of a measurable set is measurable. Crucially, if \(f:X\rightarrow [0,1]\) is measurable and \(\omega :\varSigma \rightarrow [0,1]\) is a probability measure, then we can find the Lebesgue integral \(\int f(x)\,\omega (\mathrm {d}x)\), which is the expected value of f. When X is finite, the Lebesgue integral is a sum: \(\int f(x)\,\omega (\mathrm {d}x)=\sum _Xf(x)\omega (\{x\})\). When \(X=[0,1]\) and \(\omega \) is the uniform distribution, then the Lebesgue integral is the familiar one.
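In the discrete case the Lebesgue integral is literally a finite weighted sum, as the following minimal sketch illustrates (helper name ours):

```python
from fractions import Fraction

def expectation(f, omega):
    """Lebesgue integral of f against a discrete probability measure omega:
    on a finite set it is the weighted sum over points, sum_x f(x)*omega({x})."""
    return sum(f(x) * p for x, p in omega.items())

# expected number of heads in two fair coin flips, omega = bn[2](1/2)
omega = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
assert expectation(lambda k: k, omega) == 1
assert expectation(lambda k: 1, omega) == 1   # total mass is one
```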

We write \(\mathcal {G}(X,\varSigma )\) for the set of all probability measures on \((X,\varSigma )\); we often omit the \(\sigma \)-algebra \(\varSigma \) when it is clear from the context. On a countable set, probability measures are exactly discrete distributions: \(\mathcal {D}(X) \cong \mathcal {G}(X)\), when all subsets are seen as measurable. The set \(\mathcal {G}(X,\varSigma )\) itself has a \(\sigma \)-algebra, which is the least \(\sigma \)-algebra making the sets \(\{\omega ~|~\omega (U)<r\}\) measurable for all fixed U and r.

Like \(\mathcal {D}\), the construction \(\mathcal {G}\) is a monad, commonly called the Giry monad, but \(\mathcal {G}\) is a monad on the category of measurable spaces. Maps in the associated Kleisli category \(\mathcal {K}{}\ell (\mathcal {G})\) will also be called channels. Thus between measurable spaces can be described as a function \({f:X \times \varSigma _Y\rightarrow [0,1]}\) that is a measure in \(\varSigma _Y\) and measurable in X. For a measure \(\omega \in \mathcal {G}(X)\) and a channel we define a new distribution and a new channel by Kleisli composition:

figure y

Notice that a discrete channel is in particular a continuous channel, according to our terminology, and so we have an inclusion of Kleisli categories \(\mathcal {K}{}\ell (\mathcal {D}) \hookrightarrow \mathcal {K}{}\ell (\mathcal {G})\).

Illustrations from the Beta-Bernoulli Distributions. At the end of Sect. 4.1, we briefly mentioned the beta-Bernoulli distribution as it arises from the coalgebra for Pólya’s urn. In general, this beta-binomial channel is defined for arbitrary \(\alpha ,\beta \in \mathbb {R}_{> 0}\) as:

(17)

This description involves the function defined as:

(18)

This is a well-known mathematical function satisfying for instance:

(19)

for \(n,m\in \mathbb {N}\). The latter equation shows that (11) is an instance of (17).
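A numerical sketch of the beta-binomial channel (17), using the standard Gamma-function representation \(B(\alpha ,\beta ) = \Gamma (\alpha )\Gamma (\beta )/\Gamma (\alpha +\beta )\); with \(\alpha =\beta =1\) it reproduces the uniform distributions seen for Pólya’s urn started from one black and one white ball (helper names ours):

```python
from math import comb, gamma

def beta_fn(a, b):
    """Euler Beta function B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def betabin(K, a, b):
    """Beta-binomial distribution on {0,...,K}, identified with M[K](2)."""
    return {k: comb(K, k) * beta_fn(a + k, b + K - k) / beta_fn(a, b)
            for k in range(K + 1)}

# with alpha = beta = 1 (one black and one white ball) this is uniform:
d = betabin(5, 1, 1)
assert all(abs(p - 1 / 6) < 1e-12 for p in d.values())
```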

The function is also used to define the Beta distribution on the unit interval [0, 1]. We describe it here as a channel in the Kleisli category of the Giry monad:

figure z

It is defined on \(\alpha ,\beta \in \mathbb {R}_{> 0}\) and measurable \(M\subseteq [0,1]\) as:

(20)

We prove some basic properties of the beta-binomial channel. These properties are well-known, but are formulated here in slightly non-standard form, using channels.

Lemma 7

  1. 1.

    A beta-binomial channel can itself be decomposed as:

    figure aa
  2. 2.

    The beta-binomial channels

    figure ab

    for \(K\in \mathbb {N}\), form a cone for the infinite chain of draw-and-delete channels (6).

Proof

  1. 1.

    Composition in the Kleisli category \(\mathcal {K}{}\ell (\mathcal {G})\) of the Giry monad \(\mathcal {G}\) yields:

    figure ac
  2. 2.

    The easiest way to see this is by observing that the channels factorise via the binomial channels . In Proposition 2 we have seen that these binomial channels commute with draw-and-delete channels . \(\square \)

6 De Finetti as a Limit Theorem

We first describe our ‘limit’ reformulation of the de Finetti theorem and later on argue why this is a reformulation of the original result.

Theorem 8

For \(X = 2\) the cone (8) is a limit in the Kleisli category \(\mathcal {K}{}\ell (\mathcal {G})\) of the Giry monad; this limit is the unit interval \([0,1] \cong \mathcal {D}(2)\).

This means that if we have channels with , then there is a unique mediating channel with , for each \(K\in \mathbb {N}\).

In the previous section we have seen an example with \(Y = \mathbb {R}_{> 0}\times \mathbb {R}_{> 0}\), with cone given by the beta-binomial channels of Lemma 7, and with the Beta channel as mediating channel.

Below we give the proof, in the current setting. We first illustrate the central role of the draw-and-delete maps and how they give rise to hypergeometric and binomial distributions.

Via the basic isomorphism \(\mathcal {M}[K](2) \cong \{0,1,\ldots ,K\}\) the draw-and-delete map (7) becomes a map , given by:

(21)

Notice that the border cases \(\ell = 0\) and \(\ell = K+1\) are handled implicitly, since the corresponding probabilities are then zero.

We can iterate , say N times, giving . If we draw-and-delete N times from an urn with \(K+N\) balls, the result is the same as sampling without replacement K times from the same urn. Sampling without replacement is modelled by the hypergeometric distribution. So we can calculate explicitly, using the hypergeometric distribution formula:

figure ad

Implicitly the formal sum is over \(k\le \ell \), since otherwise the corresponding probabilities are zero.
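The identification of iterated draw-and-delete with sampling without replacement can be confirmed by exact computation; the sketch below (helper names ours) applies the step of (21) N times to a point mass and compares with the hypergeometric formula:

```python
from fractions import Fraction
from math import comb

def dd(omega, K):
    """One draw-and-delete step D({0,...,K+1}) -> D({0,...,K}), as in (21)."""
    out = {}
    for l, p in omega.items():
        if l >= 1:
            out[l - 1] = out.get(l - 1, Fraction(0)) + Fraction(l, K + 1) * p
        if l <= K:
            out[l] = out.get(l, Fraction(0)) + Fraction(K + 1 - l, K + 1) * p
    return out

def dd_iter(l, K, N):
    """Apply draw-and-delete N times to the point mass at l in {0,...,K+N}."""
    omega = {l: Fraction(1)}
    for i in range(N):
        omega = dd(omega, K + N - 1 - i)
    return omega

# sampling K without replacement from K+N balls of which l are black:
K, N, l = 3, 4, 5
hyp = {k: Fraction(comb(l, k) * comb(K + N - l, K - k), comb(K + N, K))
       for k in range(K + 1) if comb(l, k) * comb(K + N - l, K - k) > 0}
assert dd_iter(l, K, N) == hyp
```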

As is well known, for large N, the hypergeometric distribution can be approximated by the binomial distribution. To be precise, for fixed K, if \((\nicefrac {\ell }{K+N})\rightarrow p\) as \(N\rightarrow \infty \), then the hypergeometric distributions approach the binomial distribution with bias p. This works as follows. Using the notation \((n)_{m} = \prod _{i<m} (n-i) = n(n-1)(n-2)\cdots (n-m+1)\) we can write and thus:

figure ae

The proof of Theorem 8 makes use of a classical result of Hausdorff  [10], giving a criterion for the existence of probability measures. It is repeated here without proof (but see e.g.   [6] for details).

Recall that the moments of a distribution \(\mu \in \mathcal {G}([0,1])\) form a sequence of numbers \(\mathfrak {m}_1, \dots , \mathfrak {m}_K,\ldots \in [0,1]\) given by \(\mathfrak {m}_K=\int _0^1 r^K\,\mu (\mathrm {d}r)\): the expected values of \(r^K\). In general, moments correspond to statistical concepts such as mean, variance, skew and so on. But in this specific case, the K-th moment of \(\mu \) is the probability that K balls drawn will all be black.

Recall that a sequence \(a = (a_{n})\) is completely monotone if:

$$ \begin{array}{rcl} (-1)^{k}\cdot \big (\varDelta ^{k} a\big )_{n}\ge & {} 0 \qquad \text{ for } \text{ all } n,k. \end{array} $$

This formulation involves the difference operator \(\varDelta \) on sequences, given by \((\varDelta a)_{n} = a_{n+1} - a_{n}\). This operator is iterated, in the standard way as \(\varDelta ^{0} = \mathrm {id}_{}\) and \(\varDelta ^{m+1} = \varDelta \mathrel {\circ }\varDelta ^{m}\).
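As a quick sketch, complete monotonicity can be checked for the moment sequence \(\mathfrak {m}_n = \frac{1}{n+1}\) of the uniform measure on [0, 1] (helper names ours; by Claim 1 below, \((-1)^{k}(\varDelta ^{k}\mathfrak {m})_{n}\) is an integral of the non-negative function \(x^{n}(1-x)^{k}\), so it must be non-negative):

```python
from fractions import Fraction

def delta(a):
    """Difference operator on sequences-as-functions: (Delta a)_n = a_{n+1} - a_n."""
    return lambda n: a(n + 1) - a(n)

def iterate_delta(a, k):
    """Iterate the difference operator k times."""
    for _ in range(k):
        a = delta(a)
    return a

m = lambda n: Fraction(1, n + 1)   # moments of the uniform measure on [0,1]

# complete monotonicity: (-1)^k (Delta^k m)_n >= 0 for all n, k
assert all((-1) ** k * iterate_delta(m, k)(n) > 0
           for k in range(8) for n in range(8))
```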

Theorem 9

(Hausdorff). A sequence \(a = (a_{n})_{n\in \mathbb {N}}\) of non-negative real numbers \(a_n\) is completely monotone if and only if it arises as sequence of moments: there is a unique measure \(\mu \in \mathcal {G}([0,1])\) with, for each n,

figure af

Proof

(of Theorem 8). As a first step, we shall reason pointwise. Let, for each \(K\in \mathbb {N}\), a distribution \(\omega _{K}\in \mathcal {D}(\{0,1,\ldots ,K\})\) be given, with . We need to produce a unique (continuous) distribution \(\omega \in \mathcal {G}([0,1])\) such that , for each K.

The problem of finding \(\omega \) is essentially Hausdorff’s moment problem, see Theorem 9. The connection between Hausdorff’s moments problem and de Finetti’s theorem is well known  [6, §VII.4], but the connection is particularly apparent in our channel-based formulation.

We exploit Hausdorff’s criterion via two claims, which we elaborate in some detail. The first one is standard.

Claim 1. If \(\mu \in \mathcal {G}([0,1])\) has moments \(\mathfrak {m}_{K}\), then:

$$ \begin{array}{rcl} \displaystyle \int _{0}^{1} x^{n}\cdot (1-x)^{k} \,\mu (\mathrm {d}x)= & {} (-1)^{k}\cdot \big (\varDelta ^{k} \mathfrak {m}\big )_{n}. \end{array} $$

This equation can be proven by induction on k. It holds by definition of \(\mathfrak {m}\) for \(k=0\). Next:

figure ag

The next claim involves a crucial observation about our setting: the whole sequence of distributions \(\omega _{K} \in \mathcal {D}(\{0,1,\ldots ,K\})\) is entirely determined by the probabilities \(\omega _{K}(K)\) of drawing K balls which are all black.

Claim 2. The numbers , for \(K\in \mathbb {N}\), determine all the distributions \(\omega _K\) via the relationship:

figure ah

In particular, the sequence \(\mathfrak {b}_{K}\) is completely monotone.

The proof works again by induction on k. The case \(k=0\) holds by definition. For the induction step we use the cone property . Via (21) it gives:

$$ \begin{array}{rcl} \omega _{k}(i)= & {} \frac{i+1}{k+1}\cdot \omega _{k+1}(i+1) + \frac{k+1-i}{k+1}\cdot \omega _{k+1}(i), \end{array} $$

so that:

$$ \begin{array}{rcl} \omega _{k+1}(i)= & {} \frac{k+1}{k+1-i}\cdot \omega _{k}(i) - \frac{i+1}{k+1-i}\cdot \omega _{k+1}(i+1). \end{array} $$

We use this equation in the first step below.

figure ai
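Claim 2 can be sanity-checked numerically. The sketch below assumes that the relationship takes the standard form \(\omega _{K}(k) = \binom{K}{k}(-1)^{K-k}\big (\varDelta ^{K-k}\mathfrak {b}\big )_{k}\) (the displayed formula is not reproduced above, so this exact form is our assumption), and verifies it on the binomial cone, where \(\mathfrak {b}_{n} = r^{n}\):

```python
from fractions import Fraction
from math import comb

def diff(b):
    """(Delta b)_n = b_{n+1} - b_n, on sequences given as functions."""
    return lambda n: b(n + 1) - b(n)

def omega(b, K, k):
    """Recover omega_K(k) from the 'all black' probabilities b_n = omega_n(n),
    via the assumed relation omega_K(k) = C(K,k)*(-1)^(K-k)*(Delta^(K-k) b)_k."""
    d = b
    for _ in range(K - k):
        d = diff(d)
    return comb(K, k) * (-1) ** (K - k) * d(k)

# check against the binomial cone with fixed bias r, where b_n = r^n:
r = Fraction(1, 3)
b = lambda n: r ** n
assert all(omega(b, 5, k) == comb(5, k) * r**k * (1 - r)**(5 - k)
           for k in range(6))
```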

This second claim implies that there is a unique distribution \(\omega \in \mathcal {G}([0,1])\) whose K-th moment equals \(\mathfrak {b}_{K} = \omega _{K}(K)\). But then:

figure aj

As an aside, it is perhaps instructive to look at how the distribution \(\omega \in \mathcal {G}([0,1])\) is built in the proof of Hausdorff’s criterion. One way is as a limiting measure of the sequence of finite discrete distributions on [0, 1],

(22)

This is illustrated in the following table, for the examples of the binomial cone and the uniform cone. The graphs show \(\bar{\omega }_K\) (plotted as densities with respect to the counting measure).

We conclude by explaining why this construction gives a measurable mediating map. Suppose we have a measurable space \((Y,\varSigma )\) and a sequence of channels forming a cone over (8), i.e.   for all \(K\in \mathbb {N}\). We define a mediating channel by putting the distribution c(y) as the unique one with moments \(c_K(y)(K)\), as above. It remains to show that c, regarded as a function \(Y\rightarrow \mathcal {G}([0,1])\), is a measurable function. It is immediate that the sequence-of-moments function \(\mathfrak {m} :Y\rightarrow [0,1]^\infty \) given by:

figure ak

is measurable, because the \(c_K\)’s are assumed to be channels: let \(V \subseteq [0,1]\) be measurable; for associated basic measurable subsets of \([0,1]^{\infty }\) one has:

$$ \begin{array}{rcccl} \mathfrak {m}^{-1}\big ([0,1]^{K} \times V \times [0,1]^{\infty }\big )= & {} \{y\in Y\;|\;c_{K}(y)(K) \in V\}= & {} c_{K}^{-1}\big ([0,1]^{K} \times V\big ). \end{array} $$

Further, \(c:Y\rightarrow \mathcal {G}([0,1])\) factors through \(\mathfrak {m}\); indeed \(\mathcal {G}([0,1])\) can be thought of as a subset of \([0,1]^\infty \), when each distribution is regarded as its sequence of moments. We conclude by the less well-known fact that \(\mathcal {G}([0,1])\) is a subspace of \([0,1]^\infty \).    \(\square \)

figure al

Theorem 8 is a reformulation of a classical result of de Finetti  [5]. That classical result introduced the concept of exchangeability of random variables, which has disappeared completely from our reformulation but is implicit in our use of multisets. This requires some explanation.

A (discrete) random variable is a pair \((\rho , U)\) consisting of a distribution \(\rho \in \mathcal {D}(A)\) and an observable \(U:A \rightarrow \mathbb {R}\). Usually the distribution \(\rho \) is left implicit. Instead, the notation P(U) is used for the image distribution \(\mathcal {D}(U)(\rho )\) on \(\mathbb {R}\), also sometimes written as \(P(U=r)\), for \(r\in \mathbb {R}\).

Let \((U_{1},\rho _{1}), \ldots , (U_{n},\rho _{n})\) be a finite sequence of random variables. It induces a joint distribution on \(\mathbb {R}^n\),

figure am

where \(U_{1}\times \cdots \times U_{n} :A_{1} \times \cdots \times A_{n} \rightarrow \mathbb {R}^{n}\) and \(\rho _{1} \otimes \cdots \otimes \rho _{n} \in \mathcal {D}(A_{1} \times \cdots \times A_{n})\).

Such a sequence \((U_{i},\rho _{i})\) is called exchangeable if for each permutation \(\pi \):

$$ P(U_{1}, \ldots , U_{n}) \;=\; P(U_{\pi (1)}, \ldots , U_{\pi (n)}). $$

An infinite sequence is called exchangeable if each finite subsequence is exchangeable.
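For intuition, exchangeability of a finite joint distribution can be checked directly by brute force over permutations. The following Python sketch (our illustration; the example dictionaries `exch` and `markov` are made up for the demonstration) does this for binary tuples with exact rational probabilities:

```python
# Illustration: a joint distribution on binary tuples is exchangeable iff it
# is invariant under every permutation of the coordinates.
from itertools import permutations
from fractions import Fraction

def is_exchangeable(joint):
    """joint: dict mapping binary tuples (all of one length n) to probabilities.
    Returns True iff permuting coordinates leaves the distribution unchanged."""
    n = len(next(iter(joint)))
    return all(
        joint[xs] == joint[tuple(xs[pi[i]] for i in range(n))]
        for xs in joint
        for pi in permutations(range(n))
    )

# A law depending only on the number of 1s is exchangeable:
exch = {(0, 0): Fraction(1, 3), (0, 1): Fraction(1, 6),
        (1, 0): Fraction(1, 6), (1, 1): Fraction(1, 3)}
# A law distinguishing (0,1) from (1,0) is not:
markov = {(0, 0): Fraction(1, 2), (0, 1): Fraction(1, 4),
          (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 8)}
assert is_exchangeable(exch)
assert not is_exchangeable(markov)
```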

The original form of de Finetti’s theorem involves binary (0–1) random variables with \(U_{i} :A_{i} \rightarrow 2 = \{0,1\} \hookrightarrow \mathbb {R}\). The distribution \(P(U_{i})\) is then a ‘Bernoulli’ distribution on 2. When random variables are exchangeable, their order does not matter; only how many of them are zero or one does. This is captured by their sum \(\sum _{i}U_{i}\). But this sum is precisely a multiset over 2. Indeed, in a multiset, the order of elements does not matter, only their multiplicities (numbers of occurrences). This is a crucial observation underlying our reformulation, which we already applied in the accumulator map in (4). But let us first look at the usual formulation of de Finetti’s (binary) representation theorem, see  [5] and the textbooks  [6, 16].

Theorem 10

Let \(U_{1}, U_{2}, \ldots \) be an infinite sequence of exchangeable binary random variables. Then there is a probability measure \(\omega \) on [0, 1] such that for all n,

$$ P(U_{1}, \ldots , U_{n}) \;=\; \int _{0}^{1} x^{\sum _{i}U_{i}}\,(1-x)^{\,n-\sum _{i}U_{i}}\;\omega ({}\mathrm {d}{}x). $$
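The formula can be checked symbolically in a small example. With \(\omega \) the uniform distribution, Euler's beta integral gives \(\int _0^1 x^k(1-x)^{n-k}\,\mathrm {d}x = k!\,(n-k)!/(n+1)!\), so the number of ones among n tosses is uniformly distributed on \(\{0,\ldots ,n\}\), recovering the indifferent beliefs of the introduction. The following Python sketch (our illustration; function names are ours) verifies this with exact fractions:

```python
# Illustration: de Finetti's integral for a uniform prior, evaluated exactly.
from fractions import Fraction
from math import comb, factorial

def beta_integral(k, n):
    """Euler's beta integral: integral of x^k (1-x)^(n-k) over [0,1]
    = k! (n-k)! / (n+1)!."""
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

# Any particular sequence with k ones among n tosses has probability
# beta_integral(k, n); summing over the comb(n, k) such sequences shows the
# number of ones is uniform on {0, ..., n}:
n = 5
for k in range(n + 1):
    assert comb(n, k) * beta_integral(k, n) == Fraction(1, n + 1)
```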

It is clear that there is a considerable gap between this original formulation and the inverse limit formulation that has been developed in this paper. There is room for narrowing this gap.

What de Finetti’s theorem actually uses about n exchangeable binary random variables is their outcomes via the accumulator map from (4), giving a map .

In our approach we have reformulated the requirements about random variables and about exchangeability and worked directly with distributions on multisets, see Proposition 2 and Theorem 8. De Finetti’s (binary) representation Theorem 10 becomes a limit statement in a Kleisli category. We think that this approach brings forward the mathematical essentials.
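The role of the accumulator map can be made concrete: it sends a binary tuple to the multiset of its entries, recorded by multiplicities, and it identifies exactly those tuples that are permutations of each other. A small Python sketch (our illustration; the function name `acc` is ours):

```python
# Illustration: the accumulator map from a binary tuple in 2^n to a multiset
# over 2, represented here as a pair (multiplicity of 0, multiplicity of 1).
from collections import Counter

def acc(xs):
    """Send a tuple of bits to the multiset of its elements."""
    c = Counter(xs)
    return (c[0], c[1])

# Reordering a sequence does not change its accumulated multiset, which is
# why exchangeability collapses to 'distributions on multisets':
assert acc((0, 1, 0, 1, 1)) == acc((1, 1, 1, 0, 0)) == (2, 3)
```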

7 A Final Exchangeable Coalgebra

We return to the setting of coalgebras, begun in Sect. 4. Any coalgebra for the functor \(2\times (-)\) on \(\mathcal {K}{}\ell (\mathcal {D})\) or on \(\mathcal {K}{}\ell (\mathcal {G})\) is a probabilistic process that produces a random infinite sequence of boolean values. Following  [14], we can understand this induced sequence as arising from the fact that the pair of head and tail functions

figure an

is a final coalgebra in the category \(\mathcal {K}{}\ell (\mathcal {G})\), and so there is a unique channel assigning to each element of X the random sequence that it generates. Here \(2^\infty \) is the measurable space of infinite binary streams, regarded with the product \(\sigma \)-algebra.

It is helpful to spell this construction out a little. The sequence of first projections

(23)

has as a limit the space \(2^\infty \), with the limiting cone being the obvious projections. And this limit is also a limit in \(\mathcal {K}{}\ell (\mathcal {G})\). (This is a categorical formulation of Kolmogorov’s extension theorem, see  [4, Thm. 2.5].)

Now, as usual in coalgebra theory, we find for any coalgebra by noticing that the channels (15) define a cone over the chain (23). The limiting channel for the cone is the unique coalgebra homomorphism to the final coalgebra.

Example 11

We define the bernoulli coalgebra to be the channel given by . In words, makes a coin toss with known bias r and returns the outcome together with the same bias r unchanged.

The final channel takes a bias \(r\in [0,1]\) to a random infinite sequence of independent and identically distributed binary values, each distributed as \(r|{}1{}\rangle +(1-r)|{}0{}\rangle \). More explicitly, for basic measurable subsets, at each stage n,

figure ao
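Operationally, the bernoulli coalgebra and its finite-stage unfolding can be sketched as a sampler. The following Python code (our illustration; the function names are ours) draws i.i.d. bits with a fixed bias r, mirroring the finite stages of the final channel:

```python
# Illustration: one step of the bernoulli coalgebra, and its n-fold unfolding.
import random

def bernoulli_step(r):
    """Toss a coin with bias r; return (outcome, unchanged state r)."""
    return (1 if random.random() < r else 0, r)

def unfold(state, step, n):
    """Unfold a coalgebra n times, collecting the emitted bits: a
    finite-stage approximation of the final channel into 2^infty."""
    out = []
    for _ in range(n):
        bit, state = step(state)
        out.append(bit)
    return out

random.seed(0)
seq = unfold(0.5, bernoulli_step, 10)
assert len(seq) == 10 and set(seq) <= {0, 1}
```

The state r is returned unchanged at every step, which is exactly why the generated bits are independent and identically distributed.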

So, any coalgebra defines a cone over (23). We have shown in Lemma 6 that any exchangeable coalgebra defines a cone over the draw-delete chain (6). We use this to show that the bernoulli coalgebra is the final exchangeable coalgebra.

The final coalgebra is not exchangeable. To see this, consider the infinite alternating sequence \(01010101\dots \in 2^\infty \), so that the left hand side of (14) at the sequence gives \((0,1,010101\dots )\), whereas the right hand side of (14) at the sequence gives \((1,0,010101\dots )\), which is different.

The bernoulli coalgebra is exchangeable, because

figure ap

Notice that reordering the outputs does not change the statistics.

Theorem 12

The bernoulli coalgebra is the final exchangeable coalgebra.

Proof

Let be an exchangeable coalgebra. By Lemma 6 the associated maps form a cone. By de Finetti’s theorem in limit form, that is, by Theorem 8, we get a unique mediating channel with for each K. Our aim is to show that f is the unique coalgebra map: .

We do know that the limiting map is the unique coalgebra homomorphism. So to show that f is the unique coalgebra homomorphism, it suffices to show that is a monomorphism in \(\mathcal {K}{}\ell (\mathcal {G})\).

To see this, first notice that the K-th moment of some distribution \(\omega \in \mathcal {G}([0,1])\) is the probability , and so if two random sequences are equal, , then in particular the moments of \(\omega _1\) and \(\omega _2\) are equal and so \(\omega _1=\omega _2\). Moreover the endofunctor \(2\times -\) preserves monomorphisms. So if the final channel factors through as a channel then it also factors through as a coalgebra homomorphism, necessarily uniquely, by the definition of monomorphism. \(\square \)

Example 13

We return to the example of Pólya’s urn coalgebra . We showed in Proposition 4 that the cone defined by iterated draws is given by the beta-bernoulli distributions: , and in Proposition 7 that the beta-bernoulli cone over (6) has the beta distribution as its mediating map. So the final channel is given by taking \(\textit{beta}(\varphi )\) to be the beta distribution with parameters \((\varphi (0),\varphi (1))\), and this is a coalgebra homomorphism. The commuting diagram

figure aq

expresses the ‘conjugacy’ between the bernoulli and beta distributions, see also  [13].
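The conjugacy expressed by this diagram can also be observed empirically: in Pólya's urn started with a balls of colour 1 and b balls of colour 0, the long-run fraction of 1-draws is beta(a, b)-distributed, with mean a/(a+b). A Python sketch (our simulation, not part of the formal development; the function name is ours):

```python
# Illustration: Polya's urn coalgebra, iterated.  Each draw returns the ball
# together with an extra ball of the same colour.
import random

def polya_run(a, b, n):
    """Run Polya's urn for n draws from initial counts (a, b);
    return the fraction of colour-1 draws."""
    ones = 0
    for _ in range(n):
        if random.random() < a / (a + b):
            a += 1          # drew colour 1: replace and add a copy
            ones += 1
        else:
            b += 1          # drew colour 0: replace and add a copy
    return ones / n

random.seed(1)
# The empirical mean of the long-run fraction approaches a/(a+b) = 2/5:
samples = [polya_run(2, 3, 500) for _ in range(2000)]
mean = sum(samples) / len(samples)
assert abs(mean - 2 / 5) < 0.05
```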

8 Conclusions and Further Work

This paper has translated the classical representation theorem of de Finetti into the language of multisets and distributions and turned it into a limit theorem. This limit is subsequently used for a finality result for exchangeable coalgebras.

We have concentrated, as is usual for the de Finetti theorem, on the binary case, giving a limit statement for the special case \(X = 2 = \{0,1\}\) of the cone in Proposition 2. There is an obvious question whether this formulation can be extended to \(X = \{0,1,\ldots ,n-1\}\) for \(n > 2\), or to more general measurable spaces. We see two possible avenues:

  • extension to finite sets via a multivariate version of Hausdorff’s moments theorem, see  [15];

  • extension to Polish spaces, following  [2] and its categorical justification for extending de Finetti’s theorem to such spaces (esp. Theorems 35 and 40) in terms of exchangeability.

Another point for future work is to investigate whether the categorical formulation of de Finetti’s theorem could be taken as an axiom for a ‘synthetic probability theory’ (following, e.g.,  [7, 8, 17]), and indeed whether it holds in some proposed categorical models which do support de-Finetti-like arguments  [11, 18].

Finally, we note that chains of channels between multisets also arise in recent exponential modalities for linear logic  [3, 9]; the connections remain to be worked out.