# Functional Purity, Exchangeability, and De Finetti's Theorem


By Timothy J. O'Donnell, Noah D. Goodman, Andreas Stuhlmueller, and the Church Working Group.


# Purity

A procedure in a programming language is said to be *pure* if, when applied to the same arguments, it is guaranteed to always return the same value. All traditional mathematical functions are pure by definition. For instance, the procedure below is pure.
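A minimal definition consistent with the discussion below (the procedure is named `add-one` in the text):

```scheme
;; A pure procedure: the result depends only on the argument x.
(define (add-one x) (+ x 1))
```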

No matter how many times we apply `add-one` to some number, we will always get back the same answer. The return value of this procedure depends only on the argument bound to the formal parameter `x`. In particular, the return value of a pure function cannot depend on any external state. Many programming languages provide a procedure such as `date` or `time`, which returns the current date/time, as part of their standard library. Because they depend on the external state of the computer's clock, such procedures are impure.

Another way that a procedure can be impure is by having *side-effects*. A side effect is some "output" or modification of state by a procedure in addition to its normal output. For instance, the following procedure modifies the global variable `my-variable` to the current time on each call.

```scheme
(define (my-proc x)
  (begin
    (set! my-variable (time))
    (+ x 1)))
```

Notice here the procedure `set!`. This is a Scheme procedure which *mutates* the value of a variable. Mutation refers to changing the state of a variable "in place". In a pure setting, variables and data structures are *persistent*. This means that once they are created they **can never be changed**. *New* data structures can be constructed from them, by adding or subtracting parts, etc.; however, they cannot be changed in place. Mutation of existing state is what introduces side-effects.

Purity is an important concept in the study of deterministic programming languages because pure programs are easier to reason about. This means that programs written in a pure style are often easier to engineer, to implement, to optimize and to parallelize. Purity guarantees a "what you see is what you get" relationship between inputs and outputs of a function. Nothing surprising can happen behind the scenes by mutating state.

Some programming languages guarantee purity by not providing any facility for doing mutation at all. Church is one such language (although as will be discussed in the next section, the relevant notion of purity is a generalization of that discussed here).

An important intuition about (non-)purity is that mutation and other side-effects introduce a dependence on time in a computation. There is some moment that the state of the program is mutated, and there is a moment after the state has been mutated. To reason about a program that mutates state, we must keep track of the points in time at which changes were made. This imposes a "before" and an "after" on the structure of the computation. Before the mutation, for example, the variable `my-variable` had one value; afterward it had another.

In a pure program, however, there is no notion of time. More complex computations can be built from smaller ones via composition. However, the dependence between the parts of a pure program is not time-based. All expressions in a pure deterministic program which have a value (i.e. which halt) will always have that value and always had that value. The computation of that value can depend *causally* on the computation of other expressions, but it does not depend *temporally*.

Generative models are meant to express causal processes. Because of this it may be desirable to work with a language that is pure. However, at first blush this seems impossible. Generative models are probabilistic; the expressions we use to define them *sample* values. Thus, for example, two evaluations of the same expression in Church are not guaranteed to return the same value. What we would like is some generalization of the notion of purity to the probabilistic setting. In the next section we discuss this generalization: *exchangeability*.

# Exchangeability

Suppose that we have some sequence of random variables.

- <math>X_1, X_2, X_3, ...</math>

This sequence is *exchangeable* if, for every permutation <math>\pi</math> of the indices, the permuted sequence

- <math>X_{\pi(1)}, X_{\pi(2)}, X_{\pi(3)}, ...</math>

has the same joint distribution as the original. In other words, if we see some sequence of <math>N</math> observations, then we are free to treat this sequence as a *set*: the probability of a sequence does not depend on the order of its elements.

Imagine that we have a sequence of independent and identically-distributed (i.i.d.) random variables, such as a sequence of <math>N</math> unbiased coin flips.
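In Church such a sequence might be written as follows (a sketch; assumes Church's `flip` primitive and a `repeat` helper):

```scheme
;; Four unbiased coin flips, sampled independently.
(repeat 4 (lambda () (flip 0.5)))
```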

It is easy to see that such a sequence is exchangeable. Suppose we get back `(true true false false)`; this clearly has the same probability as `(true false true false)`, `(false true true false)`, etc.

All i.i.d. random variables generate exchangeable sequences. However, there are also exchangeable sequences which are not i.i.d. Consider the following process for generating a sequence of coin flips. We start with two variables `a` and `b`. We generate the first observation by flipping a coin with weight <math>\frac{a}{a + b}</math>. If it comes out `true` then we add <math>1</math> to `a`, and if it comes out `false` we add <math>1</math> to `b`. We then generate the second flip with the new <math>a</math> and <math>b</math>. Based on the outcome of the second flip, we update our counts again, and continue. We can construct a procedure which implements this process for arbitrary `a` and `b` with the following Scheme code.

```scheme
(define (sample-coin a b)
  (let ((heads a)
        (total (+ a b)))
    (lambda ()
      (let ((x (flip (/ heads total))))
        (set! heads (+ heads (if x 1 0)))
        (set! total (+ total 1))
        x))))
```

The preceding code returns a procedure which generates sequences in the way just described. Such a sequence is clearly not i.i.d. Each draw depends on the preceding draws. However, it is exchangeable.

- <math>p(</math>`(true true false false)`<math>) = \frac{a}{(a+b)} \frac{(a+1)}{(a+b+1)} \frac{b}{(a+b+2)}\frac{(b+1)}{(a+b+3)}</math>
- <math>p(</math>`(true false true false)`<math>) = \frac{a}{(a+b)} \frac{b}{(a+b+1)} \frac{(a+1)}{(a+b+2)}\frac{(b+1)}{(a+b+3)}</math>
- <math>p(</math>`(false false true true)`<math>) = \frac{b}{(a+b)} \frac{(b+1)}{(a+b+1)} \frac{a}{(a+b+2)}\frac{(a+1)}{(a+b+3)}</math>

Note that each product has the same denominators and the same multiset of numerators; only the order of the numerators differs. Hence all three probabilities are equal: the probability of a sequence depends only on how many `true`s and `false`s it contains, not on their order.

There are also sequences which are not exchangeable. For example, we have seen how to represent Markov chains in Church.
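A sketch of such a chain (hypothetical, but consistent with the probabilities computed below: the initial state is uniform over three symbols, and each state transitions to one of the *other* two states with probability <math>\frac{1}{2}</math>; assumes Church's `flip` and `uniform-draw` primitives):

```scheme
;; Transition function: from each state, move to one of the other two
;; states with probability 1/2 (so a state never repeats).
(define (transition state)
  (cond ((equal? state 'a) (if (flip 0.5) 'b 'c))
        ((equal? state 'b) (if (flip 0.5) 'a 'c))
        (else (if (flip 0.5) 'a 'b))))

;; Generate a length-n sequence starting from state.
(define (chain-from state n)
  (if (= n 0)
      '()
      (cons state (chain-from (transition state) (- n 1)))))

(define (markov-chain n)
  (chain-from (uniform-draw '(a b c)) n))
```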

For a Markov chain, order matters.

- <math>p(</math>`(a b c a)`<math>) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2}</math>
- <math>p(</math>`(a a c b)`<math>) = \frac{1}{3} \cdot 0 \cdot \frac{1}{2} \cdot \frac{1}{2}</math>

Thus Markov-generated sequences are not exchangeable (although there is a relevant generalization of the notion of exchangeability for the Markov case).

### Exchangeability and Purity

Above we discussed the notion of functional purity and its importance in deterministic programming. We also discussed how functionally pure programs express causal rather than temporal dependence. Generative models are also meant to express causal dependence; however, as we discussed above, they are not pure in the classical sense, since they in general produce different values on different evaluations.

Elementary random procedures (ERPs) in Church are i.i.d. Thus we might be tempted at first to require that all Church expressions compute i.i.d. random variables. This turns out to be very restrictive. Consider the following program.
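A sketch of the kind of program meant here (reconstructed from the discussion that follows; assumes Church's `beta` and `flip` primitives and a `repeat` helper):

```scheme
;; Sample a coin weight once, then flip a coin with that weight
;; repeatedly.  The individual flips share the hidden weight.
(define weight (beta 1 1))
(define my-coin (lambda () (flip weight)))
(define observations (repeat 10 my-coin))
```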

Here we have defined a procedure which first samples a weight and then produces a sequence of flips with this bias. Is the sequence produced by this generative model i.i.d.? It is not. To convince yourself of this, think about the value of the next flip after you have observed some data. Suppose that you have seen a sequence where all of the outcomes were `true`. You would probably guess that the next coin flip will also be `true`. That is, your posterior distribution over the next flip depends on the preceding observations. This can be shown by the following Church query.
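A sketch of such a query (illustrative, assuming Church's `mh-query` form):

```scheme
;; Condition on ten `true` observations and ask about the next flip.
(mh-query 100 10
  (define weight (beta 1 1))
  (define my-coin (lambda () (flip weight)))
  (define observations (repeat 10 my-coin))
  (my-coin)
  (equal? observations
          '(true true true true true true true true true true)))
```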

As you can see here, there is high posterior probability that the next flip is true. What this means is that the individual flips are not i.i.d. The joint distribution over flips depends on all of the outcomes at once.

This fact is the result of the model containing a *hidden variable*, `weight`, upon which the flips are conditioned. If we knew the value of this hidden variable ahead of time, then the individual flips would be i.i.d. However, we do not know the hidden variable; instead we integrate it out (reminder: integrating out a variable in Church means ignoring it). The probability of a particular sequence of flips is the probability averaged over all possible values of `weight`. The individual flips are therefore coupled; previous flips influence your guesses about future flips.

Most Church programs have this property: the eventual output is conditioned on many hidden variables. If we knew the values of these variables, then each outcome would be i.i.d. Because we marginalize these variables away, however, observations are not i.i.d. Nevertheless, as we will see below, they are always guaranteed to be exchangeable.

Intuitively exchangeability means that a distribution has no dependence on the order of events. Above we discussed how one way of understanding functional purity is that a pure function has no temporal dependence. Exchangeability is therefore the natural generalization of purity in the stochastic setting. Church is a probabilistically functionally pure language, which means that all Church expressions must implement exchangeable distributions.

### De Finetti's Theorem

Exchangeability is a fundamental concept in Bayesian statistics—independent of its importance in probabilistic programming. One of the reasons for the importance of exchangeability as a concept is the mathematical result known as De Finetti's theorem.

Suppose that you have some exchangeable sequence of random variables

- <math>X_1, X_2, ...</math>

De Finetti's theorem says that there exists (with probability 1) a (random) distribution, called the *directing random measure*, <math>\nu</math>, for which the sequence is conditionally i.i.d., and that <math>\nu</math> is itself drawn from another distribution, <math>F</math>, called the *De Finetti measure*, which is guaranteed to exist.

Intuitively, what this means is that every exchangeable sequence also has another representation as a hierarchical model where we first draw <math>\nu \sim F</math> and then sample the observations of the sequence i.i.d. from <math>\nu</math>.
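In symbols: for any <math>n</math>, the joint probability of the first <math>n</math> elements factors as a mixture of i.i.d. draws,

- <math>P(X_1 \in B_1, \ldots, X_n \in B_n) = \int \prod_{i=1}^{n} \nu(B_i) \, dF(\nu)</math>

where the integral ranges over possible directing measures <math>\nu</math>, weighted by the De Finetti measure <math>F</math>.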

Earlier we discussed the exchangeable sequence generated by the process encapsulated by the following code.

```scheme
(define (sample-coin a b)
  (let ((heads a)
        (total (+ a b)))
    (lambda ()
      (let ((x (flip (/ heads total))))
        (set! heads (+ heads (if x 1 0)))
        (set! total (+ total 1))
        x))))
```

De Finetti's theorem tells us that there is an alternate representation for this exchangeable sequence where the actual sequence itself is generated i.i.d. from some hidden distribution. We actually have already seen the corresponding De Finetti representation.
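That representation can be sketched as follows (assuming Church's `beta` and `flip` primitives):

```scheme
;; Draw the coin weight once from Beta(a, b), then return a closure
;; whose flips are i.i.d. given that weight.
(define (sample-coin a b)
  (let ((weight (beta a b)))
    (lambda () (flip weight))))
```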

This is the simple beta-Bernoulli model where we first draw the weight of a coin, and then package it inside of a closure. The earlier process where we updated counts after each sample is called the *Polya urn scheme* representation of the beta-Bernoulli.

The remarkable and surprising thing about these two programs is that they define *the exact same probability distribution over sequences*. In other words, in the limit, the urn scheme representation will converge on some ratio of `true` and `false` outcomes. The probability of it converging to a particular value is given by <math>Beta(a,b)</math>. Thus the distribution that it defines is also given by first drawing a weight from <math>Beta(a,b)</math> and then sampling the sequence i.i.d.

#### Example: Multinomial-Dirichlet Distributions

De Finetti's theorem tells us that every distribution that generates exchangeable sequences has such a form; we will see several examples of this as we continue the tutorial. For now, we note that the beta-Bernoulli model generalizes straightforwardly to the multinomial-Dirichlet distribution.

A natural prior distribution on multinomial distributions over <math>K</math> outcomes is the Dirichlet distribution, which defines a measure on length <math>K</math> vectors which sum to 1.
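A sketch of the hierarchical form of this model (assuming Church's `dirichlet` and `multinomial` primitives):

```scheme
;; Draw multinomial weights over K outcomes from a Dirichlet prior,
;; then return a closure that samples from them i.i.d.
(define (sample-die outcomes pseudo-counts)
  (let ((weights (dirichlet pseudo-counts)))
    (lambda () (multinomial outcomes weights))))
```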

This model also has a representation based on a generalized Polya urn scheme, in which the Dirichlet weights are integrated out. This is shown in the following sketch of Scheme code.

```scheme
(define (sample-die outcomes counts)
  (lambda ()
    (let* ((total (apply + counts))
           (weights (map (lambda (y) (/ y total)) counts))
           (x (multinomial outcomes weights)))
      ;; add 1 to the count of outcome x
      (set! counts
            (map (lambda (outcome count)
                   (if (equal? outcome x) (+ count 1) count))
                 outcomes counts))
      x)))
```

We start by sampling the first outcome from a normalized version of the pseudo-counts. After each sample we update the counts with what we just sampled, and the normalized version of these new counts becomes the distribution for the next sample.

#### De Finetti and Frequentist Statistics

Exchangeability is an important concept in Bayesian statistics. One of the reasons for this is that De Finetti's theorem clarifies the relationship between Bayesian and frequentist approaches.

In our discussion of De Finetti's theorem above we neglected to mention an important additional result. The directing random measure <math>\nu</math> is given by the following formula.

- <math>\nu(B) = \lim_{n \to +\infty} \frac{1}{n}\sum_{i=1}^{n} \chi_{B}(X_i) </math>

This equation says that the probability of a set <math>B</math> given by the distribution <math>\nu</math> is the proportion of the <math>X_i</math>'s in our data that fall within that set as we take the length of the sequence to infinity. In other words, the intuition that the probability of an outcome can be estimated by counting and dividing by the total is correct for exchangeable sequences.

In frequentist statistics one often sets up a model as follows. We observe some data <math>X_1, ..., X_n</math>. We assume that these data have been generated i.i.d. from some unknown distribution <math>\nu</math>. We *define* <math>\nu</math> as being the limiting proportions of each of the possible outcomes in <math>X_1, ..., X_n</math>.

De Finetti's theorem says that if a sequence of observations is exchangeable, then the frequentist setup is justified, because <math>\nu</math> exists with probability 1.

# XRPs

The preceding discussion raises an important issue that arises when working with exchangeable distributions. Consider a simple beta-Bernoulli model, such as the following.
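A sketch of such a model and query (reconstructed; assumes Church's `mh-query`, `beta`, and `flip` primitives):

```scheme
;; De Finetti representation: the beta draw is explicit, and the
;; flips are i.i.d. given the weight.
(mh-query 100 10
  (define weight (beta 1 1))
  (define my-coin (lambda () (flip weight)))
  (define observations (repeat 10 my-coin))
  (my-coin)
  (equal? observations
          '(true true true true true true true true true true)))
```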

We discussed earlier that in such a setting, when we condition on all the data being `true`, our intuitions tell us that the probability of the next data point being `true` is high.

Here we have used the De Finetti representation of this model, explicitly representing the draw from the beta distribution and sampling the sequence i.i.d. given this draw. However, we have also discussed the fact that there is another representation for the beta-Bernoulli: the Polya urn scheme.

```scheme
(define (sample-coin a b)
  (let ((heads a)
        (total (+ a b)))
    (lambda ()
      (let ((x (flip (/ heads total))))
        (set! heads (+ heads (if x 1 0)))
        (set! total (+ total 1))
        x))))
```

In the Polya urn scheme representation we update the counts of each outcome as we sample them. Suppose that we had a Church primitive which implemented the Polya urn scheme above. We could then write our query as follows.

```scheme
(repeated-mh-def-query 100 10
  '(define my-coin (sample-coin 1 1))
  '(define observations (repeat 10 my-coin))
  '(my-coin)
  '(equal? observations
           '(true true true true true true true true true true))
  (get-current-environment))
```

Here `sample-coin` has encapsulated some state which keeps track of the counts of outcomes as it generates them. What is the state of these counts after generating the data, i.e. what is the probability that the next observation is `true`?

After observing 10 `true`s, our counts will be 11 for `true` and 1 for `false`. This means that the next observation will have probability <math>\frac{11}{12}</math> of coming up `true`. To do inference in the De Finetti representation, we have to search over the space of possible draws from the beta distribution. The urn scheme representation, by contrast, has already integrated this variable away. This means that the posterior distribution over the next draw is explicitly represented by the state encapsulated inside the procedure implementing the urn scheme; we do not have to search over it.

The result can be a great gain in efficiency. In Church, representations of exchangeable distributions with encapsulated state can be provided via the mechanism of *Exchangeable Random Procedures* (XRPs).

From within a Church program, XRPs behave like ERPs, but for exchangeable (rather than i.i.d.) distributions. In general, any Church procedure for which we can compute the marginal score of each of its outcomes can be packaged behind the XRP interface and made available for use in normal Church programs. A number of useful XRPs are already provided in the XRP library.