# Simple Generative Models

By Timothy J. O'Donnell, Noah D. Goodman, Andreas Stuhlmueller, and the Church Working Group.

## Defining Simple Generative Models Using the $\lambda$-calculus, LISP, and Church

As our formal model of computation we start with the $\lambda$-calculus, and its embodiment in the LISP family of programming languages. The $\lambda$-calculus is a formal system which was invented by Alonzo Church in the 1920's. [1] Church introduced the $\lambda$-calculus as a model and formalization of computation, that is, as a model of the process of computing things like mathematical proofs. The $\lambda$-calculus was shown to have the same computational power as the Turing machine and vice versa. Thus the lambda calculus is a universal model of computation. In 1958 John McCarthy introduced LISP (LISt Processing), a programming language based on the $\lambda$-calculus.[2] Scheme is a variant of LISP developed by Guy L. Steele and Gerald Jay Sussman with particularly simple syntax and semantics. [3] We will use Scheme-style notation for the $\lambda$-calculus in this tutorial. The Church programming language—named in honor of Alonzo Church— is a generalization of Scheme which introduces the notion of probabilistic computation to the language. In the next few sections of this tutorial, we introduce the basics of Church programming and show how to use it to build generative models.

The $\lambda$-calculus formalizes the notion of computation using functions. Computation is performed in the $\lambda$-calculus by applying functions to arguments to compute results. Function application in Church looks like this:

{{#churchserv: (+ 1 1) }}

Here we see the function + applied to two arguments: 1 and 1. Note that in Church the name of the function comes first. This is sometimes called Polish notation. If you run this simple program above you will see the resulting value of this function application. + is a deterministic function. In Church to the set of Scheme primitive deterministic functions, we add a set of random primitives implementing random variables. In Church these random primitive functions are called Elementary Random Procedures (ERPs). Church also contains another kind of random primitive called an Exchangeable Random Procedure (XRP) which we will learn about later. In Church, application of an ERP results in a sample from the distribution over the random variable defined by that ERP. The simplest ERP is flip which results in a (possibly biased) coin toss. Note that we will let this coin be labeled with true and false (which look like this, #t and #f, in the output) on its two sides rather than heads and tails.

{{#churchserv: (flip) }}

Run this program a few times. You will get back a different sample on each execution. Also, notice the parentheses around flip. These are meaningful; they tell Church that you are asking for an application of the ERP flip. Delete the parentheses and see what happens. The giant object that was returned is the flip procedure object. Procedure is just another word for an actual implementation of some function.[4] In Church, each time you run a program you get a sample from the random variable that the program defines. If you run the program many times, and collect the values in a histogram, you will reconstruct the probability distribution that the program defines over the random variable.

{{#churchserv: (hist (repeat 1000 flip) "Flips") }}

Here we have used the repeat procedure which takes a number of repetitions, $K$, and a distribution (in this case flip) and returns a set of $K$ samples from that distribution. We have used the hist procedure to display the results of taking 1000 samples from flip. As you can see, the result is an approximately uniform distribution over true and false. An important idea here is that flip can be thought of in two different ways. From one perspective flip is a procedure which computes samples from a fair coin. That it is it is a sampler from a random variable with a certain probability distribution. From another perspective, flip is itself simply a characterization of the distribution over true and false. When we think about probabilistic programs we will often move back and forth between these two views, emphasizing either the sampling perspective or the distributional perspective.

A Church program is syntactically composed of nested expressions. Roughly speaking an expression is either a simple value, or anything between parentheses (). In a deterministic LISP, like Scheme, all expressions have a single fixed value which is computed by a process known as evaluation. For example, the branching expression, if, while not a function (because it does not evaluate all of its arguments, but instead short-circuits), still has a value.

{{#churchserv: (if (equal? 10 11)

100
'some-symbol)

}}

In the preceding code, equal? is a function which checks for structural equality between objects (as opposed to checking that they are the exact same object occupying the same spot in memory). It is a predicate—a function which returns true or false. By convention in Church, predicates end in the ? character. The value of the if expression is the symbol some-symbol. Notice the single quote in front of some-symbol; this tells Church to treat what follows the quote as that literal sequence of character, in this case it tells Church that some-symbol is a symbol and not a variable which needs to be de-referenced. Note that symbols in Church are a different data type from strings, which look like this "my-string".

{{#churchserv: (if (flip)

100
'some-symbol)

}}

If you run the (if (equal? ...)) program a thousand times, you will get back the same answer, because it is deterministic. In general in Church, on the other hand, each evaluation of an expression can return a different value. So the value of the if expression above changes from evaluation to evaluation. In Church, every evaluation of an expression is interpreted as sampling from random variable. This is important, because, as we will see below, it means Church allows us to define distributions over unbounded collections of random variables. So far we have only seen the unbiased form of flip. flip also has weighted version which can flip biased coins. For example, (flip .7) flips a coin that is weighted $.7$ true and $.3$ false.

### Making New Procedures with lambda

We often want to name objects in our programs so that they can be reused. This can be done with the define statement. define looks like this:

(define variable-name expression)

variable-name is a symbol and which is bound to the value of whatever expression evaluates to. When variables themselves are evaluated the return the value they have been bound to.

{{#churchserv: (define some-variable 'some-symbol)

(if (flip)

100
some-variable)

}}

If you run this program several times you will notice that it sometimes returns the symbol some-symbol. This is because when flip returns false the variable some-variable is evaluated, since it is bound to some-symbol, that value is returned.

Function application is one basic operation of computation in the $\lambda$-calculus. The basic operation which allows us to build up new procedures is called $\lambda$-abstraction. It is this operation which gives the $\lambda$-calculus its name and its power as a universal language.

{{#churchserv: (lambda (p x y) (if (flip p) x y)) }}

The lambda operator takes two arguments. First it takes a list of formal parameters, in this case p, x and y. Second, it takes the body of the function being defined. The body of the function can use the formal parameters as variables. Here we have defined a procedure which takes three arguments, the first is a weight, and the second and third are arbitrary objects. The body flips a coin with weight p and return the first object if it comes up true and the second object if it comes up false. The value of the lambda expression itself is a procedure object (like the one we saw earlier). This procedure object can be applied to arguments. Note that Church (like all LISPs) is dynamically typed, this means it is the caller's responsibility to make sure that the values passed into the formal parameters of a procedure are of the right type.

{{#churchserv: ((lambda (p x y) (if (flip p) x y)) .7 1 2) }}

Here we create a procedure object and immediately apply it to some arguments. This procedure never had a name. This kind of procedure is often called an anonymous procedure. It is often useful to name procedures so that we can reuse them later. Procedures are just values in Church like symbol, numbers, and strings, therefore we can use define to bind a variable name to them. We might call the procedure we have just defined choose.

{{#churchserv: (define choose

(lambda (p x y)
(if (flip p) x y)))

(choose .7 1 2) }}

$\lambda$-abstraction allows us to build more and more complex procedures from simpler ones.

{{#churchserv: (define choose

(lambda (p x y)
(if (flip) x y)))

(define choose-then-multiply

(lambda (p x y z)
(* (choose p x y) z)))

(choose-then-multiply .5 2 3 4) }}

Since (define procedure-name (lambda (...) ...)) is such a common idiom in Church the language provides some syntactic sugar for doing this. It looks like this.

{{#churchserv: (define (choose p x y)

(if (flip p) x y))

(define (choose-then-multiply p x y z)

(* (choose p x y) z))

(choose-then-multiply .5 2 3 4) }}

#### Example: Binomial Distributions

A well-know probability distribution is the binomial distribution. The binomial distribution is a distribution on natural numbers that is defined as the probability of $K$ successes in $N$ tosses of a biased coin with weight $p$. Each of these independent coin tosses is called a Bernoulli random variable. Thus the binomial is the sum of $N$ Bernoulli random variables. Here is the generative process for the binomial distribution.

{{#churchserv: (define (choose p x y)

(if (flip p) x y))

(define (bit) (choose .5 1 0))

(define (binomial N)

(apply + (repeat N bit)))

(hist (repeat 1000 (lambda () (binomial 10))) "Binomial") }}

Here we used choose to define bit which returns a count (1 or 0). We then sample N coin tosses using repeat and summed the number of 1's. Note that here we have used the following idiom.

{{#churchserv: (apply + '(1 2 3 4)) }}

In Church apply is just another procedure like any other. Thus we can call it directly rather than using the more traditional notation.

{{#churchserv: (+ 1 2 3 4) }}

The syntax of apply is:

(apply procedure-name list-of-arguments)

We used (apply proc ...) in this case rather than (proc ...) because repeat returned a list as output. However, + expects a number of independent values (not a list) as input. If we had tried to pass the list to + directly, we would have created a type error: (+ '(1 2 3 4)).

## Defining Hierarchical Generative Models

ERPs themselves represent simple generative processes. For example, flip (with no arguments) defines a generative process that generates observation from an unbiased coin. It models the (admittedly trivial) "causal process" of flipping an unbiased coin. We have just built the generative model for the binomial distribution, which involves summing a number of coin independent biased coin flips. The next level of complexity in generative processes comes from hierarchical generative models. These are generative processes where later random choices depend on earlier ones. To motivate hierarchical generative models let's start by making ourselves a biased coin.

{{#churchserv: (define my-coin (lambda () (flip .7)))

(my-coin) (my-coin)

}}

Our "coin" is a procedure of no arguments that can be called ("flipped") as many times as we like. In functional programming a procedure of no arguments like this one is called a thunk.

{{#churchserv: (define my-thunk

(lambda ()
(+ 1 2)))

(my-thunk) }}

A deterministic thunk can only ever return a single value. Thus in deterministic LISPs, like Scheme, we can think of thunks as constant values. For example, the thunk above is equivalent in meaning to the value 3 with respect to the behavior of a program it is used in. However, in Church, the meaning of a thunk is no longer a necessarily a single value. Thunks can return different values on different calls. In fact, as we will see later, in Church thunks are the definition of sampleable probability definition.

In defining my-coin as a thunk we "wrapped-up" or "encapsulated" the coin weight $.7$ inside of the procedure. In functional programming this construct, where we wrap up some information inside of a procedure is called a closure, because some information is "closed over" in the body of the function. This is a fundamental and important concept, as we will see shortly. It is not only constants like .7 which can be closed over, but also other kinds of information.

{{#churchserv: (define my-coin-weight .7) (define my-coin (lambda () (flip my-coin-weight)))

(my-coin) }}

Here we have captured a variable my-coin-weight in the closure, rather than a simple value. Using thunks and closures we can now implement our first hierarchical generative process. Let's imagine that we want to model a mint producing coins (with poor quality control). The coins will be mostly close to unbiased, but not all exactly weighted at .5. We can achieve this by modifying the proceeding code to first sample the weight of the coin from some distribution, and then wrap up that weight inside of the thunk. First, we will need a new ERP to sample the coin weight from. For this, we will use a beta distribution. A beta distribution is a distribution on real numbers between 0 and 1, thus it can be thought of as a distribution on coin weights. The beta distribution takes two parameters that are called pseudo-counts. Pseudo-counts can be thought of the number of heads or tails that came up in some prior set of coin tosses. Here are a few examples of different values for the parameters of the beta distribution.

To understand how the pseudo-counts work notice that as the proportion of pseudo-counts favors either heads or tails, then the beta distribution becomes more skewed towards coin weights which are biased in that direction. When the pseudo-counts are symmetric, then the distribution is peaked at .5. As the total number of pseudo-counts becomes greater then the distribution becomes more and more steeply peaked. For instance, $Beta(2,2)$ and $Beta(10,10)$ both have their mean at .5, but the latter has much lower variance. Note that $Beta(1,1)$ is the uniform distribution on the (0,1) interval. When the pseudo-counts are less than 1 then the distribution becomes more steeply peaked close to 0 and 1 (note that this is also where our interpretation of pseudo-counts as previous flips starts to break down).

{{#churchserv: (define my-coin-weight (first (beta 1 1))) (define my-coin (lambda () (flip my-coin-weight)))

(my-coin) }}

In this example it was extremely important that we used a closure. This allows all applications of my-coin to be flips of a coin with the same weight. If instead we had written this:

{{#churchserv: (define my-coin (lambda () (flip (beta 1 1))))

(my-coin) }}

Then each time we applied my-coin we would first sample a different coin weight. The beta distribution in these examples is what is called prior distribution on the coin weight. When we build hierarchical models we talk about each level being a prior on the one below.

### Binding Local Variables with let

This is good point to introduce another very important and useful kind of LISP expression. define let's us name global variables—variables which we have access to anywhere in the program—however, often we want to bind a variable in some local context. let allows us to do this. let has the following syntax:

(let ( (var-name-1 expression-1)
(var-name-2 expression-2)
...
(var-name-N expression-N) )
body-of-let-where-variables-have-scope )

The variables inside of the let expression only have scope until the end of the let expression. The value of the whole let-expression is just the value of the body—whatever the body evaluates to. We can use let to define our hierarchical coin.

{{#churchserv: (define my-coin (let ((weight (beta 1 1))) (lambda () (flip weight))))

(my-coin) }}

We use let to bind a draw from beta to the variable weight, we then enclose this variable in a closure and return it. There is something worth noting about this example. The value of the let-expression here is a procedure object returned by lambda. In LISP, procedures are first-class objects. This means that they can be the value of an expression, they can be passed to other procedures as arguments, they can be returned by procedures, and they can be stored as data.

let is actually just syntactic sugar for another variable binding construct we have already seen: $\lambda$-abstraction. The schematic let expression above could be rewritten as the following lambda-expression.

((lambda (var-name-1 var-name-2 ... var-name-N)
body-of-lambda-where-variables-have-scope ) expression-1 expression-2 ... expression-N)

Here we have created an anonymous procedure with formal parametersvar-name-1 var-name-2 ... var-name-N, and have applied this to the expressions that get bound into those variables as arguments. Internally Church turns let expressions into lambda expressions of this form, but it is provided because it is much easier to read.

Sometimes it is useful to refer to one of the earlier variables when defining later variables in a let-binding. To do this you must let* which has the following syntax.

(let* ( (var-name-1 expression-1)
(var-name-2 expression-2-using-variable-1)
...
(var-name-N expression-N-using-variables-1-to-(N-1) ))
body-of-let-where-variables-have-scope )

For example, suppose that we wanted to put a prior on our beta distribution pseudo-counts. A typical distribution for this purpose is the gamma distribution. We can do this using let*

{{#churchserv: (define my-coin (let* ((pseudo-head (gamma 1 2))

(pseudo-tail (gamma 1 2))

(lambda () (flip weight))))

(my-coin) }}

Ealier we claimed that the variables bound by a let only have scope inside the body of the let. One thing you may be wondering is how the procedure my-coin still has access to the values bound by the let after it has been returned. To explain this, we need to explain how variables work in Church. Scheme and Church are languages that are lexically or statically scoped. What this means is that value of a variable is the value that the variable had wherever it was defined. In other words, the value of the variables named weight inside of the closure is the value it had inside the let binding. In particular, this means that the following code works properly.

{{#churchserv: (define my-coin (let ((weight (beta 1 1))) (lambda () (flip weight))))

(define weight 'some-symbol)

(my-coin) }}

How do Scheme and Church implement this? All variable bindings are stored in a data structure called an environment. The environment is essentially a set of key-value pairs, where the keys are variable names and the values are the values that the variable names are bound to. When a procedure object is constructed in Church it gets a copy of the local environment that exists at the moment it is defined. When we call that procedure it looks up the variables in this local environment. Variable bindings in environments are ordered so that the "closest" binding is used. For example, the following code also still works correctly.

{{#churchserv: (define weight 'some-symbol)

(define my-coin (let ((weight (beta 1 1))) (lambda () (flip weight))))

(my-coin)

}}

The first binding to weight, bound by the define, is shadowed, by the second, so when we evaluate the procedure object that is bound to my-coin it always uses the more local, "closer" let-bound version of weight.

## Representing Structure with Pairs and Lists

So far we have seen examples of several different kinds of simple data types. We have seen booleans which take values in {true, false}, we have seen numbers, we have seen symbols, we have seen procedures, and we have mentioned strings. Another fundamental data type in Church is the pair. A pair is just a pair of values of any other data types. For example, here is a pair of a string and a symbol.

{{#churchserv: (pair "my-string" 'my-symbol) }}

Pairs can be used to build up a wide variety of other data structures. For example, we can use pair to represent the branching structure of syntactic trees like the following (note that we not representing constituent node labels).

{{#churchserv: (pair

(pair 'the 'chef )
(pair 'cooks
(pair 'the 'soup )))

}}

In modeling natural language we often wish to work with data that is sequential. For example, sentences can be represented as sequences of words, and words can be represented as sequences of phonemes. A fundamental data type in LISP that will serve our purposes is the list. We can build a list with the list constructor.

{{#churchserv: (list 'a 'b 1 2 3 "string") }}

We can access the first element of a list with the first procedure.

{{#churchserv: (first '(a b 1 2 3 "string")) }}

The procedure rest returns the list minus the first element.

{{#churchserv: (rest '(a b 1 2 3 "some-string")) }}

We can use the procedure list-ref to get the nth element of a list. Remember that lists are indexed starting from 0.

{{#churchserv: (list-ref '(a b 1 2 3 "some-string") 2) }}

When there are no more elements in a list then rest returns the empty or null list which looks like this: ().

{{#churchserv: (rest '(a)) }}

In LISP lists are represented as linked lists. This means that they consist of a series of pairs of values where the second member of each pair points to the next pair and the final one points to the empty list. For instance, this list-constructing code:

{{#churchserv: (list 12 99 37) }}

is equivalent to the following code:

{{#churchserv: (pair 12

(pair 99
(pair 37 '())))

}}

## Bayes Nets in Church

One popular formalism for representing hierarchical generative models is the Bayes net. A Bayes net, also called a Bayesian belief network or a directed graphical model is a graph which encodes the structure of a probabilistic model. It displays graphically which random variables depend on which other random variables. We will illustrate this by example. A textbook example of a scenario that can be represented in a Bayes net, due originally to Judea Pearl, is called the sprinkler problem. Imagine there is some grass lawn which you observe as being wet or dry one morning. You know that there are two possible causes if it is wet. First, it may be because it rained the night before, or, second, it may be because there was an automatic sprinkler system that ran the night before. Notice that while the variable "is the lawn wet" (wet) depends causally on both "did it rain" (rain) and "did the sprinkler run," (sprinkler), rain and sprinkler are independent (suppose that the sprinkler is automatic and not sensitive to the rain or not). Suppose that the prior probability of it raining in general is $.1$ and the probability of the automatic sprinkler running is $.9$. Furthermore, let's suppose there is a small chance of the grass being wet by some other cause (say $.1$).

This can be represented with the following Bayes net.

This Bayes net shows that the variable wet depends on the variables rain and sprinkler, but that they do not depend on one another. It also shows the distributions over the variables. These distributions are represented as conditional probability tables (CPTs). The CPTs for rain and sprinkler are very simple, they just specify the probabilities of each of the two possible outcomes, since neither of these variables depend on anything else. The CPT for wet is more complex. It specifies a conditional probability distribution over the state of the grass for each of the two possible values of rain and sprinkler. Rain or sprinkling cause the grass to be wet with probability 1.0. If neither variable is true then the there is still some small probability that the grass gets wet by other means— thus injecting noise into the model.

This Bayes net can be expressed in Church very compactly.

{{#churchserv: (define rain (flip .1)) (define sprinkler (flip .9)) (define wet (if (or rain sprinkler) true (flip .1))) wet }}

An important observation here is that this Church program actually expresses more knowledge about the generative process than the Bayes net above. In particular, the Bayes net encodes the set of conditional probability distributions as a simple table. The Church program on the other hand gives an explicit formula for sampling the value of wet given the value of the other two variables. This formula encodes the intuitive knowledge: "if it rains or the sprinkler runs, then the grass will definitely be wet, but there is some small chance it might be wet even if none of those things happen." In general, Bayes nets represent dependencies between random variables while hiding much of the computational detail which is actually necessary to compute the values of one variable given another. To take another example consider the binomial distribution we studied above. The Bayes net for this distribution would have $N$ nodes, representing the Bernoulli trials all pointing at another node representing the sum. The sum node would have a CPT which gave a probability for each possible total count. However, nowhere in the Bayes net would it be expressed that the operation linking the Bernoulli nodes and the sum node is addition.

1. Cardone and Hindley, 2006. History of Lambda-calculus and Combinatory Logic. In Gabbay and Woods (eds.), Handbook of the History of Logic, vol. 5. Elsevier.
2. Wikipedia article on LISP.
3. Wikipedia article on Scheme.
4. Technically two mathematical functions are the same if they define the same input-output mappings. Two procedures can still be different if they "work in different ways."