# Simple Generative Models


By Timothy J. O'Donnell, Noah D. Goodman, Andreas Stuhlmueller, and the Church Working Group.


## Defining Simple Generative Models Using the <math>\lambda</math>-calculus, LISP, and Church

As our formal model of computation we start with the <math>\lambda</math>-calculus, and its embodiment in the LISP family of programming languages. The <math>\lambda</math>-calculus is a formal system invented by Alonzo Church in the 1930's.^{[1]} Church introduced the <math>\lambda</math>-calculus as a model and formalization of computation, that is, as a model of the process of computing things like mathematical proofs. The <math>\lambda</math>-calculus was shown to have the same computational power as the Turing machine, and vice versa; thus the <math>\lambda</math>-calculus is a *universal* model of computation. In 1958 John McCarthy introduced LISP (**LIS**t **P**rocessing), a programming language based on the <math>\lambda</math>-calculus.^{[2]} Scheme is a variant of LISP, developed by Guy L. Steele and Gerald Jay Sussman, with particularly simple syntax and semantics.^{[3]} We will use Scheme-style notation for the <math>\lambda</math>-calculus in this tutorial. The Church programming language—named in honor of Alonzo Church—is a generalization of Scheme which introduces the notion of probabilistic computation to the language. In the next few sections of this tutorial, we introduce the basics of Church programming and show how to use it to build generative models.

The <math>\lambda</math>-calculus formalizes the notion of computation using *functions.* Computation is performed in the <math>\lambda</math>-calculus by applying functions to arguments to compute results. Function application in Church looks like this:

{{#churchserv: (+ 1 1) }}

Here we see the function `+` applied to two arguments: `1` and `1`. Note that in Church the name of the function comes *first*. This is sometimes called *Polish notation*. If you run this simple program you will see the resulting *value* of this function application. `+` is a *deterministic* function. In Church, to the set of deterministic functions that Scheme provides as primitives, we add a set of random primitives implementing *random variables.* In Church these random primitive functions are called *Elementary Random Procedures* (ERPs). Church also contains another kind of random primitive called an *Exchangeable Random Procedure* (XRP) which we will learn about later. In Church, applying an ERP results in a *sample* from the distribution over the random variable defined by that ERP. The simplest ERP is `flip`, which simulates a (possibly biased) coin toss. Note that we will let this coin be labeled with `true` and `false` (which look like this, `#t` and `#f`, in the output) on its two sides rather than *heads* and *tails*.

{{#churchserv: (flip) }}

Run this program a few times. You will get back a different sample on each execution. Also, notice the parentheses around `flip`. These are meaningful; they tell Church that you are asking for an application of the ERP `flip`. Delete the parentheses and see what happens. The giant object that is returned is the `flip` *procedure* object. Procedure is just another word for an actual implementation of some function.^{[4]} In Church, each time you run a program you get a sample from the random variable that the program defines. If you run the program many times, and collect the values in a histogram, you will approximate the probability distribution that the program defines over the random variable.

{{#churchserv: (hist (repeat 1000 flip) "Flips") }}

Here we have used the `repeat` procedure, which takes a number of repetitions, <math>K</math>, and a procedure (in this case `flip`) and returns a list of <math>K</math> samples from that procedure. We have used the `hist` procedure to display the results of taking 1000 samples from `flip`. As you can see, the result is an approximately uniform distribution over `true` and `false`. An important idea here is that `flip` can be thought of in two different ways. From one perspective, `flip` is a procedure which computes samples from a fair coin. That is, it is a *sampler* for a random variable with a certain probability distribution. From another perspective, `flip` is *itself* simply a characterization of the distribution over `true` and `false`. When we think about probabilistic programs we will often move back and forth between these two views, emphasizing either the sampling perspective or the distributional perspective.

A Church program is syntactically composed of nested *expressions.* Roughly speaking an expression is either a simple value, or anything between parentheses `()`. In a deterministic LISP, like Scheme, all expressions have a single fixed value which is computed by a process known as *evaluation*. For example, the branching expression, `if`, while not a function (because it does not evaluate all of its arguments, but instead *short-circuits*), still has a value.

{{#churchserv: (if (equal? 10 11)
    100
    'some-symbol) }}

In the preceding code, `equal?` is a function which checks for *structural* equality between objects (as opposed to checking that they are the exact same object occupying the same spot in memory). It is a *predicate*—a function which returns `true` or `false`. By convention in Church, the names of predicates end in the `?` character. The value of the `if` expression is the *symbol* `some-symbol`. Notice the single quote in front of `some-symbol`; it tells Church to treat what follows the quote as a literal sequence of characters—in this case, it tells Church that `some-symbol` is a symbol and not a variable which needs to be dereferenced. Note that symbols in Church are a different data type from *strings*, which look like this: `"my-string"`.

{{#churchserv: (if (flip)
    100
    'some-symbol) }}

If you run the `(if (equal? ...))` program a thousand times, you will get back the same answer, because it is deterministic. In Church, on the other hand, each *evaluation* of an expression can in general return a different value, so the value of the `if` expression above changes from evaluation to evaluation. In Church, every evaluation of an expression is interpreted as sampling from a random variable. This is important because, as we will see below, it means Church allows us to define distributions over unbounded collections of random variables. So far we have only seen the unbiased form of `flip`. `flip` also has a weighted version which can flip biased coins. For example, `(flip .7)` flips a coin that comes up `true` with probability <math>.7</math> and `false` with probability <math>.3</math>.
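To see the effect of the weight, we can histogram many samples from a biased coin, just as we did for the unbiased `flip` above. (This uses the same `hist` and `repeat` procedures introduced earlier; roughly 70% of the samples should come out `true`.)

{{#churchserv: (hist (repeat 1000 (lambda () (flip .7))) "Weighted flips") }}

Note that we wrap `(flip .7)` in a `lambda` of no arguments because `repeat` expects a procedure that it can call repeatedly, not a value.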

### Making New Procedures with `lambda`

We often want to name objects in our programs so that they can be reused. This can be done with the `define` statement. `define` looks like this:

(define variable-name expression)

`variable-name` is a *symbol* which is bound to the value of whatever `expression` evaluates to. When variables themselves are evaluated, they return the value they have been bound to.

{{#churchserv: (define some-variable 'some-symbol)

(if (flip)
    100
    some-variable) }}

If you run this program several times you will notice that it sometimes returns the symbol `some-symbol`. This is because when `flip` returns `false` the variable `some-variable` is evaluated; since it is bound to `some-symbol`, that value is returned.

Function application is one basic operation of computation in the <math>\lambda</math>-calculus. The basic operation which allows us to build up new procedures is called *<math>\lambda</math>-abstraction*. It is this operation which gives the <math>\lambda</math>-calculus its name and its power as a universal language.

{{#churchserv: (lambda (p x y) (if (flip p) x y)) }}

The `lambda` operator takes two arguments. First, it takes a list of *formal parameters,* in this case `p`, `x` and `y`. Second, it takes the *body* of the function being defined. The body of the function can use the formal parameters as variables. Here we have defined a procedure which takes three arguments: the first is a weight, and the second and third are arbitrary objects. The body flips a coin with weight `p` and returns the first object if it comes up `true` and the second object if it comes up `false`.
The value of the `lambda` expression itself is a procedure object (like the one we saw earlier). This procedure object can be applied to arguments. Note that Church (like all LISPs) is *dynamically typed*; this means it is the caller's responsibility to make sure that the values passed into the formal parameters of a procedure are of the right type.

{{#churchserv: ((lambda (p x y) (if (flip p) x y)) .7 1 2) }}

Here we create a procedure object and immediately apply it to some arguments. This procedure never had a name; this kind of procedure is often called an *anonymous procedure*. It is often useful to name procedures so that we can reuse them later. Procedures are just values in Church, like symbols, numbers, and strings, so we can use `define` to bind a variable name to them. We might call the procedure we have just defined `choose`.

{{#churchserv: (define choose
  (lambda (p x y) (if (flip p) x y)))

(choose .7 1 2) }}

<math>\lambda</math>-abstraction allows us to build more and more complex procedures from simpler ones.

{{#churchserv: (define choose
  (lambda (p x y) (if (flip p) x y)))

(define choose-then-multiply
  (lambda (p x y z) (* (choose p x y) z)))

(choose-then-multiply .5 2 3 4) }}

Since `(define procedure-name (lambda (...) ...))` is such a common idiom in Church the language provides some syntactic sugar for doing this. It looks like this.

{{#churchserv: (define (choose p x y)
  (if (flip p) x y))

(define (choose-then-multiply p x y z)
  (* (choose p x y) z))

(choose-then-multiply .5 2 3 4) }}

#### Example: Binomial Distributions

A well-known probability distribution is the *binomial distribution*. The binomial distribution is a distribution on natural numbers, defined as the probability of <math>K</math> successes in <math>N</math> tosses of a biased coin with weight <math>p</math>. Each of these independent coin tosses is called a *Bernoulli* random variable; thus the binomial is the sum of <math>N</math> Bernoulli random variables. Here is the generative process for the binomial distribution.

{{#churchserv: (define (choose p x y)
  (if (flip p) x y))

(define (bit) (choose .5 1 0))

(define (binomial N)
  (apply + (repeat N bit)))

(hist (repeat 1000 (lambda () (binomial 10))) "Binomial") }}

Here we used `choose` to define `bit`, which returns a count (1 or 0). We then sample `N` coin tosses using `repeat` and sum the number of 1's. Note that here we have used the following idiom.

{{#churchserv: (apply + '(1 2 3 4)) }}

In Church `apply` is just another procedure like any other. Thus we can call it directly rather than using the more traditional notation.

{{#churchserv: (+ 1 2 3 4) }}

The syntax of `apply` is:

(apply procedure-name list-of-arguments)

We used `(apply proc ...)` in this case rather than `(proc ...)` because `repeat` returns a *list* as output, while `+` expects a number of separate values (not a list) as input. If we had tried to pass the list to `+` directly, as in `(+ '(1 2 3 4))`, we would have caused a type error.

## Defining Hierarchical Generative Models

ERPs themselves represent simple generative processes. For example, `flip` (with no arguments) defines a generative process that generates observations from an unbiased coin. It models the (admittedly trivial) "causal process" of flipping an unbiased coin. We have just built the generative model for the binomial distribution, which involves summing a number of independent coin flips. The next level of complexity in generative processes comes from hierarchical generative models. These are generative processes where later random choices depend on earlier ones. To motivate hierarchical generative models, let's start by making ourselves a biased coin.

{{#churchserv: (define my-coin (lambda () (flip .7)))

(my-coin)

(my-coin) }}

Our "coin" is a procedure of no arguments that can be called ("flipped") as many times as we like. In functional programming a procedure of no arguments like this one is called a *thunk.*

{{#churchserv: (define my-thunk
  (lambda () (+ 1 2)))

(my-thunk) }}

A deterministic thunk can only ever return a single value. Thus in deterministic LISPs, like Scheme, we can think of thunks as constant values. For example, the thunk above is equivalent in meaning to the value `3` with respect to the behavior of any program it is used in. In Church, however, the meaning of a thunk is no longer necessarily a single value: thunks can return different values on different calls. In fact, as we will see later, in Church thunks serve as the *definition* of sampleable probability distributions.

In defining `my-coin` as a thunk we "wrapped up" or "encapsulated" the coin weight <math>.7</math> inside of the procedure. In functional programming this construct, where we wrap up some information inside of a procedure, is called a *closure*, because some information is "closed over" in the body of the function. This is a fundamental and important concept, as we will see shortly. It is not only constants like .7 which can be closed over, but also other kinds of information.

{{#churchserv: (define my-coin-weight .7)

(define my-coin (lambda () (flip my-coin-weight)))

(my-coin) }}

Here we have captured a variable, `my-coin-weight`, in the closure, rather than a simple value. Using thunks and closures we can now implement our first hierarchical generative process. Let's imagine that we want to model a mint producing coins (with poor quality control). The coins will be mostly close to unbiased, but not all exactly weighted at .5. We can achieve this by modifying the preceding code to first sample the weight of the coin from some distribution, and then wrap up that weight inside of the thunk. First, we will need a new ERP to sample the coin weight from. For this, we will use a *beta* distribution. A beta distribution is a distribution on real numbers between 0 and 1; thus it can be thought of as a distribution on coin weights. The beta distribution takes two parameters called *pseudo-counts*. Pseudo-counts can be thought of as the number of heads or tails that came up in some prior set of coin tosses.

To understand how the pseudo-counts work, notice that as the proportion of pseudo-counts favors either heads or tails, the beta distribution becomes more skewed towards coin weights which are biased in that direction. When the pseudo-counts are symmetric, the distribution is peaked at .5. As the **total** number of pseudo-counts grows, the distribution becomes more and more steeply peaked. For instance, <math>Beta(2,2)</math> and <math>Beta(10,10)</math> both have their mean at .5, but the latter has much lower variance. Note that <math>Beta(1,1)</math> is the uniform distribution on the (0,1) interval. When the pseudo-counts are less than 1, the distribution becomes steeply peaked close to 0 and 1 (note that this is also where our interpretation of pseudo-counts as previous flips starts to break down).
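These effects can be visualized by histogramming samples from beta distributions with different pseudo-counts. (A sketch using the `hist` and `repeat` procedures from above; as in the example below, we take the `first` element of the value returned by `beta`.)

{{#churchserv: (hist (repeat 1000 (lambda () (first (beta 1 1)))) "Beta(1,1)")

(hist (repeat 1000 (lambda () (first (beta 10 10)))) "Beta(10,10)")

(hist (repeat 1000 (lambda () (first (beta 10 2)))) "Beta(10,2)") }}

The first histogram should look approximately uniform, the second peaked at .5, and the third skewed towards weights biased in favor of heads.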

{{#churchserv: (define my-coin-weight (first (beta 1 1)))

(define my-coin (lambda () (flip my-coin-weight)))

(my-coin) }}

In this example it was extremely important that we used a closure. This allows all applications of `my-coin` to be flips of a coin with the *same weight*. If instead we had written this:

{{#churchserv: (define my-coin (lambda () (flip (beta 1 1))))

(my-coin) }}

Then each time we applied `my-coin` we would first sample a different coin weight. The beta distribution in these examples is what is called a *prior distribution* on the coin weight. When we build hierarchical models we talk about each level being a *prior* on the one below.

### Binding Local Variables with `let`

This is a good point to introduce another very important and useful kind of LISP expression. `define` lets us name *global* variables—variables which we have access to anywhere in the program—however, often we want to bind a variable in some local context. `let` allows us to do this. `let` has the following syntax:

(let ((var-name-1 expression-1)
      (var-name-2 expression-2)
      ...
      (var-name-N expression-N))
  body-of-let-where-variables-have-scope)

The variables inside of the `let` expression only have scope until the end of the `let` expression. The value of the whole `let`-expression is just the value of the body—whatever the body evaluates to. We can use `let` to define our hierarchical coin.

{{#churchserv: (define my-coin
  (let ((weight (beta 1 1)))
    (lambda () (flip weight))))

(my-coin) }}

We use `let` to bind a draw from `beta` to the variable `weight`; we then enclose this variable in a closure and return it. There is something worth noting about this example. The value of the `let`-expression here is a **procedure** object returned by `lambda`. In LISP, procedures are *first-class objects*. This means that they can be the value of an expression, they can be passed to other procedures as arguments, they can be returned by procedures, and they can be stored as data.

`let` is actually just syntactic sugar for another variable binding construct we have already seen: <math>\lambda</math>-abstraction. The schematic `let` expression above could be rewritten as the following `lambda`-expression.

((lambda (var-name-1 var-name-2 ... var-name-N)
   body-of-lambda-where-variables-have-scope)
 expression-1 expression-2 ... expression-N)

Here we have created an anonymous procedure with formal parameters `var-name-1 var-name-2 ... var-name-N`, and have applied it to the expressions whose values get bound to those variables as arguments. Internally Church turns `let` expressions into `lambda` expressions of this form, but `let` is provided because it is much easier to read.
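For instance, the hierarchical coin defined with `let` above could be written by applying an anonymous procedure directly. (This is a sketch of the desugaring, not how you would normally write it.)

{{#churchserv: (define my-coin
  ((lambda (weight) (lambda () (flip weight)))
   (beta 1 1)))

(my-coin) }}

Here the outer `lambda` plays the role of the `let`: the draw from `beta` is passed in as the argument `weight`, which the inner thunk then closes over.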

Sometimes it is useful to refer to one of the earlier variables when defining later variables in a `let`-binding. To do this you must use `let*`, which has the following syntax.

(let* ((var-name-1 expression-1)
       (var-name-2 expression-2-using-variable-1)
       ...
       (var-name-N expression-N-using-variables-1-to-(N-1)))
  body-of-let-where-variables-have-scope)

For example, suppose that we wanted to put a prior on our beta distribution pseudo-counts. A typical distribution for this purpose is the *gamma distribution*. We can do this using `let*`.

{{#churchserv: (define my-coin
  (let* ((pseudo-head (gamma 1 2))
         (pseudo-tail (gamma 1 2))
         (weight (beta pseudo-head pseudo-tail)))
    (lambda () (flip weight))))

(my-coin) }}

Earlier we claimed that the variables bound by a `let` only have scope inside the body of the `let`. One thing you may be wondering is how the procedure `my-coin` still has access to the values bound by the `let` after it has been returned. To explain this, we need to explain how variables work in Church.
Scheme and Church are languages that are *lexically* or *statically scoped*. What this means is that the value of a variable is the value that the variable had wherever it was defined. In other words, the value of the variable named `weight` inside of the closure is the value it had inside the `let` binding. In particular, this means that the following code works properly.

{{#churchserv: (define my-coin
  (let ((weight (beta 1 1)))
    (lambda () (flip weight))))

(define weight 'some-symbol)

(my-coin) }}

How do Scheme and Church implement this? All variable bindings are stored in a data structure called an *environment*. The environment is essentially a set of key-value pairs, where the keys are variable names and the values are the values that the variable names are bound to. When a procedure object is constructed in Church it gets a copy of the local environment that exists *at the moment it is defined*. When we call that procedure it looks up the variables in this local environment. Variable bindings in environments are *ordered* so that the "closest" binding is used. For example, the following code also still works correctly.

{{#churchserv: (define weight 'some-symbol)

(define my-coin
  (let ((weight (beta 1 1)))
    (lambda () (flip weight))))

(my-coin) }}

The first binding to `weight`, made by the `define`, is *shadowed* by the second, so when we evaluate the procedure object that is bound to `my-coin` it always uses the more local, "closer" `let`-bound version of `weight`.

## Representing Structure with Pairs and Lists

So far we have seen examples of several different kinds of simple data types. We have seen *booleans*, which take values in {`true`, `false`}; we have seen *numbers*, *symbols*, and *procedures*; and we have mentioned *strings*. Another fundamental data type in Church is the *pair*. A pair is just a pairing of two values, each of which can be of any data type. For example, here is a pair of a string and a symbol.

{{#churchserv: (pair "my-string" 'my-symbol) }}

Pairs can be used to build up a wide variety of other data structures. For example, we can use `pair` to represent the branching structure of syntactic trees like the following (note that we are not representing constituent node labels).

{{#churchserv: (pair (pair 'the 'chef)
      (pair 'cooks (pair 'the 'soup))) }}

In modeling natural language we often wish to work with data that is sequential. For example, sentences can be represented as sequences of words, and words can be represented as sequences of phonemes. A fundamental data type in LISP that will serve our purposes is the *list*. We can build a list with the `list` constructor.

{{#churchserv: (list 'a 'b 1 2 3 "string") }}

We can access the first element of a list with the `first` procedure.

{{#churchserv: (first '(a b 1 2 3 "string")) }}

The procedure `rest` returns the list minus the first element.

{{#churchserv: (rest '(a b 1 2 3 "some-string")) }}

We can use the procedure `list-ref` to get the nth element of a list. Remember that lists are indexed starting from 0.

{{#churchserv: (list-ref '(a b 1 2 3 "some-string") 2) }}

When there are no more elements in a list, `rest` returns the empty or null list, which looks like this: `()`.

{{#churchserv: (rest '(a)) }}

In LISP lists are represented as *linked lists.* This means that they consist of a series of pairs of values where the second member of each pair points to the next pair and the final one points to the empty list. For instance, this list-constructing code:

{{#churchserv: (list 12 99 37) }}

is equivalent to the following code:

{{#churchserv: (pair 12 (pair 99 (pair 37 '()))) }}

## Bayes Nets in Church

One popular formalism for representing hierarchical generative models is the *Bayes net*. A Bayes net, also called a *Bayesian belief network* or a *directed graphical model*, is a graph which encodes the structure of a probabilistic model. It displays graphically which random variables depend on which other random variables. We will illustrate this by example. A textbook example of a scenario that can be represented in a Bayes net, due originally to Judea Pearl, is the *sprinkler problem.* Imagine there is some grass lawn which you observe as being wet or dry one morning. If it is wet, you know that there are two possible causes: first, it may be because it rained the night before, or, second, it may be because an automatic sprinkler system ran the night before. Notice that while the variable "is the lawn wet" (`wet`) depends causally on both "did it rain" (`rain`) and "did the sprinkler run" (`sprinkler`), `rain` and `sprinkler` are independent (suppose that the sprinkler is on a timer and is not sensitive to whether it rains). Suppose that the prior probability of it raining is <math>.1</math> and the probability of the automatic sprinkler running is <math>.9</math>. Furthermore, let's suppose there is a small chance (say <math>.1</math>) of the grass being wet from some other cause.

This scenario can be represented as a Bayes net in which `rain` and `sprinkler` are parent nodes, each with an arrow pointing into the child node `wet`.

This Bayes net shows that the variable `wet` depends on the variables `rain` and `sprinkler`, but that they do not depend on one another. It also shows the distributions over the variables. These distributions are represented as *conditional probability tables* (CPTs). The CPTs for `rain` and `sprinkler` are very simple: they just specify the probabilities of each of the two possible outcomes, since neither of these variables depends on anything else. The CPT for `wet` is more complex. It specifies a *conditional probability distribution* over the state of the grass for each combination of values of `rain` and `sprinkler`. Rain or sprinkling causes the grass to be wet with probability 1.0. If neither variable is true then there is still some small probability that the grass gets wet by other means—thus injecting *noise* into the model.

This Bayes net can be expressed in Church very compactly.

{{#churchserv: (define rain (flip .1))

(define sprinkler (flip .9))

(define wet (if (or rain sprinkler) true (flip .1)))

wet }}

An important observation here is that this Church program actually expresses more knowledge about the generative process than the Bayes net above. In particular, the Bayes net encodes the set of conditional probability distributions as a simple table. The Church program, on the other hand, gives an explicit formula for sampling the value of `wet` given the values of the other two variables. This formula encodes the intuitive knowledge: "if it rains or the sprinkler runs, then the grass will definitely be wet, but there is some small chance it might be wet even if neither of those things happens."
In general, Bayes nets represent dependencies between random variables while hiding much of the computational detail which is actually necessary to compute the values of one variable given another. To take another example, consider the binomial distribution we studied above. The Bayes net for this distribution would have <math>N</math> nodes representing the Bernoulli trials, all pointing at another node representing the sum. The sum node would have a CPT giving a probability for each possible total count. However, nowhere in the Bayes net would it be expressed that the *operation* linking the Bernoulli nodes and the sum node is addition.

- ↑ Cardone and Hindley, 2006. History of Lambda-calculus and Combinatory Logic. In Gabbay and Woods (eds.), *Handbook of the History of Logic*, vol. 5. Elsevier.
- ↑ Wikipedia article on LISP.
- ↑ Wikipedia article on Scheme.
- ↑ Technically, two mathematical functions are the same if they define the same input-output mappings. Two procedures can still be different if they "work in different ways."