# Generative Models



## Defining Simple Generative Models

As our formal model of computation we start with the <math>\lambda</math>-calculus, and its embodiment in the LISP family of programming languages. The <math>\lambda</math>-calculus is a formal system invented by Alonzo Church in the 1920s.^{[1]} Church introduced the <math>\lambda</math>-calculus as a model and formalization of computation, that is, as a way of formalizing the notion of an effectively computable function. The <math>\lambda</math>-calculus is a *universal* model of computation—it is conjectured to be equivalent to all other notions of classical computation (the <math>\lambda</math>-calculus was shown to have the same computational power as the Turing machine, and vice versa, by Alan Turing in the famous paper that introduced the Turing machine). It is remarkable that the <math>\lambda</math>-calculus is universal because it has only two basic operations: creating and applying functions.

In 1958 John McCarthy introduced LISP (**LIS**t **P**rocessing), a programming language based on the <math>\lambda</math>-calculus.^{[2]} Scheme is a variant of LISP developed by Guy L. Steele and Gerald Jay Sussman with particularly simple syntax and semantics.^{[3]} (For a simple tutorial on the Scheme language, see [1] or [2].) We will use Scheme-style notation for the <math>\lambda</math>-calculus in this tutorial. The Church programming language—named in honor of Alonzo Church—is a generalization of Scheme which introduces the notion of probabilistic computation to the language. In the next few sections of this tutorial, we introduce the basics of Church programming and show how to use it to build generative models.

The <math>\lambda</math>-calculus formalizes the notion of computation using *functions.* Computation is performed in the <math>\lambda</math>-calculus by applying functions to arguments to compute results. Function application in Church looks like this:
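For example, adding two numbers:

```scheme
(+ 1 1)
```

This application evaluates to `2`.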

Here we see the function `+` applied to two arguments: `1` and `1`. Note that in Church the name of the function comes *first* and the parentheses go outside the function. This is sometimes called *Polish notation*. If you run this simple program above you will see the resulting *value* of this function application. Since `+` is a *deterministic* function you will always get the same return value if you run the program many times.

In Church, in addition to deterministic functions, we have a set of random functions implementing *random choices.* These random primitive functions are called *Exchangeable Random Primitives* (XRPs). Application of an XRP results in a *sample* from the probability distribution defined by that XRP. For example, the simplest XRP is `flip` which results in either true or false—it simulates a (possibly biased) coin toss. (Note that the return values `true` and `false` will look like this in the output: `#t` and `#f`.)
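Applying `flip` looks like this:

```scheme
(flip)
```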

Run this program a few times. You will get back a different sample on each execution. Also, notice the parentheses around `flip`. These are meaningful; they tell Church that you are asking for an application of the XRP `flip`. Delete the parentheses and see what happens. The object that was returned is the `flip` *procedure* object. *Procedure* is just another word for an actual implementation of some function. In Church, each time you run a program you get a *sample* by simulating the computations and random choices that the program specifies. If you run the program many times, and collect the values in a histogram, you can see what a typical sample looks like:
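A sketch of such a program, using the `repeat` and `hist` procedures described next:

```scheme
(hist (repeat 1000 flip) "Flips")
```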

Here we have used the `repeat` procedure which takes a number of repetitions, <math>K</math>, and a random distribution (in this case `flip`) and returns a list of <math>K</math> samples from that distribution. We have used the `hist` procedure to display the results of taking 1000 samples from `flip`. As you can see, the result is an approximately uniform distribution over `true` and `false`.

An important idea here is that `flip` can be thought of in two different ways. From one perspective, `flip` is a procedure which returns a sample from a fair coin. That is, it's a *sampler* or *simulator*. From another perspective, `flip` is *itself* a characterization of the distribution over `true` and `false`. When we think about probabilistic programs we will often move back and forth between these two views, emphasizing either the sampling perspective or the distributional perspective. (With suitable restrictions this duality is complete: any Church program implicitly represents a distribution and any distribution can be represented by a Church program. See The Meaning of Probabilistic Programs for more details on this duality.) We return to this relationship between probability and simulation below.

The `flip` function is the simplest XRP in Church, but you will find other XRPs corresponding to familiar probability distributions, such as `gaussian`, `gamma`, `dirichlet`, and so on.

Many but not all of the XRPs and other basic functions implemented in Church can be found on the Automatically generated documentation page.

### Building More Complex Programs

A Church program is syntactically composed of nested *expressions.* Roughly speaking an expression is either a simple value, or anything between parentheses `()`. In a deterministic LISP, like Scheme, all expressions without free variables have a single fixed value which is computed by a process known as *evaluation*. For example, consider the following more complex expression:
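Such an expression might look like:

```scheme
(and true (or true false))
```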

This expression has an *operator*, the logical function `and`, and *arguments*, `true` and the *subexpression* which is itself an application of `or`. When reasoning about evaluation, it is best to think of evaluating the subexpressions first, then applying the function to the return values of its arguments. In this example `or` is first applied to `true` and `false`, returning true, then `and` is applied to `true` and the subexpression's return value, again returning true.

For another example, consider:
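A sketch consistent with the description that follows (the weight 0.7 is stated below; the branch contents are illustrative assumptions):

```scheme
(if (flip 0.7)   ;; condition: true with probability 0.7
    (flip)       ;; evaluated if the condition is true
    false)       ;; evaluated otherwise
```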

This expression is composed of an `if` conditional that evaluates the first expression (a flip here) then evaluates the second expression if the first is true or otherwise evaluates the third expression.^{[4]}
(We have also used comments here: anything after a semicolon is ignored when evaluating.) Notice that the first `flip` has an argument: flip with an argument is a biased random choice. In this case this flip will be true with probability 0.7.

We often want to name objects in our programs so that they can be reused. This can be done with the `define` statement. `define` looks like this:

(define variable-name expression)

`variable-name` is a *symbol* that is bound to the value that `expression` evaluates to. When variables themselves are evaluated they return the value that they have been bound to:
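A sketch of the kind of program described next (the value 10 is stated below; the string is an illustrative assumption):

```scheme
(define some-variable 10)  ;; bind the symbol some-variable to the value 10

(if (flip)
    "hello world"
    some-variable)
```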

If you run this program several times you will notice that it sometimes returns 10. This is because when `flip` returns `false` the variable `some-variable` is evaluated, and since it is bound to 10, that value is returned.

### Lists and symbols

There are several special kinds of values in Church. One kind of special value is a *list*: a sequence of other values. Lists can be built using the `list` function and manipulated using functions such as `first` (get the first element in a list) and `rest` (get the rest of the list after the first element).
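For example (the particular strings are illustrative assumptions):

```scheme
(define my-list (list "a" "b" "c"))

(first my-list)   ;; => "a"
(rest my-list)    ;; => ("b" "c")
```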
Experiment with different combinations of these functions. What happens when you apply `list` to the result of `list`? (In this program we have used another kind of special value: a string, which is anything between double-quotes.)

Lists are a fundamental data structure in LISP-like languages. In fact, a Church program is just a list—the list of expressions comprising the program.

Sometimes we wish to treat a piece of Church code "literally", that is, to view it as a value rather than an expression. We can do this using a special `quote` operation (which can also be written with a single quote: `(quote ...)` is the same as `'...`):
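A sketch of such a quoted value (the particular list is an illustrative assumption):

```scheme
(define quoted-value '(1 2 3))

(first quoted-value)
```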
What is the value of `quoted-value`? What happens if you remove the quote? Why?

If we quote the name of a variable, what we get back is a symbol: a value which is the literal variable, rather than the value it is bound to. A symbol is just an identifier; it will be equal to itself, but to no other symbol. The ability to make new symbols as needed is a crucial feature in building models that reason over unbounded worlds, as we'll see below.
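A minimal check (the symbol names are illustrative):

```scheme
(equal? 'bob 'bob)   ;; => #t: a symbol is equal to itself
(equal? 'bob 'jim)   ;; => #f: but to no other symbol
```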

### Example: Causal Models in Medical Diagnosis

Generative knowledge is often *causal* knowledge. As an example of how causal knowledge can be encoded in Church expressions, consider a simplified medical scenario:
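A sketch of the scenario described below (the base rate for `lung-cancer` is stated later in this section; the rate for `cold` is an assumption):

```scheme
(define lung-cancer (flip 0.01))  ;; rare disease
(define cold (flip 0.2))          ;; common disease (assumed rate)

(define cough (or cold lung-cancer))  ;; cough if cold or lung cancer (or both)

cough
```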

This program generates random conditions for a patient in a doctor's office. It first specifies the base rates of two diseases the patient could have: lung cancer is rare while a cold is common, and there is an independent chance of having each disease. The program then specifies a process for generating a common symptom of these diseases -- an effect with two possible causes: The patient coughs if they have a cold or lung cancer (or both). Here is a slightly more complex causal model:
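A sketch of such a model, consistent with the description that follows (the 50% cough and 30% fever rates for a cold are stated below; the other diseases, symptoms, and rates are illustrative assumptions):

```scheme
(define lung-cancer (flip 0.01))
(define TB (flip 0.005))
(define cold (flip 0.2))
(define stomach-flu (flip 0.1))
(define other (flip 0.1))  ;; catch-all disease category

(define cough
  (or (and cold (flip 0.5))         ;; 50% of colds cause a cough
      (and lung-cancer (flip 0.3))
      (and TB (flip 0.7))
      (and other (flip 0.01))))

(define fever
  (or (and cold (flip 0.3))         ;; 30% of colds cause a fever
      (and stomach-flu (flip 0.5))
      (and TB (flip 0.2))
      (and other (flip 0.01))))

(define chest-pain
  (or (and lung-cancer (flip 0.5))
      (and TB (flip 0.5))
      (and other (flip 0.01))))

(define shortness-of-breath
  (or (and lung-cancer (flip 0.5))
      (and TB (flip 0.5))
      (and other (flip 0.01))))

(list "cough" cough
      "fever" fever
      "chest-pain" chest-pain
      "shortness-of-breath" shortness-of-breath)
```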

Now there are four possible diseases and four symptoms. Each disease causes a different pattern of symptoms. The causal relations are now probabilistic: Only some patients with a cold have a cough (50%), or a fever (30%). There is also a catch-all disease category "other", which has a low probability of causing any symptom. *Noisy logical* functions, or functions built from `and`, `or`, and `flip`, provide a simple but expressive way to describe probabilistic causal dependencies between Boolean (true-false valued) variables.

When you run the above code, the program generates a list of symptoms for a hypothetical patient. Most likely all the symptoms will be false, as (thankfully) each of these diseases is rare. Experiment with running the program multiple times. Now try modifying the `define` statement for one of the diseases, setting it to be true, to simulate only patients known to have that disease. For example, replace `(define lung-cancer (flip 0.01))` with `(define lung-cancer true)`. Run the program several times to observe the characteristic patterns of symptoms for that disease.

## Prediction, Simulation, and Probabilities

Suppose that we flip two fair coins, and return the list of their values:
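For example:

```scheme
(list (flip) (flip))
```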
How can we predict the return value of this program? For instance, how likely is it that we will see `(#t #f)`? A **probability** is a number between 0 and 1 that expresses the answer to such a question: it is a degree of belief that we will see a given outcome, such as `(#t #f)`. The probability of an event <math>A</math> (such as the above program returning `(#t #f)`) is usually written as: <math>P(A)</math>.

As we did above, we can sample many times and examine the histogram of return values:
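A sketch of such a program:

```scheme
(define (random-pair) (list (flip) (flip)))

(hist (repeat 1000 random-pair) "return values")
```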
(The parentheses around `random-pair` make an object that can be sampled from many times in a single execution of the program; this is described below, but the details aren't important for now.)

We see by examining this histogram that `(#t #f)` comes out about 0.25 of the time. We may define the **probability** of a return value to be the fraction of times (in the long run) that this value is returned from evaluating the program—then the probability of `(#t #f)` from the above program is 0.25.

Even for very complicated programs we can predict the probability of different outcomes by simulating (sampling from) the program. It is also often useful to compute these probabilities directly by reasoning about the sampling process.

### Product Rule

In the above example we take three steps to compute the output value: we sample from the first `(flip)`, then from the second, then we make a list from their values. To make this more clear let us re-write the program as:
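A sketch of the rewritten program:

```scheme
(define A (flip))      ;; first random choice
(define B (flip))      ;; second random choice
(define C (list A B))  ;; combine the two values

C
```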
We can directly observe (as we did above) that the probability of `#t` for `A` is 0.5, and the probability of `#f` from `B` is 0.5. Can we use these two probabilities to arrive at the probability of 0.25 for the overall outcome `C=(#t #f)`? Yes, using the *product rule* of probabilities:

The probability of two random choices is the product of their individual probabilities.

The probability of several random choices together is often called the *joint probability* and written as <math>P(A,B)</math>.
Since the first and second random choices must each have their specified values in order to get `(#t #f)` in the example, the joint probability is their product: 0.25.

We must be careful when applying this rule, since the probability of a choice can depend on the probabilities of previous choices. For instance, compute the probability of `(#t #f)` resulting from this program:
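A sketch of a program with dependent choices (the particular weights are illustrative assumptions):

```scheme
(define A (flip))
(define B (if A (flip 0.3) (flip 0.7)))  ;; B's distribution depends on A

(list A B)
```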

In general, the joint probability of two random choices <math>A</math> and <math>B</math> made sequentially, in that order, can be written as <math>P(A,B) = P(A) P(B|A)</math>. This is read as the product of the probability of <math>A</math> and the probability of "<math>B</math> given <math>A</math>", or "<math>B</math> conditioned on <math>A</math>" -- that is, the probability of making choice <math>B</math> given that choice <math>A</math> has been made in a certain way. Only when the second choice is independent of the first choice does this expression reduce to a product of the simple probabilities of each choice individually: <math>P(A,B) = P(A) P(B)</math>.

What is the relation between <math>P(A,B)</math> and <math>P(B,A)</math>, the joint probability of the same choices written in the opposite order? The only logically consistent definitions of probability require that these two probabilities be equal, so <math>P(A) P(B|A) = P(B) P(A|B)</math>. This is the basis of *Bayes' theorem*, which we will encounter later.

### Sum Rule or Marginalization

Now let's consider an example where we can't determine from the overall return value the sequence of random choices that were made:
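A sketch of such a program:

```scheme
(or (flip) (flip))
```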
We can sample from this program and determine that the probability of returning `#t` is about 0.75.

We cannot simply use the product rule to determine this probability because we don't know the sequence of random choices that led to this return value.
However we can notice that the program will return true if the two component choices are `#t,#t`, or `#t,#f`, or `#f,#t`. To combine these possibilities we use another rule for probabilities:

If there are two alternative sequences of choices that lead to the same return value, the probability of this return value is the sum of the probabilities of the sequences.

We can write this using probability notation as: <math>P(A) = \sum_{B} P(A,B)</math> , where we view <math>A</math> as the final value and <math>B</math> as the sequence of random choices on the way to that value.
Using the product rule we can determine that the probability in the example above is 0.25 for each sequence that leads to return value `#t`, then, by the sum rule, the probability of `#t` is 0.25+0.25+0.25=0.75.

Using the sum rule to compute the probability of a final value is called *marginalization*. From the point of view of sampling processes marginalization is simply ignoring (or not looking at) intermediate random values that are created on the way to a final return value. From the point of view of directly computing probabilities, marginalization is summing over all the possible "histories" that could lead to a return value.

## Building Functions and Distributions: `lambda`

The power of lambda calculus as a model of computation comes from the ability to make new functions. To do so, we use the `lambda` primitive. For example, we can construct a function that doubles any number it is applied to:
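For instance:

```scheme
(define double (lambda (x) (+ x x)))

(double 3)   ;; => 6
```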

The general form of a lambda expression is:

(lambda arguments body)

The first sub-expression of the lambda, the arguments, is a list of symbols that tells us what the inputs to the function will be called; the second sub-expression, the body, tells us what to do with these inputs. The value which results from a lambda expression is called a *compound procedure*. When a compound procedure is applied to input values (e.g. when `double` was applied to `3`) we imagine identifying (also called *binding*) the argument variables with these inputs, then evaluating the body. As another simple example, consider this function:
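Here is another simple (illustrative) compound procedure, this time with two arguments:

```scheme
(define sum-of-squares (lambda (x y) (+ (* x x) (* y y))))

(sum-of-squares 3 4)   ;; => 25
```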

We can also use `lambda` to construct more complex stochastic functions from the primitive ones. Here is a stochastic function that will only sometimes double its input:
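A sketch:

```scheme
(define noisy-double (lambda (x) (if (flip) (+ x x) x)))

(noisy-double 3)   ;; returns 6 or 3, each with probability 0.5
```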

In lambda calculus we can build procedures that manipulate any kind of value -- even other procedures. Here we define a function `twice` which takes a procedure and returns a new procedure that applies the original twice:
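A sketch of `twice`:

```scheme
(define double (lambda (x) (+ x x)))

(define twice (lambda (f) (lambda (x) (f (f x)))))

((twice double) 3)   ;; => 12
```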

This ability to make *higher-order* functions is what makes the lambda calculus a universal model of computation.

Since we often want to assign names to functions, `(define (foo x) ...)` is short ("syntactic sugar") for `(define foo (lambda (x) ...))`. For example we could have written the previous example:
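Using the sugared syntax:

```scheme
(define (double x) (+ x x))

(define (twice f) (lambda (x) (f (f x))))

((twice double) 3)
```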

A lambda expression with an empty argument list, `(lambda () ...)`, is called a *thunk*: this is a function that takes no input arguments. If we apply a thunk (to no arguments!) we get a return value back, for example `(flip)`, and if we do this many times we can figure out the marginal probability of each return value. Thus a thunk is an object that represents a whole *probability distribution*. By using higher-order functions we can construct and manipulate probability distributions. A good example comes from coin flipping....

### Example: Flipping Coins

The following program defines a fair coin, and flips it 20 times:
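A sketch (the symbols `h` and `t` for heads and tails follow the convention used later in this section):

```scheme
(define (fair-coin) (if (flip 0.5) 'h 't))

(repeat 20 fair-coin)
```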

This program defines a "trick" coin that comes up heads most of the time (95%), and flips it 20 times:
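A sketch:

```scheme
(define (trick-coin) (if (flip 0.95) 'h 't))

(repeat 20 trick-coin)
```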

*Note on Church syntax:*

A common mistake when defining names for new functions and using them with higher-order functions like `repeat` is to confuse the name of a thunk with the name of a variable that stands for the output of a single function evaluation. For instance, why doesn't this work?
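A sketch of the mistake:

```scheme
(define trick-coin-1 (if (flip 0.95) 'h 't))

(repeat 20 trick-coin-1)   ;; fails: trick-coin-1 is a single value, not a procedure
```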
The higher-order function `repeat` requires as input a thunk, a procedure (or function) with no arguments, as returned by `lambda` in the examples above. Consider the difference in what is returned by these two code fragments:
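A sketch of the two fragments:

```scheme
;; a variable bound to the result of a single evaluation:
(define trick-coin-1 (if (flip 0.95) 'h 't))
trick-coin-1

;; a thunk that can be passed to repeat:
(define trick-coin-2 (lambda () (if (flip 0.95) 'h 't)))
(repeat 20 trick-coin-2)
```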
`trick-coin-1` names a variable that is defined to be the result of evaluating the `(if ...)` expression a single time (either `h` or `t`), while below `trick-coin-2` names a thunk that can serve as the input to the higher-order function `repeat`.
The difference between these two programs becomes particularly subtle when using the `(define (name) ... )` syntax. Simply putting the name to be defined in parentheses turns a variable definition into a thunk definition:
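A sketch of the sugared thunk definition:

```scheme
;; the parentheses around the name make this a procedure definition:
(define (trick-coin-2) (if (flip 0.95) 'h 't))

(repeat 20 trick-coin-2)
```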
To Church the last two definitions of `trick-coin-2` are the same -- both output a thunk -- although superficially the last one looks more similar to the variable definition that assigns `trick-coin-1` to a single value of `h` or `t`.

The higher-order function `make-coin` takes in a weight and outputs a function (a thunk) describing a coin with that weight. Then we can use `make-coin` to make the coins above, or others.
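A sketch of `make-coin` (the 0.25 weight is an illustrative assumption):

```scheme
(define (make-coin weight)
  (lambda () (if (flip weight) 'h 't)))

(define fair-coin (make-coin 0.5))
(define trick-coin (make-coin 0.95))
(define bent-coin (make-coin 0.25))

(repeat 10 bent-coin)
```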

We can also define a higher-order function that takes a "coin" and "bends it":
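A sketch of `bend` (the particular re-weighting constants are illustrative assumptions):

```scheme
(define (make-coin weight)
  (lambda () (if (flip weight) 'h 't)))

(define (bend coin)
  (lambda ()
    (if (equal? (coin) 'h)
        ((make-coin 0.7))     ;; heads: flip a 0.7-weighted coin
        ((make-coin 0.1)))))  ;; tails: flip a 0.1-weighted coin

(define bent-coin (bend (make-coin 0.5)))

(repeat 10 bent-coin)
```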
Make sure you understand how the `bend` function works! Why is there an "extra" pair of parentheses outside each `make-coin` statement?

## Persistent Randomness: `mem`

Consider building a model with a set of objects that each have a randomly chosen property. For instance, describing the eye colors of a set of people:
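A sketch of the (broken) model described next (the particular colors are illustrative assumptions):

```scheme
(define eye-color
  (lambda (person) (uniform-draw '(blue green brown))))

(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))
```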

The results of this generative process are clearly wrong: Bob's eye
color can change each time we ask about it! What we want is a model in
which eye color is random, but *persistent.* We can do this using
another Church primitive: `mem`. `mem` is a higher order
function: it takes a procedure and produces a *memoized* version of
the procedure.
When a stochastic procedure is memoized, it will
sample a random value on the **first** call, and then return that
same value when called with the same arguments thereafter. The
resulting memoized procedure has a persistent value within each "run"
of the generative model. For instance consider the equality of two
flips, and the equality of two memoized flips:
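A sketch:

```scheme
;; two fresh flips agree only half the time:
(equal? (flip) (flip))

;; two calls to a memoized flip always agree:
(define mem-flip (mem flip))
(equal? (mem-flip) (mem-flip))   ;; => #t
```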

Now returning to the eye color example, we can represent the notion that eye color is random, but each person has a fixed eye color.
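A sketch using `mem`:

```scheme
(define eye-color
  (mem (lambda (person) (uniform-draw '(blue green brown)))))

(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))   ;; the first and third entries now always match
```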

This type of modeling is called *random world* style (McAllester,
Milch, Goodman, 2008). Note that we don't have to specify ahead of
time the people whose eye color we will ask about: the distribution on
eye colors is implicitly defined over the infinite set of possible
people, but only constructed "lazily" when needed. Memoizing
stochastic functions thus provides a powerful toolkit to represent and
reason about an unbounded set of properties of an unbounded set of objects.
For instance, here we define a function `flip-n` that encodes the
outcome of the <math>n</math>th flip of a particular coin:
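A sketch of `flip-n`:

```scheme
(define flip-n
  (mem (lambda (n) (flip))))

(list (flip-n 1) (flip-n 12) (flip-n 1) (flip-n 12))   ;; repeated indices repeat their values
```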

There are in principle a countably infinite number of such flips, each independent of all the others. But the outcome of the <math>n</math>th flip -- the 1st, the 12th, the 47th, or the 1548th -- once determined, will always have the same value. Here we define a function that encodes the outcome of the <math>n</math>th flip of the <math>m</math>th coin, a doubly infinite set of properties:
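A sketch of the doubly indexed version (the name is an illustrative assumption):

```scheme
(define flip-n-coin-m
  (mem (lambda (n m) (flip))))

(list (flip-n-coin-m 1 1) (flip-n-coin-m 12 1) (flip-n-coin-m 1 2) (flip-n-coin-m 1 1))
```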

In traditional computer science memoization is an important technique for optimizing programs by avoiding repeated work (see Memoization as Optimization). In the probabilistic setting, as in Church, memoization goes further: it actually affects the *expressivity* of the language.

### Example: Bayesian Tug of War

Imagine a game of tug of war, where each person may be strong or weak, and may be lazy or not on each match. If a person is lazy they only pull with half their strength. The team that pulls hardest will win. We assume that half of all people are strong, and that on any match, each person has a 1 in 3 chance of being lazy. This program runs a tournament between several teams, mixing up players across teams. Can you guess who is strong or weak, looking at the tournament results?
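A sketch of such a tournament (the numeric strengths and team rosters are illustrative assumptions; the 1-in-3 laziness rate and half-strength rule are stated above):

```scheme
(define strength (mem (lambda (person) (if (flip) 10 5))))  ;; half of people are strong

(define (lazy person) (flip (/ 1 3)))  ;; fresh draw each match: 1 in 3 chance of laziness

(define (total-pulling team)
  (apply + (map (lambda (person)
                  (if (lazy person)
                      (/ (strength person) 2)  ;; lazy people pull with half strength
                      (strength person)))
                team)))

(define (winner team1 team2)
  (if (< (total-pulling team1) (total-pulling team2)) team2 team1))

(list "match 1:" (winner '(alice bob) '(sue tom))
      "match 2:" (winner '(alice bob) '(sue tom))
      "match 3:" (winner '(alice sue) '(bob tom))
      "match 4:" (winner '(alice tom) '(bob sue)))
```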

Notice that `strength` is memoized because this is a property of a person true across many matches, while `lazy` isn't. Each time you run this program, however, a new "random world" will be created: people's strengths will be randomly re-generated and used in all the matches.

### Useful Higher-Order Functions

In the above example we have used several useful utility functions. `map` is a higher-order function that takes a procedure (here built using `lambda`) and applies it to each element of a list (here the list of team members). After getting the list of how much each team member is pulling, we want to add them up. The procedure `+` expects to be applied directly to some numbers, as in `(+ 1 2 3 4)`, not to a list of numbers; `(+ (1 2 3 4))` would produce an error. So we use the higher-order function `apply`, which applies a procedure to a list as if the list were direct arguments to the procedure. The standard functions `sum` and `product` can be easily defined in terms of `(apply + ...)` and `(apply * ...)`, respectively. Just to illustrate this:
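A sketch (the particular numbers are illustrative):

```scheme
(define my-list '(3 5 2047))

(list "map:"   (map (lambda (x) (+ x 1)) my-list)   ;; => (4 6 2048)
      "apply:" (apply + my-list)                    ;; => 2055
      "sum:"   (sum my-list))                       ;; => 2055
```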

Higher-order functions like `repeat`, `map`, `apply` (or `sum`) can be quite useful. Here we use them to visualize the number of heads we expect to see if we flip a weighted coin (weight = 0.8) 10 times. We'll repeat this experiment 1000 times and then use `hist` to visualize the results. Try varying the coin weight or the number of repetitions to see how the expected distribution changes.
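A sketch of this experiment:

```scheme
(define (biased-flip) (if (flip 0.8) 1 0))  ;; weighted coin: heads counted as 1

(define (num-heads) (sum (repeat 10 biased-flip)))

(hist (repeat 1000 num-heads) "number of heads in 10 flips of a 0.8 coin")
```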

The `map` higher-order function can be used to map a function of more than one argument over multiple lists, element by element. For example, here is the MATLAB "dot-star" function (or ".*") written using `map`:
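A sketch of dot-star:

```scheme
(define (dot-star xs ys) (map * xs ys))

(dot-star '(1 2 3) '(4 5 6))   ;; => (4 10 18)
```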

- ↑ Cardone and Hindley, 2006. History of Lambda-calculus and Combinatory Logic. In Gabbay and Woods (eds.), *Handbook of the History of Logic*, vol. 5. Elsevier.
- ↑ Wikipedia article on LISP.
- ↑ Wikipedia article on Scheme.
- ↑ The branching construct, `if`, is strictly not a function, because it does not evaluate all of its arguments, but instead *short-circuits*, evaluating only the second or third. It has a value like any other function.