# Generative Models (ESSLLI)



## Defining Simple Generative Models Using the <math>\lambda</math>-calculus, LISP, and Church

As our formal model of computation we start with the <math>\lambda</math>-calculus, and its embodiment in the LISP family of programming languages. The <math>\lambda</math>-calculus is a formal system invented by Alonzo Church in the 1920s.^{[1]} Church introduced the <math>\lambda</math>-calculus as a model and formalization of computation, that is, as a way of formalizing the notion of an effectively computable function. The <math>\lambda</math>-calculus is a *universal* model of computation—it is conjectured to be equivalent to all other notions of classical computation. (The <math>\lambda</math>-calculus was shown to have the same computational power as the Turing machine, and vice versa, by Alan Turing in the famous paper which introduced the Turing machine.) It is remarkable that the <math>\lambda</math>-calculus is universal because it has only two basic operations: creating and applying functions.

In 1958 John McCarthy introduced LISP (**LIS**t **P**rocessing), a programming language based on the <math>\lambda</math>-calculus.^{[2]} Scheme is a variant of LISP, developed by Guy L. Steele and Gerald Jay Sussman, with particularly simple syntax and semantics.^{[3]} (For a simple tutorial on the Scheme language, see [1].) We will use Scheme-style notation for the <math>\lambda</math>-calculus in this tutorial. The Church programming language—named in honor of Alonzo Church—is a generalization of Scheme which introduces the notion of probabilistic computation to the language. In the next few sections of this tutorial, we introduce the basics of Church programming and show how to use it to build generative models.

The <math>\lambda</math>-calculus formalizes the notion of computation using *functions.* Computation is performed in the <math>\lambda</math>-calculus by applying functions to arguments to compute results. Function application in Church looks like this:
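For example, applying the addition function to two numbers:

```scheme
(+ 1 1)
```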

Here we see the function `+` applied to two arguments: `1` and `1`. Note that in Church the name of the function comes *first*. This is sometimes called *Polish notation*. If you run this simple program you will see the resulting *value* of this function application. `+` is a *deterministic* function. Church extends Scheme's set of primitive deterministic functions with a set of random primitives implementing *random variables.* In Church these random primitive functions are called *Exchangeable Random Procedures* (XRPs). Applying an XRP results in a *sample* from the distribution over the random variable defined by that XRP. The simplest XRP is `flip`, which results in a (possibly biased) coin toss. Note that we will label the two sides of this coin `true` and `false` (which appear as `#t` and `#f` in the output) rather than *heads* and *tails*.
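For example, a single application of `flip`:

```scheme
(flip)
```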

Run this program a few times. You will get back a different sample on each execution. Also, notice the parentheses around `flip`. These are meaningful; they tell Church that you are asking for an application of the XRP `flip`. Delete the parentheses and see what happens. The giant object that is returned is the `flip` *procedure* object. Procedure is just another word for an actual implementation of some function.^{[4]} In Church, each time you run a program you get a sample from the random variable that the program defines. If you run the program many times, and collect the values in a histogram, you will approximate the probability distribution that the program defines over the random variable.
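To see the distribution emerge, we can collect many samples (the label passed to `hist` is illustrative):

```scheme
(hist (repeat 1000 flip) "Flips")
```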

Here we have used the `repeat` procedure which takes a number of repetitions, <math>K</math>, and a distribution (in this case `flip`) and returns a set of <math>K</math> samples from that distribution. We have used the `hist` procedure to display the results of taking 1000 samples from `flip`. As you can see, the result is an approximately uniform distribution over `true` and `false`.

An important idea here is that `flip` can be thought of in two different ways. From one perspective `flip` is a procedure which computes samples from a fair coin. That is, it is a *sampler* from a certain probability distribution. From another perspective, `flip` is *itself* a characterization of the distribution over `true` and `false`. When we think about probabilistic programs we will often move back and forth between these two views, emphasizing either the sampling perspective or the distributional perspective. (With suitable restrictions this duality is complete: any Church program implicitly represents a distribution and any distribution can be represented by a Church program. See The Meaning of Probabilistic Programs for more details on this duality.)

A Church program is syntactically composed of nested *expressions.* Roughly speaking an expression is either a simple value, or anything between parentheses `()`. In a deterministic LISP, like Scheme, all expressions have a single fixed value which is computed by a process known as *evaluation*. For example, consider the following more complex expression:
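For instance (the particular values compared here are illustrative):

```scheme
(if (equal? (list 1 2) (list 1 2))
    'some-symbol
    'other-symbol)
```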

This expression is composed of an `if` conditional and several sub-expressions.
(The branching expression, `if`, while strictly not a function (because it does not evaluate all of its arguments, but instead *short-circuits*), still has a value like any other function.)
`equal?` is a function which checks for *structural* equality between objects (as opposed to checking that they are the exact same object occupying the same spot in memory). It is a *predicate*—a function which returns `true` or `false`. By convention in Church, predicates end in the `?` character. The value of the `if` expression is the *symbol* `some-symbol`. Notice the single quote in front of `some-symbol`; this tells Church to treat what follows the quote as a literal sequence of characters. In this case it tells Church that `some-symbol` is a symbol and not a variable which needs to be de-referenced. Note that symbols in Church are a different data type from *strings*, which look like this: `"my-string"`.

If you run the `(if (equal? ...))` program a thousand times, you will get back the same answer, because it is deterministic. In Church, by contrast, each *evaluation* of an expression can in general return a different value: every evaluation of an expression is interpreted as sampling from a random variable. This is important because, as we will see below, it means Church allows us to define distributions over unbounded collections of random variables. So far we have only seen the unbiased form of `flip`. `flip` also has a weighted version which can flip biased coins. For example, `(flip .7)` flips a coin that is weighted <math>.7</math> `true` and <math>.3</math> `false`.
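For example, an `if` expression whose condition is itself a random `flip` returns a different symbol from evaluation to evaluation (a minimal illustration):

```scheme
(if (flip 0.7) 'heads 'tails)
```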

We often want to name objects in our programs so that they can be reused. This can be done with the `define` statement. `define` looks like this:

(define variable-name expression)

`variable-name` is a *symbol* which is bound to the value that `expression` evaluates to. When variables themselves are evaluated, they return the value they have been bound to.
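For example, the following defines a variable and sometimes evaluates it (the `'heads` alternative is illustrative):

```scheme
(define some-variable 'some-symbol)
(if (flip) 'heads some-variable)
```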

If you run this program several times you will notice that it sometimes returns the symbol `some-symbol`. This is because when `flip` returns `false`, the variable `some-variable` is evaluated; since it is bound to `some-symbol`, that value is returned.

### Example: Causal Models in Medical Diagnosis

Generative knowledge is often *causal* knowledge. As an example of how causal knowledge can be encoded in Church expressions, consider a simplified medical scenario:
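A minimal version of such a model might look like this (the base rate for a cold is illustrative):

```scheme
(define lung-cancer (flip 0.01))  ; lung cancer is rare
(define cold (flip 0.2))          ; a cold is common (rate illustrative)
(define cough (or cold lung-cancer))
cough
```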

This program generates random conditions for a patient in a doctor's office. It first specifies the base rates of two diseases the patient could have: lung cancer is rare while a cold is common, and there is an independent chance of having each disease. The program then specifies a process for generating a common symptom of these diseases -- an effect with two possible causes: The patient coughs if they have a cold or lung cancer (or both). Here is a slightly more complex causal model:
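One way this model might be written (the disease and symptom names, and all probabilities not stated in the text, are illustrative):

```scheme
;; Diseases and base rates (rates beyond lung cancer's are illustrative):
(define lung-cancer (flip 0.01))
(define TB (flip 0.005))
(define cold (flip 0.2))
(define stomach-flu (flip 0.1))
(define other (flip 0.1))  ; catch-all "something else"

;; Symptoms, as noisy logical functions of the diseases:
(define cough
  (or (and cold (flip 0.5))         ; only some patients with a cold cough
      (and lung-cancer (flip 0.3))
      (and TB (flip 0.7))
      (and other (flip 0.01))))
(define fever
  (or (and cold (flip 0.3))         ; some patients with a cold have a fever
      (and stomach-flu (flip 0.5))
      (and TB (flip 0.2))
      (and other (flip 0.01))))
(define chest-pain
  (or (and lung-cancer (flip 0.5))
      (and TB (flip 0.5))
      (and other (flip 0.01))))
(define shortness-of-breath
  (or (and lung-cancer (flip 0.5))
      (and TB (flip 0.5))
      (and other (flip 0.01))))

(list cough fever chest-pain shortness-of-breath)
```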

Now there are four possible diseases and four symptoms. Each disease causes a different pattern of symptoms. The causal relations are now probabilistic: Only some patients with a cold have a cough (50%), or a fever (30%). There is also a catch-all disease category "something else", which has a low probability of causing any symptom. *Noisy logical* functions, or functions built from `and`, `or`, and `flip`, provide a simple but expressive way to describe probabilistic causal dependencies between Boolean (true-false valued) variables.

When you run the above code, the program generates a list of symptoms for a hypothetical patient. Most likely all the symptoms will be false, as (thankfully) each of these diseases is rare. Experiment with running the program multiple times. Now try modifying the `define` statement for one of the diseases, setting it to be true, to simulate only patients known to have that disease. For example, replace `(define lung-cancer (flip 0.01))` with `(define lung-cancer true)`. Run the program several times to observe the characteristic patterns of symptoms for that disease.

### Building Functions and Distributions: `lambda`

The power of lambda calculus as a model of computation comes from the ability to make new functions. To do so, we use the `lambda` primitive. For example, we can construct a function that doubles any number it is applied to:
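A doubling function might be written:

```scheme
(define double (lambda (x) (+ x x)))
(double 3)
```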

The general form of a lambda expression is:

(lambda arguments body)

The first sub-expression of the lambda, the arguments, is a list of symbols that tells us what the inputs to the function will be called; the second sub-expression, the body, tells us what to do with these inputs.

We can also use `lambda` to construct more complex stochastic functions from the primitive ones. Here is a stochastic function that will only sometimes double its input:
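For example (returning the input unchanged on the other branch):

```scheme
(define noisy-double
  (lambda (x) (if (flip) (+ x x) x)))
(noisy-double 3)
```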

In lambda calculus we can build procedures that manipulate any kind of value -- even other procedures. Here we define a function `twice` which takes a procedure and returns a new procedure that applies the original twice:
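A sketch of `twice`, applied here to a doubling function:

```scheme
(define double (lambda (x) (+ x x)))
(define twice
  (lambda (f) (lambda (x) (f (f x)))))
((twice double) 3)
```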

This ability to make *higher-order* functions is what makes the lambda calculus a universal model of computation.

Since we often want to assign names to functions, `(define (foo x) ...)` is short (syntactic sugar) for `(define foo (lambda (x) ...))`.

A lambda expression with an empty argument list, `(lambda () ...)`, is called a *thunk*. Using thunks we can define objects representing stochastic generating processes. A familiar example comes in coin flipping.

### Example: Flipping Coins

The following program defines a fair coin, and flips it 20 times:
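A sketch (using the symbols `'h` and `'t` for the two faces; the labels are illustrative):

```scheme
(define fair-coin (lambda () (if (flip 0.5) 'h 't)))
(repeat 20 fair-coin)
```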

This program defines a "trick" coin that comes up heads most of the time (95%), and flips it 20 times:
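A sketch along the same lines (face labels illustrative):

```scheme
(define trick-coin (lambda () (if (flip 0.95) 'h 't)))
(repeat 20 trick-coin)
```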

The higher-order function `make-coin` takes in a weight and outputs a function (a thunk) describing a coin with that weight. Then we can use `make-coin` to make the coins above, or others.
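For instance (face labels illustrative):

```scheme
(define make-coin
  (lambda (weight) (lambda () (if (flip weight) 'h 't))))
(define fair-coin (make-coin 0.5))
(define trick-coin (make-coin 0.95))
(repeat 20 trick-coin)
```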

### Persistent Randomness: `mem`

Consider building a model with a set of objects that each have a randomly chosen property. For instance, describing the eye colors of a set of people:
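One attempt might look like this (the color set is illustrative; `uniform-draw` samples uniformly from a list):

```scheme
(define eye-color
  (lambda (person) (uniform-draw '(blue green brown))))
(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))
```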

The results of this generative process are clearly wrong: Bob's eye
color can change each time we ask about it! What we want is a model in
which eye color is random, but *persistent.* We can do this using
another Church primitive: `mem`. `mem` is a higher order
function: it takes a procedure and produces a *memoized* version of
the procedure.
When a stochastic procedure is memoized, it will
sample a random value on the **first** call, and then return that
same value when called with the same arguments thereafter. The
resulting memoized procedure has a persistent value within each "run"
of the generative model. For instance consider the equality of two
flips, and the equality of two memoized flips:
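For example:

```scheme
(equal? (flip) (flip))          ; true only about half the time

(define mem-flip (mem flip))
(equal? (mem-flip) (mem-flip))  ; always true
```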

Now returning to the eye color example, we can represent the notion that eye color is random, but each person has a fixed eye color.
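Wrapping the procedure in `mem` gives each person a persistent color (color set illustrative); now the first and third draws always agree:

```scheme
(define eye-color
  (mem (lambda (person) (uniform-draw '(blue green brown)))))
(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))
```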

This type of modeling is called *random world* style (McAllester,
Milch, Goodman, 2008). Note that we don't have to specify ahead of
time the people whose eye color we will ask about: the distribution on
eye colors is implicitly defined over the infinite set of possible
people, but only constructed "lazily" when needed.

In traditional computer science memoization is an important technique
for optimizing programs by avoiding repeated work (see
Memoization as Optimization). In the probabilistic setting, such
as in Church, this is taken further: memoization actually affects
the *expressivity* of the language.

### Example: Bayesian Tug of War

Imagine a game of tug of war, where each person may be strong or weak, and may be lazy or not on each match. If a person is lazy they only pull with half their strength. The team that pulls hardest will win. We assume that half of all people are strong, and that on any match, each person has a 1 in 3 chance of being lazy. This program runs a tournament between several teams, mixing up players across teams. Can you guess who is strong or weak, looking at the tournament results?
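A sketch of such a tournament (player names and the strength values 10 and 5 are illustrative):

```scheme
;; Strength is a persistent property of a person:
(define strength (mem (lambda (person) (if (flip) 10 5))))

;; Laziness is re-sampled on every match:
(define lazy (lambda (person) (flip (/ 1 3))))

(define total-pulling
  (lambda (team)
    (apply +
           (map (lambda (person)
                  (if (lazy person)
                      (/ (strength person) 2)  ; lazy: pull with half strength
                      (strength person)))
                team))))

(define winner
  (lambda (team1 team2)
    (if (< (total-pulling team1) (total-pulling team2)) 'team2 'team1)))

(list (winner '(alice bob) '(sue tom))
      (winner '(alice bob) '(sue tom))
      (winner '(alice sue) '(bob tom))
      (winner '(alice sue) '(bob tom))
      (winner '(alice tom) '(bob sue))
      (winner '(alice tom) '(bob sue)))
```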

Notice that `strength` is memoized because this is a property of a person true across many matches, while `lazy` isn't. Each time you run this program, however, a new "random world" will be created: people's strengths will be randomly re-generated and used in all the matches.

### Useful Higher-Order Functions

In the above example we have used several useful utility functions. `map` is a higher-order function that takes a procedure (here built using `lambda`) and applies it to each element of a list (here the list of team members). After getting the list of how much each team member is pulling, we want to add them up. The procedure `+` expects to be applied directly to some numbers, as in `(+ 1 2 3 4)`, not to a list of numbers; `(+ (1 2 3 4))` would produce an error. So we use the higher-order function `apply`, which applies a procedure to a list as if the list were direct arguments to the procedure. The standard functions `sum` and `product` can be easily defined in terms of `(apply + ...)` and `(apply * ...)`, respectively. Just to illustrate this:
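For example (the particular numbers are illustrative):

```scheme
(define my-list '(3 5 2047))
(list (apply + my-list)   ; same as (+ 3 5 2047)
      (apply * my-list))  ; same as (* 3 5 2047)
```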

Higher-order functions like `repeat`, `map`, `apply` (or `sum`) can be quite useful. Here we use them to visualize the number of heads we expect to see if we flip a weighted coin (weight = 0.8) 10 times. We'll repeat this experiment 1000 times and then use `truehist` (a variant of `hist` for numerical values) to visualize the results. Try varying the coin weight or the number of repetitions to see how the expected distribution changes.
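A sketch of this experiment (the `truehist` label is illustrative):

```scheme
(define num-heads
  (lambda ()
    (apply + (map (lambda (b) (if b 1 0))
                  (repeat 10 (lambda () (flip 0.8)))))))
(truehist (repeat 1000 num-heads) "number of heads in 10 flips")
```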

- ↑ Cardone and Hindley, 2006. History of Lambda-calculus and Combinatory Logic. In Gabbay and Woods (eds.), *Handbook of the History of Logic*, vol. 5. Elsevier.
- ↑ Wikipedia article on LISP.
- ↑ Wikipedia article on Scheme.
- ↑ Technically two mathematical functions are the same if they define the same input-output mappings. Two procedures can still be different if they "work in different ways."