# Generative Models (ESSLLI)

## Defining Simple Generative Models Using the $\lambda$-calculus, LISP, and Church

As our formal model of computation we start with the $\lambda$-calculus and its embodiment in the LISP family of programming languages. The $\lambda$-calculus is a formal system which was invented by Alonzo Church in the 1930s. [1] Church introduced the $\lambda$-calculus as a model and formalization of computation, that is, as a way of formalizing the notion of an effectively computable function. The $\lambda$-calculus is a universal model of computation—it is conjectured to be equivalent to all other notions of classical computation (the $\lambda$-calculus was shown to have the same computational power as the Turing machine, and vice versa, by Alan Turing in the famous paper which introduced the Turing machine). It is remarkable that the $\lambda$-calculus is universal because it has only two basic operations: creating and applying functions.

In 1958 John McCarthy introduced LISP (LISt Processing), a programming language based on the $\lambda$-calculus.[2] Scheme is a variant of LISP developed by Guy L. Steele and Gerald Jay Sussman with particularly simple syntax and semantics. [3] (For a simple tutorial on the Scheme language, see [1].) We will use Scheme-style notation for the $\lambda$-calculus in this tutorial. The Church programming language—named in honor of Alonzo Church—is a generalization of Scheme which introduces the notion of probabilistic computation to the language. In the next few sections of this tutorial, we introduce the basics of Church programming and show how to use it to build generative models.

The $\lambda$-calculus formalizes the notion of computation using functions. Computation is performed in the $\lambda$-calculus by applying functions to arguments to compute results. Function application in Church looks like this:
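For example, applying the function + to the arguments 1 and 1:

```scheme
(+ 1 1)  ; → 2
```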

Here we see the function + applied to two arguments: 1 and 1. Note that in Church the name of the function comes first; this is sometimes called Polish notation. If you run this simple program above you will see the resulting value of this function application. + is a deterministic function. In Church, to the set of Scheme primitive deterministic functions, we add a set of random primitives implementing random variables. In Church these random primitive functions are called Exchangeable Random Procedures (XRPs). Application of an XRP results in a sample from the distribution over the random variable defined by that XRP. The simplest XRP is flip, which results in a (possibly biased) coin toss. Note that we will let this coin be labeled with true and false (which appear as #t and #f in the output) on its two sides rather than heads and tails.
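The simplest random program is a single application of flip:

```scheme
(flip)  ; returns #t or #f at random
```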

Run this program a few times. You will get back a different sample on each execution. Also, notice the parentheses around flip. These are meaningful; they tell Church that you are asking for an application of the XRP flip. Delete the parentheses and see what happens. The giant object that was returned is the flip procedure object. Procedure is just another word for an actual implementation of some function.[4] In Church, each time you run a program you get a sample from the random variable that the program defines. If you run the program many times, and collect the values in a histogram, you will reconstruct the probability distribution that the program defines over the random variable.
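For example, a sketch that collects 1000 samples from flip and displays them in a histogram (the histogram title is illustrative):

```scheme
(hist (repeat 1000 flip) "1000 fair coin flips")
```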

Here we have used the repeat procedure, which takes a number of repetitions, $K$, and a random procedure (in this case flip) and returns a list of $K$ samples from that procedure. We have used the hist procedure to display the results of taking 1000 samples from flip. As you can see, the result is an approximately uniform distribution over true and false.

An important idea here is that flip can be thought of in two different ways. From one perspective, flip is a procedure which computes samples from a fair coin. That is, it is a sampler from a certain probability distribution. From another perspective, flip is itself a characterization of the distribution over true and false. When we think about probabilistic programs we will often move back and forth between these two views, emphasizing either the sampling perspective or the distributional perspective. (With suitable restrictions this duality is complete: any Church program implicitly represents a distribution and any distribution can be represented by a Church program. See The Meaning of Probabilistic Programs for more details on this duality.)

A Church program is syntactically composed of nested expressions. Roughly speaking an expression is either a simple value, or anything between parentheses (). In a deterministic LISP, like Scheme, all expressions have a single fixed value which is computed by a process known as evaluation. For example, consider the following more complex expression:
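A sketch of such an expression (the true-branch symbol is illustrative):

```scheme
(if (equal? 1 2)   ; test: are 1 and 2 structurally equal?
    'equal         ; value if the test is true (illustrative symbol)
    'some-symbol)  ; value if the test is false
```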

This expression is composed of an if conditional and several sub-expressions. (The branching expression, if, while strictly not a function (because it does not evaluate all of its arguments, but instead short-circuits), still has a value like any other expression.) equal? is a function which checks for structural equality between objects (as opposed to checking that they are the exact same object occupying the same spot in memory). It is a predicate—a function which returns true or false. By convention in Church, predicates end in the ? character. The value of the if expression is the symbol some-symbol. Notice the single quote in front of some-symbol; this tells Church to treat what follows the quote as that literal sequence of characters. In this case it tells Church that some-symbol is a symbol and not a variable which needs to be de-referenced. Note that symbols in Church are a different data type from strings, which look like this "my-string".

If you run the (if (equal? ...)) program a thousand times, you will get back the same answer, because it is deterministic. In general in Church, on the other hand, each evaluation of an expression can return a different value: the value of an expression containing XRPs changes from evaluation to evaluation. In Church, every evaluation of an expression is interpreted as sampling from a random variable. This is important because, as we will see below, it means Church allows us to define distributions over unbounded collections of random variables. So far we have only seen the unbiased form of flip. flip also has a weighted version which can flip biased coins. For example, (flip .7) flips a coin that is weighted $.7$ true and $.3$ false.

We often want to name objects in our programs so that they can be reused. This can be done with the define statement. define looks like this:

(define variable-name expression)

variable-name is a symbol which is bound to the value that expression evaluates to. When variables themselves are evaluated, they return the value that they have been bound to.
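For example (the value of the true branch is illustrative):

```scheme
(define some-variable 'some-symbol)  ; bind some-variable to the symbol some-symbol

(if (flip)
    'heads          ; illustrative value for the true branch
    some-variable)  ; when flip returns false, evaluates to 'some-symbol
```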

If you run this program several times you will notice that it sometimes returns the symbol some-symbol. This is because when flip returns false, the variable some-variable is evaluated; since it is bound to some-symbol, that value is returned.

### Example: Causal Models in Medical Diagnosis

Generative knowledge is often causal knowledge. As an example of how causal knowledge can be encoded in Church expressions, consider a simplified medical scenario:
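A sketch of such a program (the lung-cancer base rate of 0.01 is the one used later in the text; the cold base rate of 0.2 is illustrative):

```scheme
(define lung-cancer (flip 0.01))  ; rare disease
(define cold (flip 0.2))          ; common disease (rate is illustrative)

(define cough (or cold lung-cancer))  ; cough if cold or lung cancer (or both)

cough
```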

This program generates random conditions for a patient in a doctor's office. It first specifies the base rates of two diseases the patient could have: lung cancer is rare while a cold is common, and there is an independent chance of having each disease. The program then specifies a process for generating a common symptom of these diseases -- an effect with two possible causes: The patient coughs if they have a cold or lung cancer (or both). Here is a slightly more complex causal model:
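A sketch of the larger model. The cold→cough rate of 0.5 and cold→fever rate of 0.3 follow the description in the text; the other disease names, symptom names, and probabilities are illustrative:

```scheme
(define lung-cancer (flip 0.01))
(define TB (flip 0.005))
(define stomach-flu (flip 0.1))
(define cold (flip 0.2))
(define other (flip 0.1))  ; catch-all "something else" category

(define cough
  (or (and cold (flip 0.5))        ; only some patients with a cold cough
      (and lung-cancer (flip 0.3))
      (and TB (flip 0.7))
      (and other (flip 0.01))))    ; catch-all rarely causes any symptom

(define fever
  (or (and cold (flip 0.3))
      (and stomach-flu (flip 0.5))
      (and TB (flip 0.2))
      (and other (flip 0.01))))

(define chest-pain
  (or (and lung-cancer (flip 0.5))
      (and TB (flip 0.5))
      (and other (flip 0.01))))

(define shortness-of-breath
  (or (and lung-cancer (flip 0.5))
      (and TB (flip 0.2))
      (and other (flip 0.01))))

(list "cough" cough
      "fever" fever
      "chest-pain" chest-pain
      "shortness-of-breath" shortness-of-breath)
```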

Now there are four possible diseases and four symptoms. Each disease causes a different pattern of symptoms. The causal relations are now probabilistic: only some patients with a cold have a cough (50%), or a fever (30%). There is also a catch-all disease category "something else", which has a low probability of causing any symptom. Noisy logical functions -- functions built from and, or, and flip -- provide a simple but expressive way to describe probabilistic causal dependencies between Boolean (true/false valued) variables.

When you run the above code, the program generates a list of symptoms for a hypothetical patient. Most likely all the symptoms will be false, as (thankfully) each of these diseases is rare. Experiment with running the program multiple times. Now try modifying the define statement for one of the diseases, setting it to be true, to simulate only patients known to have that disease. For example, replace (define lung-cancer (flip 0.01)) with (define lung-cancer true). Run the program several times to observe the characteristic patterns of symptoms for that disease.

### Building Functions and Distributions: lambda

The power of lambda calculus as a model of computation comes from the ability to make new functions. To do so, we use the lambda primitive. For example, we can construct a function that doubles any number it is applied to:
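Such a doubling function can be constructed and applied like this:

```scheme
(define double (lambda (x) (+ x x)))

(double 3)  ; → 6
```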

The general form of a lambda expression is:

(lambda arguments body)


The first sub-expression of the lambda, the arguments, is a list of symbols that tells us what the inputs to the function will be called; the second sub-expression, the body, tells us what to do with these inputs.

We can also use lambda to construct more complex stochastic functions from the primitive ones. Here is a stochastic function that will only sometimes double its input:
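A sketch of such a function:

```scheme
(define noisy-double
  (lambda (x)
    (if (flip) (+ x x) x)))  ; double only when the coin comes up true

(noisy-double 3)  ; returns 6 or 3, at random
```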

In lambda calculus we can build procedures that manipulate any kind of value -- even other procedures. Here we define a function twice which takes a procedure and returns a new procedure that applies the original twice:
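A sketch of twice, applied to the doubling function from above:

```scheme
(define double (lambda (x) (+ x x)))

(define twice
  (lambda (f)                 ; takes a procedure f...
    (lambda (x) (f (f x)))))  ; ...and returns a procedure applying f twice

((twice double) 3)  ; → 12
```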

This ability to make higher-order functions is what makes the lambda calculus a universal model of computation.

Since we often want to assign names to functions, (define (foo x) ...) is short (syntactic sugar) for (define foo (lambda (x) ...)).

A lambda expression with an empty argument list, (lambda () ...), is called a thunk. Using thunks we can define objects representing stochastic generating processes. A familiar example comes in coin flipping.

### Example: Flipping Coins

The following program defines a fair coin, and flips it 20 times:
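A sketch of this program (labeling the coin's sides with the symbols h and t):

```scheme
(define fair-coin (lambda () (if (flip 0.5) 'h 't)))

(repeat 20 fair-coin)  ; a list of 20 h/t symbols
```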

This program defines a "trick" coin that comes up heads most of the time (95%), and flips it 20 times:
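A sketch of the trick coin:

```scheme
(define trick-coin (lambda () (if (flip 0.95) 'h 't)))

(repeat 20 trick-coin)  ; mostly h
```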

The higher-order function make-coin takes in a weight and outputs a function (a thunk) describing a coin with that weight. Then we can use make-coin to make the coins above, or others.
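A sketch of make-coin (the 0.25-weight "bent" coin is an illustrative extra):

```scheme
(define make-coin
  (lambda (weight)
    (lambda () (if (flip weight) 'h 't))))  ; returns a thunk

(define fair-coin  (make-coin 0.5))
(define trick-coin (make-coin 0.95))
(define bent-coin  (make-coin 0.25))  ; illustrative extra coin

(repeat 20 bent-coin)
```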

### Persistent Randomness: mem

Consider building a model with a set of objects that each have a randomly chosen property. For instance, describing the eye colors of a set of people:
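A sketch of such a (flawed) model, assuming Church's uniform-draw primitive; the people and colors are illustrative:

```scheme
(define eye-color
  (lambda (person) (uniform-draw '(blue green brown))))

(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))  ; the two bob entries can disagree!
```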

The results of this generative process are clearly wrong: Bob's eye color can change each time we ask about it! What we want is a model in which eye color is random, but persistent. We can do this using another Church primitive: mem. mem is a higher-order function: it takes a procedure and produces a memoized version of that procedure. When a stochastic procedure is memoized, it samples a random value on the first call and then returns that same value when called with the same arguments thereafter. The resulting memoized procedure has a persistent value within each "run" of the generative model. For instance, consider the equality of two flips, and the equality of two memoized flips:
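For example:

```scheme
;; two fresh flips: equal only about half the time
(define a (flip))
(define b (flip))

;; two calls to a memoized flip: always equal
(define mem-flip (mem flip))

(list (equal? a b)                       ; #t or #f
      (equal? (mem-flip) (mem-flip)))    ; always #t
```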

Now returning to the eye color example, we can represent the notion that eye color is random, but each person has a fixed eye color.
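This is a one-line change: wrap the eye-color procedure in mem (people and colors are illustrative, as before):

```scheme
(define eye-color
  (mem (lambda (person) (uniform-draw '(blue green brown)))))

(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))  ; the two bob entries now always agree
```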

This type of modeling is called random world style (McAllester, Milch, Goodman, 2008). Note that we don't have to specify ahead of time the people whose eye color we will ask about: the distribution on eye colors is implicitly defined over the infinite set of possible people, but only constructed "lazily" when needed.

In traditional computer science, memoization is an important technique for optimizing programs by avoiding repeated work (see Memoization as Optimization). In the probabilistic setting, such as in Church, memoization is taken further: it actually affects the expressivity of the language.

### Example: Bayesian Tug of War

Imagine a game of tug of war, where each person may be strong or weak, and may be lazy or not on each match. If a person is lazy they only pull with half their strength. The team that pulls hardest will win. We assume that half of all people are strong, and that on any match, each person has a 1 in 3 chance of being lazy. This program runs a tournament between several teams, mixing up players across teams. Can you guess who is strong or weak, looking at the tournament results?
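A sketch of such a tournament (player names, strength values, and match-ups are illustrative):

```scheme
;; strength is memoized: a persistent property of each person
(define strength (mem (lambda (person) (if (flip) 10 5))))

;; lazy is not memoized: re-sampled on every match
(define lazy (lambda (person) (flip (/ 1 3))))

(define (total-pulling team)
  (apply +
         (map (lambda (person)
                (if (lazy person)
                    (/ (strength person) 2)  ; lazy: pull with half strength
                    (strength person)))
              team)))

(define (winner team1 team2)
  (if (< (total-pulling team1) (total-pulling team2)) team2 team1))

(list (winner '(alice bob) '(sue tom))
      (winner '(alice bob) '(sue tom))   ; a rematch
      (winner '(alice sue) '(bob tom))   ; mixed-up teams
      (winner '(alice tom) '(bob sue)))
```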

Notice that strength is memoized because this is a property of a person true across many matches, while lazy isn't. Each time you run this program, however, a new "random world" will be created: people's strengths will be randomly re-generated and used in all the matches.

### Useful Higher-Order Functions

In the above example we have used several useful utility functions. map is a higher-order function that takes a procedure (here built using lambda) and applies it to each element of a list (here the list of team members). After getting the list of how much each team member is pulling, we want to add them up. The procedure + expects to be applied directly to some numbers, as in (+ 1 2 3 4), not to a list of numbers; (+ (1 2 3 4)) would produce an error. So we use the higher-order function apply, which applies a procedure to a list as if the list were direct arguments to the procedure. The standard functions sum and product can be easily defined in terms of (apply + ...) and (apply * ...), respectively. Just to illustrate this:
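For example (defining sum and product exactly this way):

```scheme
(+ 1 2 3 4)           ; → 10
(apply + '(1 2 3 4))  ; → 10: apply spreads the list into arguments

(define (sum lst) (apply + lst))
(define (product lst) (apply * lst))

(sum '(1 2 3 4))      ; → 10
(product '(1 2 3 4))  ; → 24
```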

Higher-order functions like repeat, map, apply (or sum) can be quite useful. Here we use them to visualize the number of heads we expect to see if we flip a weighted coin (weight = 0.8) 10 times. We'll repeat this experiment 1000 times and then use truehist (a variant of hist for numerical values) to visualize the results. Try varying the coin weight or the number of repetitions to see how the expected distribution changes.
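A sketch of this experiment:

```scheme
;; count heads in 10 flips of a coin weighted 0.8 toward true
(define (num-heads)
  (sum (map (lambda (x) (if x 1 0))
            (repeat 10 (lambda () (flip 0.8))))))

;; repeat the experiment 1000 times and plot the counts
(truehist (repeat 1000 num-heads) "number of heads in 10 flips")
```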

1. Cardone and Hindley, 2006. History of Lambda-calculus and Combinatory Logic. In Gabbay and Woods (eds.), Handbook of the History of Logic, vol. 5. Elsevier.
2. Wikipedia article on LISP.
3. Wikipedia article on Scheme.
4. Technically two mathematical functions are the same if they define the same input-output mappings. Two procedures can still be different if they "work in different ways."