Generative Models

To return to the top level: Probabilistic Models of Cognition.
To see useful variations on the examples from this page: Generative Models: Variations.

Defining Simple Generative Models

As our formal model of computation we start with the <math>\lambda</math>-calculus, and its embodiment in the LISP family of programming languages. The <math>\lambda</math>-calculus is a formal system which was invented by Alonzo Church in the 1920s.[1] Church introduced the <math>\lambda</math>-calculus as a model and formalization of computation, that is, as a way of formalizing the notion of an effectively computable function. The <math>\lambda</math>-calculus is a universal model of computation—it is conjectured to be equivalent to all other notions of classical computation (the <math>\lambda</math>-calculus was shown to have the same computational power as the Turing machine, and vice versa, by Alan Turing in his famous paper which introduced the Turing machine). It is remarkable that the <math>\lambda</math>-calculus is universal because it has only two basic operations: creating and applying functions.

In 1958 John McCarthy introduced LISP (LISt Processing), a programming language based on the <math>\lambda</math>-calculus.[2] Scheme is a variant of LISP developed by Guy L. Steele and Gerald Jay Sussman with particularly simple syntax and semantics.[3] (For a simple tutorial on the Scheme language, see [1] or [2].) We will use Scheme-style notation for the <math>\lambda</math>-calculus in this tutorial. The Church programming language—named in honor of Alonzo Church—is a generalization of Scheme which introduces the notion of probabilistic computation to the language. In the next few sections of this tutorial, we introduce the basics of Church programming and show how to use it to build generative models.

The <math>\lambda</math>-calculus formalizes the notion of computation using functions. Computation is performed in the <math>\lambda</math>-calculus by applying functions to arguments to compute results. Function application in Church looks like this:
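For instance (this is the expression analyzed in the next paragraph):

(+ 1 1)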

Here we see the function + applied to two arguments: 1 and 1. Note that in Church the name of the function comes first and the parentheses go outside the function. This is sometimes called Polish notation. If you run this simple program above you will see the resulting value of this function application. Since + is a deterministic function you will always get the same return value if you run the program many times.

In Church, in addition to deterministic functions, we have a set of random functions implementing random choices. These random primitive functions are called Exchangeable Random Primitives (XRPs). Application of an XRP results in a sample from the probability distribution defined by that XRP. For example, the simplest XRP is flip which results in either true or false—it simulates a (possibly biased) coin toss. (Note that the return values true and false will look like this in the output: #t and #f.)
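For instance, a single application of flip:

(flip)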

Run this program a few times. You will get back a different sample on each execution. Also, notice the parentheses around flip. These are meaningful; they tell Church that you are asking for an application of the XRP flip. Delete the parentheses and see what happens. The object that was returned is the flip procedure object. Procedure is just another word for an actual implementation of some function. In Church, each time you run a program you get a sample by simulating the computations and random choices that the program specifies. If you run the program many times, and collect the values in a histogram, you can see what a typical sample looks like:
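A sketch of such a program; the histogram label is illustrative:

(hist (repeat 1000 flip) "flips")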

Here we have used the repeat procedure which takes a number of repetitions, <math>K</math>, and a random distribution (in this case flip) and returns a list of <math>K</math> samples from that distribution. We have used the hist procedure to display the results of taking 1000 samples from flip. As you can see, the result is an approximately uniform distribution over true and false.

An important idea here is that flip can be thought of in two different ways. From one perspective, flip is a procedure which returns a sample from a fair coin. That is, it's a sampler or simulator. From another perspective, flip is itself a characterization of the distribution over true and false. When we think about probabilistic programs we will often move back and forth between these two views, emphasizing either the sampling perspective or the distributional perspective. (With suitable restrictions this duality is complete: any Church program implicitly represents a distribution and any distribution can be represented by a Church program. See The Meaning of Probabilistic Programs for more details on this duality.) We return to this relationship between probability and simulation below.

The flip function is the simplest XRP in Church, but you will find other XRPs corresponding to familiar probability distributions, such as gaussian, gamma, dirichlet, and so on.

Many but not all of the XRPs and other basic functions implemented in Church can be found on the Automatically generated documentation page.


Building More Complex Programs

A Church program is syntactically composed of nested expressions. Roughly speaking an expression is either a simple value, or anything between parentheses (). In a deterministic LISP, like Scheme, all expressions without free variables have a single fixed value which is computed by a process known as evaluation. For example, consider the following more complex expression:
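The expression analyzed in the next paragraph:

(and true (or true false))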

This expression has an operator, the logical function and, and arguments, true and the subexpression which is itself an application of or. When reasoning about evaluation, it is best to think of evaluating the subexpressions first, then applying the function to the return values of its arguments. In this example or is first applied to true and false, returning true, then and is applied to true and the subexpression's return value, again returning true.

For another example, consider:
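A sketch consistent with the description below; the condition (flip 0.7) is the one discussed, while the branch expressions are illustrative:

(if (flip 0.7)   ; the "condition": a biased coin flip
    (flip)       ; evaluated if the condition is true
    false)       ; evaluated otherwise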

This expression is an if conditional: it evaluates the first expression (here a flip), then evaluates the second expression if the first returned true, or otherwise evaluates the third expression.[4] (We have also used comments here: anything after a semicolon is ignored when evaluating.) Notice that the first flip has an argument: flip with an argument is a biased random choice. In this case this flip will be true with probability 0.7.

We often want to name objects in our programs so that they can be reused. This can be done with the define statement. define looks like this:

(define variable-name expression)

variable-name is a symbol that is bound to the value that expression evaluates to. When variables themselves are evaluated they return the value that they have been bound to:
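A sketch of the program discussed below; the value returned when flip is true is illustrative:

(define some-variable 10)   ; bind the value 10 to the name some-variable
(if (flip)
    "heads"
    some-variable)          ; when flip is false, the variable is evaluated, returning 10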

If you run this program several times you will notice that it sometimes returns 10. This is because when flip returns false the variable some-variable is evaluated, and since it is bound to 10, that value is returned.

Lists and symbols

There are several special kinds of values in Church. One kind of special value is a list: a sequence of other values. Lists can be built using the list function and manipulated using functions such as first (get the first element in a list) and rest (get the rest of the list after the first element). Experiment with different combinations of these functions. What happens when you apply list to the result of list? (In this program we have used another kind of special value: a string, which is anything between double-quotes.)
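A sketch of the kind of program described here; the particular strings are illustrative:

(define my-list (list "this" "is" "a" "list"))   ; build a list of strings
(first my-list)                                  ; => "this"
(rest my-list)                                   ; => ("is" "a" "list")
(list my-list "!")                               ; lists can themselves be elements of lists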

Lists are a fundamental data structure in LISP-like languages. In fact, a Church program is just a list—the list of expressions comprising the program.

Sometimes we wish to treat a piece of Church code "literally", that is, to view it as a value rather than an expression. We can do this using a special quote operation (which can also be written with a single quote: (quote ...) is the same as '...): What is the value of quoted-value? What happens if you remove the quote? Why?
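A sketch; the quoted contents are illustrative, while the name quoted-value is the one referred to above:

(define quoted-value '(+ 1 2))   ; the quote makes this a literal list, not an expression to evaluate
quoted-value                     ; => (+ 1 2), a three-element list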

If we quote the name of a variable, what we get back is a symbol: a value which is the literal variable, rather than the value it is bound to. A symbol is just an identifier; it will be equal to itself, but to no other symbol (see the sketch below). The ability to make new symbols as needed is a crucial feature in building models that reason over unbounded worlds, as we'll see below.
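A sketch; the names bob and jim are illustrative:

(define bob 10)      ; bob is a variable bound to the value 10
(list bob 'bob)      ; => (10 bob): quoting the name gives the symbol, not the value
(equal? 'bob 'bob)   ; => #t: a symbol is equal to itself
(equal? 'bob 'jim)   ; => #f: but to no other symbol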

Example: Causal Models in Medical Diagnosis

Generative knowledge is often causal knowledge. As an example of how causal knowledge can be encoded in Church expressions, consider a simplified medical scenario:
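A sketch of this scenario; the base rate for lung-cancer (0.01) is the one used later in this section, while the base rate for cold is illustrative:

(define lung-cancer (flip 0.01))       ; lung cancer is rare
(define cold (flip 0.2))               ; a cold is common
(define cough (or cold lung-cancer))   ; a cough is caused by either disease (or both)
cough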

This program generates random conditions for a patient in a doctor's office. It first specifies the base rates of two diseases the patient could have: lung cancer is rare while a cold is common, and there is an independent chance of having each disease. The program then specifies a process for generating a common symptom of these diseases -- an effect with two possible causes: The patient coughs if they have a cold or lung cancer (or both). Here is a slightly more complex causal model:
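A sketch of such a model. The disease set, the symptom set, and most of the weights are illustrative; the cough (50%) and fever (30%) rates for a cold match the description below:

(define lung-cancer (flip 0.01))
(define TB (flip 0.005))
(define cold (flip 0.2))
(define stomach-flu (flip 0.1))
(define other (flip 0.1))   ; a catch-all disease category

(define cough
  (or (and cold (flip 0.5)) (and lung-cancer (flip 0.3)) (and TB (flip 0.7)) (and other (flip 0.01))))
(define fever
  (or (and cold (flip 0.3)) (and TB (flip 0.2)) (and other (flip 0.01))))
(define chest-pain
  (or (and lung-cancer (flip 0.5)) (and TB (flip 0.5)) (and other (flip 0.01))))
(define nausea
  (or (and stomach-flu (flip 0.7)) (and other (flip 0.01))))

(list "cough" cough "fever" fever "chest-pain" chest-pain "nausea" nausea)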

Now there are four possible diseases and four symptoms. Each disease causes a different pattern of symptoms. The causal relations are now probabilistic: Only some patients with a cold have a cough (50%), or a fever (30%). There is also a catch-all disease category "other", which has a low probability of causing any symptom. Noisy logical functions, or functions built from and, or, and flip, provide a simple but expressive way to describe probabilistic causal dependencies between Boolean (true-false valued) variables.

When you run the above code, the program generates a list of symptoms for a hypothetical patient. Most likely all the symptoms will be false, as (thankfully) each of these diseases is rare. Experiment with running the program multiple times. Now try modifying the define statement for one of the diseases, setting it to be true, to simulate only patients known to have that disease. For example, replace (define lung-cancer (flip 0.01)) with (define lung-cancer true). Run the program several times to observe the characteristic patterns of symptoms for that disease.

Prediction, Simulation, and Probabilities

Suppose that we flip two fair coins, and return the list of their values (a minimal version of this program is sketched below). How can we predict the return value of this program? For instance, how likely is it that we will see (#t #f)? A probability is a number between 0 and 1 that expresses the answer to such a question: it is a degree of belief that we will see a given outcome, such as (#t #f). The probability of an event <math>A</math> (such as this program returning (#t #f)) is usually written as: <math>P(A)</math>.
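A minimal sketch of this program:

(list (flip) (flip))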

As we did above, we can sample many times and examine the histogram of return values: (The parentheses around random-pair make an object that can be sampled from many times in a single execution of the program; this is described below, but the details aren't important for now.)
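A sketch of such a program; the name random-pair is the one described above, and the histogram label is illustrative:

(define (random-pair) (list (flip) (flip)))   ; the parentheses around random-pair define it as a thunk
(hist (repeat 1000 random-pair) "return values")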

We see by examining this histogram that (#t #f) comes out about 0.25 of the time. We may define the probability of a return value to be the fraction of times (in the long run) that this value is returned from evaluating the program—then the probability of (#t #f) from the above program is 0.25.

Even for very complicated programs we can predict the probability of different outcomes by simulating (sampling from) the program. It is also often useful to compute these probabilities directly by reasoning about the sampling process.

Product Rule

In the above example we take three steps to compute the output value: we sample from the first (flip), then from the second, then we make a list from their values. To make this clearer, let us rewrite the program with each step named (see the sketch below). We can directly observe (as we did above) that the probability of #t for A is 0.5, and the probability of #f from B is 0.5. Can we use these two probabilities to arrive at the probability of 0.25 for the overall outcome C=(#t #f)? Yes, using the product rule of probabilities:

The probability of two random choices is the product of their individual probabilities.
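A sketch of the rewritten program; the names A, B, and C match the discussion above:

(define A (flip))       ; the first random choice
(define B (flip))       ; the second random choice
(define C (list A B))   ; the overall outcome
C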

The probability of several random choices together is often called the joint probability and written as <math>P(A,B)</math>. Since the first and second random choices must each have their specified values in order to get (#t #f) in the example, the joint probability is their product: 0.25.

We must be careful when applying this rule, since the probability of a choice can depend on the outcomes of previous choices. For instance, compute the probability of (#t #f) resulting from this program:
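A sketch in which the second choice depends on the first; the particular weights are illustrative:

(define A (flip))
(define B (flip (if A 0.3 0.7)))   ; the weight of the second flip depends on the value of the first
(list A B)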

In general, the joint probability of two random choices <math>A</math> and <math>B</math> made sequentially, in that order, can be written as <math>P(A,B) = P(A) P(B|A)</math>. This is read as the product of the probability of <math>A</math> and the probability of "<math>B</math> given <math>A</math>", or "<math>B</math> conditioned on <math>A</math>" -- that is, the probability of making choice <math>B</math> given that choice <math>A</math> has been made in a certain way. Only when the second choice is independent of the first choice does this expression reduce to a product of the simple probabilities of each choice individually: <math>P(A,B) = P(A) P(B)</math>.

What is the relation between <math>P(A,B)</math> and <math>P(B,A)</math>, the joint probability of the same choices written in the opposite order? The only logically consistent definitions of probability require that these two probabilities be equal, so <math>P(A) P(B|A) = P(B) P(A|B)</math>. This is the basis of Bayes' theorem, which we will encounter later.

Sum Rule or Marginalization

Now let's consider an example where we can't determine, from the overall return value, the sequence of random choices that were made (sketched below). We can sample from this program and determine that the probability of returning #t is about 0.75.
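The program in question, whose component choices are listed in the next paragraph:

(or (flip) (flip))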

We cannot simply use the product rule to determine this probability because we don't know the sequence of random choices that led to this return value. However we can notice that the program will return true if the two component choices are #t,#t, or #t,#f, or #f,#t. To combine these possibilities we use another rule for probabilities:

If there are two alternative sequences of choices that lead to the same return value, the probability of this return value is the sum of the probabilities of the sequences.

We can write this using probability notation as: <math>P(A) = \sum_{B} P(A,B)</math> , where we view <math>A</math> as the final value and <math>B</math> as the sequence of random choices on the way to that value. Using the product rule we can determine that the probability in the example above is 0.25 for each sequence that leads to return value #t, then, by the sum rule, the probability of #t is 0.25+0.25+0.25=0.75.

Using the sum rule to compute the probability of a final value is called marginalization. From the point of view of sampling processes marginalization is simply ignoring (or not looking at) intermediate random values that are created on the way to a final return value. From the point of view of directly computing probabilities, marginalization is summing over all the possible "histories" that could lead to a return value.

Building Functions and Distributions: lambda

The power of lambda calculus as a model of computation comes from the ability to make new functions. To do so, we use the lambda primitive. For example, we can construct a function that doubles any number it is applied to:
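For instance; the name double and its application to 3 are the ones discussed below:

(define double (lambda (x) (+ x x)))   ; bind the name double to a new function
(double 3)                             ; => 6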

The general form of a lambda expression is:

(lambda arguments body)

The first sub-expression of the lambda, the arguments, is a list of symbols that tells us what the inputs to the function will be called; the second sub-expression, the body, tells us what to do with these inputs. The value which results from a lambda expression is called a compound procedure. When a compound procedure is applied to input values (e.g. when double was applied to 3) we imagine identifying (also called binding) the argument variables with these inputs, then evaluating the body. As another simple example, consider this function:

[Figure: Sicp-lambda-diagram.png, a diagram labeling the parts of a lambda expression.]

We can also use lambda to construct more complex stochastic functions from the primitive ones. Here is a stochastic function that will only sometimes double its input:
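A sketch; the name noisy-double is illustrative:

(define noisy-double (lambda (x) (if (flip) x (+ x x))))   ; returns x unchanged or doubled, at random
(noisy-double 3)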

In lambda calculus we can build procedures that manipulate any kind of value -- even other procedures. Here we define a function twice which takes a procedure and returns a new procedure that applies the original twice:
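A sketch, reusing the double function from above:

(define twice (lambda (f) (lambda (x) (f (f x)))))   ; twice takes a procedure f and returns a new procedure
((twice double) 3)                                   ; applying double twice to 3 gives 12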

This ability to make higher-order functions is what makes the lambda calculus a universal model of computation.

Since we often want to assign names to functions, (define (foo x) ...) is short ("syntactic sugar") for (define foo (lambda (x) ...)). For example, we could have written the previous example as:
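A sketch, assuming the twice example above:

(define (twice f)
  (lambda (x) (f (f x))))
((twice double) 3)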

A lambda expression with an empty argument list, (lambda () ...), is called a thunk: this is a function that takes no input arguments. If we apply a thunk (to no arguments!) we get a return value back, for example (flip), and if we do this many times we can figure out the marginal probability of each return value. Thus a thunk is an object that represents a whole probability distribution. By using higher-order functions we can construct and manipulate probability distributions. A good example comes from coin flipping....


Example: Flipping Coins

The following program defines a fair coin, and flips it 20 times:
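A sketch; h and t are the outcome symbols referred to in the note below:

(define fair-coin (lambda () (if (flip 0.5) 'h 't)))
(repeat 20 fair-coin)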


This program defines a "trick" coin that comes up heads most of the time (95%), and flips it 20 times:
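A sketch of the trick coin:

(define trick-coin (lambda () (if (flip 0.95) 'h 't)))
(repeat 20 trick-coin)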

Note on Church syntax:

A common mistake when defining names for new functions and using them with higher-order functions like repeat is to confuse the name of a thunk with the name of a variable that stands for the output of a single function evaluation. For instance, why doesn't passing trick-coin-1 (see the sketch below) to repeat work? The higher-order function repeat requires as input a thunk: a procedure (or function) with no arguments, as returned by lambda in the examples above. Consider the difference in what is returned by the two definitions sketched below: trick-coin-1 names a variable that is defined to be the result of evaluating the (if ...) expression a single time (either h or t), while trick-coin-2 names a thunk that can serve as the input to the higher-order function repeat. The difference between these two programs becomes particularly subtle when using the (define (name) ...) syntax. Simply putting the name to be defined in parentheses turns a variable definition into a thunk definition: to Church the last two definitions of trick-coin-2 in the sketch are the same -- both produce a thunk -- although superficially the last one looks more similar to the variable definition that binds trick-coin-1 to a single value of h or t.
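A sketch of the definitions being contrasted; the 0.95 weight matches the trick coin above:

(define trick-coin-1 (if (flip 0.95) 'h 't))   ; a variable: the if expression is evaluated once,
                                               ; so trick-coin-1 is bound to a single value, h or t
(repeat 20 trick-coin-1)                       ; this does not work: repeat needs a thunk, not a value

(define trick-coin-2 (lambda () (if (flip 0.95) 'h 't)))   ; a thunk: a fresh flip on every call
(repeat 20 trick-coin-2)

(define (trick-coin-2) (if (flip 0.95) 'h 't))   ; the same thunk, written with the sugared define syntax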

The higher-order function make-coin takes in a weight and outputs a function (a thunk) describing a coin with that weight. Then we can use make-coin to make the coins above, or others.
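A sketch of make-coin and its use to remake the coins above:

(define make-coin
  (lambda (weight)
    (lambda () (if (flip weight) 'h 't))))   ; returns a thunk: a coin of the given weight

(define fair-coin (make-coin 0.5))
(define trick-coin (make-coin 0.95))
(repeat 20 trick-coin)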

We can also define a higher-order function that takes a "coin" and "bends it", as sketched below. Make sure you understand how the bend function works! Why is there an "extra" pair of parentheses outside each make-coin call?
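A sketch of one possible bend function; the weights 0.7 and 0.1 are illustrative, and make-coin is assumed from above:

(define bend
  (lambda (coin)
    (lambda ()
      (if (equal? (coin) 'h)
          ((make-coin 0.7))     ; the "extra" parentheses apply the newly made coin, drawing a sample from it
          ((make-coin 0.1))))))

(define bent-coin (bend (make-coin 0.5)))
(repeat 20 bent-coin)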

Persistent Randomness: mem

Consider building a model with a set of objects that each have a randomly chosen property. For instance, describing the eye colors of a set of people:
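A sketch; uniform-draw and the particular colors are illustrative, but note what happens to bob:

(define eye-color (lambda (person) (uniform-draw '(blue green brown))))
(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))   ; bob's eye color is re-sampled, and may differ from the first answer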

The results of this generative process are clearly wrong: Bob's eye color can change each time we ask about it! What we want is a model in which eye color is random, but persistent. We can do this using another Church primitive: mem. mem is a higher-order function: it takes a procedure and produces a memoized version of the procedure. When a stochastic procedure is memoized, it will sample a random value on the first call, and then return that same value when called with the same arguments thereafter. The resulting memoized procedure has a persistent value within each "run" of the generative model. For instance, consider the equality of two flips, and the equality of two memoized flips:
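A sketch; mem-flip is an illustrative name:

(equal? (flip) (flip))           ; two separate flips: equal only some of the time

(define mem-flip (mem flip))
(equal? (mem-flip) (mem-flip))   ; a memoized flip: the second call returns the stored value, so always #t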

Now returning to the eye color example, we can represent the notion that eye color is random, but each person has a fixed eye color.
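The same sketch as above, with the eye-color procedure wrapped in mem:

(define eye-color (mem (lambda (person) (uniform-draw '(blue green brown)))))
(list (eye-color 'bob)
      (eye-color 'alice)
      (eye-color 'bob))   ; now the first and third answers always agree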

This type of modeling is called random world style (McAllester, Milch, Goodman, 2008). Note that we don't have to specify ahead of time the people whose eye color we will ask about: the distribution on eye colors is implicitly defined over the infinite set of possible people, but only constructed "lazily" when needed. Memoizing stochastic functions thus provides a powerful toolkit to represent and reason about an unbounded set of properties of an unbounded set of objects. For instance, here we define a function flip-n that encodes the outcome of the <math>n</math>th flip of a particular coin:
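A sketch of flip-n; the particular indices are illustrative:

(define flip-n (mem (lambda (n) (flip))))
(list (flip-n 1) (flip-n 12) (flip-n 47) (flip-n 1548)    ; the outcomes of particular flips...
      (flip-n 1) (flip-n 12) (flip-n 47) (flip-n 1548))   ; ...are persistent when asked about again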

There are in principle a countably infinite number of such flips, each independent of all the others. But the outcome of the <math>n</math>th flip -- the 1st, the 12th, the 47th, or the 1548th -- once determined, will always have the same value. Here we define a function that encodes the outcome of the <math>n</math>th flip of the <math>m</math>th coin, a doubly infinite set of properties:
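A sketch; the name flip-n-coin-m and the indices are illustrative:

(define flip-n-coin-m (mem (lambda (n m) (flip))))
(list (flip-n-coin-m 1 1) (flip-n-coin-m 12 1) (flip-n-coin-m 1 2)
      (flip-n-coin-m 1 1))   ; the 1st flip of the 1st coin has the same outcome as before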

In traditional computer science, memoization is an important technique for optimizing programs by avoiding repeated work (see Memoization as Optimization). In the probabilistic setting of Church, memoization goes further: it actually affects the expressivity of the language.

Example: Bayesian Tug of War

Imagine a game of tug of war, where each person may be strong or weak, and may be lazy or not on each match. If a person is lazy they only pull with half their strength. The team that pulls hardest will win. We assume that half of all people are strong, and that on any match, each person has a 1 in 3 chance of being lazy. This program runs a tournament between several teams, mixing up players across teams. Can you guess who is strong or weak, looking at the tournament results?
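A sketch of such a program. The numeric strengths (10 versus 5), the player names, and the team assignments are illustrative; the memoized strength, the 1-in-3 chance of laziness, and pulling at half strength when lazy follow the description above:

(define strength (mem (lambda (person) (if (flip) 10 5))))   ; half of all people are strong
(define lazy (lambda (person) (flip (/ 1 3))))                ; on each match, 1 in 3 chance of being lazy

(define (total-pulling team)
  (apply +
         (map (lambda (person)
                (if (lazy person)
                    (/ (strength person) 2)   ; a lazy person pulls with half their strength
                    (strength person)))
              team)))

(define (winner team1 team2)
  (if (< (total-pulling team1) (total-pulling team2)) team2 team1))

(list (winner '(alice bob) '(sue tom))   ; a small tournament, mixing up players across teams
      (winner '(alice bob) '(sue tom))
      (winner '(alice sue) '(bob tom))
      (winner '(alice sue) '(bob tom))
      (winner '(alice tom) '(bob sue))
      (winner '(alice tom) '(bob sue)))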

Notice that strength is memoized because it is a property of a person that persists across matches, while lazy isn't: laziness is re-sampled on each match. Each time you run this program, however, a new "random world" will be created: people's strengths will be randomly re-generated and then used in all the matches.

Useful Higher-Order Functions

In the above example we have used several useful utility functions. map is a higher-order function that takes a procedure (here built using lambda) and applies it to each element of a list (here the list of team members). After getting the list of how much each team member is pulling, we want to add them up. The procedure + expects to be applied directly to some numbers, as in (+ 1 2 3 4), not to a list of numbers; (+ (1 2 3 4)) would produce an error. So we use the higher-order function apply, which applies a procedure to a list as if the list were direct arguments to the procedure. The standard functions sum and product can be easily defined in terms of (apply + ...) and (apply * ...), respectively. Just to illustrate this:
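For instance; the particular numbers are illustrative:

(define (sum lst) (apply + lst))       ; sum defined in terms of apply, as described above
(define (product lst) (apply * lst))
(define my-list '(3 5 2047))
(list (sum my-list)                    ; => 2055, the same as (apply + my-list)
      (product my-list))               ; => 30705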

Higher-order functions like repeat, map, apply (or sum) can be quite useful. Here we use them to visualize the number of heads we expect to see if we flip a weighted coin (weight = 0.8) 10 times. We'll repeat this experiment 1000 times and then use hist to visualize the results. Try varying the coin weight or the number of repetitions to see how the expected distribution changes.
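A sketch using these functions together; the helper names are illustrative:

(define (weighted-coin) (if (flip 0.8) 1 0))               ; heads counts as 1, tails as 0
(define (num-heads) (apply + (repeat 10 weighted-coin)))   ; number of heads in 10 flips
(hist (repeat 1000 num-heads) "number of heads in 10 flips")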

The map higher-order function can be used to map a function of more than one argument over multiple lists, element by element. For example, here is the MATLAB "dot-star" function (or ".*") written using map:
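A sketch; dot-star is an illustrative name:

(define (dot-star xs ys) (map * xs ys))   ; multiply the two lists element by element
(dot-star '(1 2 3) '(4 5 6))              ; => (4 10 18)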

  1. Cardone and Hindley, 2006. History of Lambda-calculus and Combinatory Logic. In Gabbay and Woods (eds.), Handbook of the History of Logic, vol. 5. Elsevier.
  2. Wikipedia article on LISP.
  3. Wikipedia article on Scheme.
  4. The branching construct, if, is strictly speaking not a function, because it does not evaluate all of its arguments; instead it short-circuits, evaluating only the second or the third depending on the condition. Its application nonetheless has a value, like any other expression.


