Recursive Models (ESSLLI)
We have so far seen many bounded models: models with a finite set of random choices. In these models it is possible to enumerate all of the random choices that could have been made, even though we might not bother to make any given random choice during a given evaluation. For instance, (if (flip) 3 (flip 0.1)) often won't make the (flip 0.1) choice, but we know that there are two possible choices. In this section we will explore models that have an infinite set of possible random choices, though only finitely many will be made in any given evaluation. These models are defined using recursion.
Since a program is finite, unbounded complexity requires that a single expression represent many computations. For instance, in the program:

(define coin (lambda () (flip 0.1)))
(repeat 10 coin)

the lambda expression defining coin describes the ten computations constructed by repeat. To do this more explicitly, we can write a function that calls itself inside its own body—a recursive function. For instance, we can re-write the above program using a recursive function:
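One way to write this recursion (a sketch consistent with the description below; the counter name K matches the prose):

```scheme
(define (tosses K)
  (if (= K 0)
      '()
      (pair (flip 0.1) (tosses (- K 1)))))

(tosses 10)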
tosses is called ten times, decreasing the counter K each time until it reaches its base case, 0. As another example, consider computing the length of a list:
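A sketch of this recursion in Church:

```scheme
(define (length lst)
  (if (null? lst)
      0
      (+ 1 (length (rest lst)))))

(length '(a b c))
```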
There are a number of things to note about this example. First, we have used the null? procedure for the base case of the recursion—null? tests whether a list is empty. The recursion loops through the list, one element at a time, adding one to the result of calling itself on the rest of the list. When it gets to the end of the list it returns 0.
We have already seen several uses of the repeat procedure. This procedure can be used to build up a list of (independent) samples from some thunk. How can this procedure be implemented in Church? Repeat is what is called a higher-order procedure: a procedure that takes another procedure as an argument, or returns it as a value.
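One possible recursive implementation (a sketch):

```scheme
(define (repeat K thunk)
  (if (= K 0)
      '()
      (pair (thunk) (repeat (- K 1) thunk))))
```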
One of the most powerful aspects of the LISP family of programming languages is its set of higher-order list manipulating functions, such as repeat, map, and fold, which can be written via recursion. For a deeper understanding of these please see Deterministic List Manipulating Functions.
Example: Geometric Distributions
In the above functions the recursion deterministically halted when a condition was met. It is also possible to have a stochastic recursion that randomly decides whether to stop. Importantly, such a recursion must be constructed to halt eventually (with probability 1). For example, an important probability distribution is the geometric distribution: a distribution over the non-negative integers that gives the probability of flipping a coin <math>N</math> times before getting the first head (that is, of seeing <math>N</math> tails followed by a head). This distribution can be written in Church with the following simple recursion.
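A sketch of this recursion, where the probability of heads, p, is a parameter:

```scheme
(define (geometric p)
  (if (flip p)
      0
      (+ 1 (geometric p))))

(geometric 0.5)
```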
Notice that the base case of the recursion is probabilistic. There is no upper bound on how long the computation can go on, although the probability of reaching some number declines quickly as we walk out on the number line.
A Syntactic Version of Bayes Occam's Razor
We discussed the Bayesian Occam's razor above. This can be seen as a preference for hypotheses which are intrinsically simpler: a hypothesis with less flexibility is preferred if it fits the data equally well. Because this depends on the distribution intrinsic to the hypothesis, we might call this a preference for "semantic" simplicity. When dealing with hypotheses of unbounded representational complexity there is also a "syntactic" simplicity preference: a preference for hypotheses with simpler representations. This emerges from a generative model for hypotheses, rather than from the hypotheses themselves, because more complex hypotheses require more choices to generate. The probability of a particular computation in Church is the product of the probabilities of all the XRP choices made during that computation. All else being equal, large, complex structures generated by more probabilistic choices (more XRP evaluations) will be less probable than small structures generated by fewer random choices. Thus a penalty on syntactically more complex hypotheses emerges as a side effect of building hypotheses by stochastic recursion and then applying the Bayesian Occam's razor.
Example: Inferring an Arithmetic Expression
To illustrate this effect, consider the following Church program, which induces an arithmetic function from examples. We generate an expression as a list, and then turn it into a value (in this case a procedure) by using eval—a function that invokes evaluation.
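A sketch of such a model (the expression grammar, the weights, and the use of mh-query here are illustrative assumptions, not the original listing):

```scheme
(define (random-arithmetic-expression)
  (if (flip 0.7)
      ;; terminal: the variable x or a small constant
      (if (flip) 'x (sample-integer 10))
      ;; non-terminal: an operator applied to two sub-expressions
      (list (uniform-draw '(+ -))
            (random-arithmetic-expression)
            (random-arithmetic-expression))))

(mh-query
 100 100
 (define my-expression (random-arithmetic-expression))
 (define my-procedure (eval (list 'lambda '(x) my-expression)))
 my-expression
 (= (my-procedure 1) 3))
```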
The query asks for an arithmetic expression on variable x such that it evaluates to 3 when x=1. In this example there are many extensionally equivalent ways to satisfy the condition, for instance the constant expressions 3 and (+ 1 2), but because the more complex expressions require more choices to generate, they are chosen less often.
This query has a very "strict" condition: the function must give 3 when applied to 1. As the amount of data increases this strictness will make inference increasingly hard. We can ease inference by relaxing the condition, only requiring equality with high probability. To do so we use a "noisy" equality in the condition:
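For instance, a noisy equality might be defined as follows (the noise level 0.1 is an arbitrary choice); the condition then becomes (noisy= (f 1) 3) for the induced function f:

```scheme
(define (noisy= x y)
  (flip (if (= x y) 1.0 0.1)))
```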
This is an example of a very powerful technique in probabilistic programming: a difficult inference problem can often be relaxed into an easier problem by inserting a noisy operation. Such a relaxation will have a parameter (the noise parameter), and various "temperature" techniques can be used to get samples from the original problem using samples from the relaxed problem. (Temperature techniques that have been implemented for Church include parallel tempering, tempered transitions, and annealed importance sampling.)
Example: Learning Compositional Concepts
How can we account for the productivity of human concepts (the fact that every child learns a remarkable number of different, complex concepts)? The "classical" theory of concept formation accounted for this productivity by hypothesizing that concepts are represented compositionally, by logical combination of the features of objects (see for example Bruner, Goodnow, and Austin, 1956). That is, concepts could be thought of as rules for classifying objects (in or out of the concept), and concept learning was a process of deducing the correct rule.
While this theory was appealing for many reasons, it failed to account for a variety of categorization experiments. Here are the training examples, and one transfer example, from the classic experiment of Medin and Schaffer (1978). The bar graph above the stimuli shows the proportion of human participants who said each bug was a "fep" in the test phase (the data come from a replication by Nosofsky, Gluck, Palmeri, and McKinley (1994); the bug stimuli are courtesy of Pat Shafto):
Notice three effects: there is a gradient of generalization (rather than all-or-nothing classification), some of the Feps are better (or more typical) than others (this is called "typicality"), and the transfer item is a better Fep than any of the Fep exemplars (this is called "prototype enhancement"). Effects like these were difficult to capture with classical rule-based models of category learning, which predicted deterministic behavior. As a result of such difficulties, psychological models of category learning turned to more graded, prototype- and exemplar-based theories of concept representation. These models were able to predict behavioral data very well, but lacked compositional conceptual structure.
Is it possible to get graded effects from rule-based concepts? Perhaps these effects are driven by uncertainty in learning rather than uncertainty in the representations themselves? To explore these questions Goodman, Tenenbaum, Feldman, and Griffiths (2008) introduced the Rational Rules model, which learns deterministic rules by probabilistic inference. This model has an infinite hypothesis space of rules (represented in propositional logic), which are generated compositionally. Here is a slightly simplified version of the model, applied to the above experiment:
Goodman et al. have used this model to capture a variety of classic categorization effects. Thus probabilistic induction of (deterministic) rules can capture many of the graded effects previously taken as evidence against rule-based models.
This style of compositional concept induction model can be naturally extended to more complex hypothesis spaces. It has been used to model theory acquisition, learning natural number concepts, and so on. Further, there is no reason that the concepts need to be deterministic: in Church, stochastic functions can be constructed compositionally and learned by induction.
One area in which recursively defined generative processes are used extensively is in language models. In the next few sections we will explore a series of language models of increasing complexity.
N-gram and Hierarchical N-gram Models
Many systems in computational linguistics make use of n-gram models. An n-gram model is a drastically simplified model of sentence structure where the distribution over words in a particular position depends on the last <math>(N-1)</math> words. N-gram models are a simple example of an unbounded model where the structure of the computation itself depends on probabilistic choices. Here is the Church code for a simple bigram model, also called a Markov model, where the next word depends on the last.
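A sketch of such a model, with a toy vocabulary and made-up transition probabilities:

```scheme
(define (sample-words last-word)
  (if (equal? last-word 'stop)
      '()
      (pair last-word
            (sample-words
             (case last-word
               (('start)  (multinomial '(chef omelet soup stop) '(0.4 0.3 0.2 0.1)))
               (('chef)   (multinomial '(omelet soup stop)      '(0.5 0.3 0.2)))
               (('omelet) (multinomial '(soup chef stop)        '(0.3 0.3 0.4)))
               (('soup)   (multinomial '(chef omelet stop)      '(0.3 0.3 0.4)))
               (else 'stop))))))

(rest (sample-words 'start)) ;; drop the initial start symbol
```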
Here we have used a case statement, which is another Church branching construct (see the case statement). sample-words is a recursive procedure that builds up a list of words by sampling each next word conditional on the last. Each word is sampled from a multinomial distribution whose parameters are fixed. We start the recursion by sampling conditional on the special symbol start; when we sample the symbol stop we end the recursion. Like the geometric distribution defined above, this stochastic recursion can produce unbounded structures—in this case lists of words of arbitrary length.
The above code may seem unnecessarily complex because it explicitly lists every transition probability. Suppose that we put a prior distribution on the multinomial transitions in the n-gram model. Using mem this is very straightforward:
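A sketch of this construction (the symmetric Dirichlet prior and the toy vocabulary are assumptions):

```scheme
(define vocabulary '(chef omelet soup eat work bake stop))

;; a memoized draw from a Dirichlet prior for each context word
(define word->transition-probs
  (mem (lambda (word)
         (dirichlet (make-list (length vocabulary) 1)))))

(define (transition word)
  (multinomial vocabulary (word->transition-probs word)))

(define (sample-words last-word)
  (if (equal? last-word 'stop)
      '()
      (pair last-word (sample-words (transition last-word)))))

(rest (sample-words 'start))
```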
We now have a hierarchical n-gram model, which is both simpler to write and more powerful in the sense that it can learn the transition probabilities (rather than requiring them to be given).
Hidden Markov Models
Another popular model in computational linguistics is the hidden Markov model (HMM). The HMM extends the Markov model by assuming that the "actual" states aren't observable. Instead there is an observation model that generates an observation from each "hidden state". We use the same construction as above to generate an unknown observation model.
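A sketch, using mem both for an unknown transition model over hidden states and for an unknown observation model (the state names and vocabulary are made up):

```scheme
(define states '(s1 s2 s3 stop))
(define vocabulary '(chef omelet soup eat))

;; unknown transition model over hidden states
(define state->transition-probs
  (mem (lambda (state) (dirichlet (make-list (length states) 1)))))
(define (transition state)
  (multinomial states (state->transition-probs state)))

;; unknown observation model for each hidden state
(define state->observation-probs
  (mem (lambda (state) (dirichlet (make-list (length vocabulary) 1)))))
(define (observation state)
  (multinomial vocabulary (state->observation-probs state)))

(define (sample-observations state)
  (if (equal? state 'stop)
      '()
      (pair (observation state)
            (sample-observations (transition state)))))

(sample-observations 'start)
```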
Probabilistic Context-free Grammars
The models above generate sequences of words, but lack constituent structure (or "hierarchical structure" in the linguistic sense).
Probabilistic context-free grammars (PCFGs) are a straightforward (and canonical) way to generate sequences of words with constituent structure. There are many ways to write a PCFG in Church. One especially direct way (inspired by Prolog programming) is to let non-terminals be represented by Church procedures, with constituency embodied by one procedure calling another—that is, by causal dependence.
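A sketch of a small grammar in this style (the particular rules and words are illustrative; terminals are wrapped in thunks, as the prose notes):

```scheme
(define (terminal t) (lambda () t))

(define D  (lambda () (map sample (multinomial
                                   (list (list (terminal 'the)) (list (terminal 'a)))
                                   (list (/ 1 2) (/ 1 2))))))
(define N  (lambda () (map sample (multinomial
                                   (list (list (terminal 'chef)) (list (terminal 'soup)))
                                   (list (/ 1 2) (/ 1 2))))))
(define V  (lambda () (map sample (multinomial
                                   (list (list (terminal 'cooks)) (list (terminal 'works)))
                                   (list (/ 1 2) (/ 1 2))))))
(define AP (lambda () (map sample (multinomial
                                   (list (list (terminal 'diligently)))
                                   (list 1.0)))))
(define NP (lambda () (map sample (multinomial
                                   (list (list D N))
                                   (list 1.0)))))
(define VP (lambda () (map sample (multinomial
                                   (list (list V AP) (list V NP))
                                   (list (/ 1 2) (/ 1 2))))))
(define S  (lambda () (map sample (multinomial
                                   (list (list NP VP))
                                   (list 1.0)))))

(sample S)
```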
We have used the procedure sample, which applies a thunk (to no arguments), resulting in a sample. sample is in fact trivial—it can be defined by:
(define (sample distribution) (apply distribution '()))
Now, let's look at one of the procedures defining our PCFG in detail.
(define VP
  (lambda ()
    (map sample
         (multinomial (list (list V AP) (list V NP))
                      (list (/ 1 2) (/ 1 2))))))
When VP is called it maps sample across a list which is sampled from a multinomial distribution: in this case either (V AP) or (V NP). These two lists correspond to the right-hand sides (RHSs) of the rules <math>VP \longrightarrow V AP</math> and <math>VP \longrightarrow V NP</math> in the standard representation of PCFGs. These are lists that consist of symbols which are actually the names of other procedures. Therefore when sample is applied to them, they themselves recursively sample their RHSs until no more recursion can take place. Note that we have wrapped our terminals up in thunks so that when they are sampled they return the terminal symbol.
While it is most common to use PCFGs as models of strings (for linguistic applications), they can be useful as components of any probabilistic model where constituent structure is required. For instance, if you examine the Rational Rules model above, you will note that the hypothesis space of rules is generated from a PCFG—here the constituent structure is used to ensure that the parts of rules combine together into meaningful rules, that is, to provide a compositional semantics.
LISP is famous for its higher-order list manipulating functions (which you can read about here: Deterministic List Manipulating Functions.) These functions extract common patterns of recursion, resulting in clearer more parsimonious code.
One particular function, fold, is especially powerful: it can be used to express any list-based recursion. In the probabilistic setting there is an important related procedure called unfold, which recursively builds up lists. unfold takes three arguments: the first is some object; the second is a transition function, which returns the next object given the last one; the last is a predicate that stops the recursion.
(define (unfold current transition stop?)
  (if (stop? current)
      '()
      (pair current
            (unfold (transition current) transition stop?))))
With unfold defined we can now refactor our bigram model.
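Assuming a transition procedure that maps a word to the next word, as in the bigram model above, the refactored sampler is simply (a sketch):

```scheme
(define (sample-words)
  (rest (unfold 'start transition (lambda (word) (equal? word 'stop)))))
```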
One thing to note about the PCFG example is that each of the procedures is nearly identical. This suggests that we could write the grammar more succinctly. It turns out that there is a generalization of unfold called tree-unfold which will do the trick. Using tree-unfold we can rewrite our PCFG in a way that abstracts out the recursive structure, and looks much more like the standard notation for PCFGs:
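A sketch of tree-unfold and a grammar written with it (the terminal? test and the rule notation are illustrative assumptions):

```scheme
(define grammar-rules
  ;; each non-terminal maps to a list of RHSs and their probabilities
  (list (list 'S  (list '(NP VP))        (list 1.0))
        (list 'NP (list '(D N))          (list 1.0))
        (list 'VP (list '(V AP) '(V NP)) (list 0.5 0.5))
        (list 'D  (list '(the) '(a))     (list 0.5 0.5))
        (list 'N  (list '(chef) '(soup)) (list 0.5 0.5))
        (list 'V  (list '(cooks))        (list 1.0))
        (list 'AP (list '(diligently))   (list 1.0))))

(define (terminal? symbol)
  (not (assoc symbol grammar-rules)))

(define (transition symbol)
  (let ((rule (rest (assoc symbol grammar-rules))))
    (multinomial (first rule) (second rule))))

(define (tree-unfold transition symbol)
  (if (terminal? symbol)
      symbol
      (pair symbol
            (map (lambda (s) (tree-unfold transition s))
                 (transition symbol)))))

(tree-unfold transition 'S)
```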
Exercise: write a version of the preceding PCFG that draws the RHS distributions from a Dirichlet distribution.