- To return to the top level: Probabilistic Models of Cognition.
- "Language is that which makes infinite use of finite means." --Wilhelm von Humboldt
How can we explain the unboundedness of human mental representations? For instance, you could potentially create or comprehend infinitely many sentences in english. Yet all of these sentences are created out of a finite set of words. Furthermore, not every sentence is possible: there is structure to this unboundedness. The non-parametric models we saw in the previous section are not well suited to capturing unbounded but structured spaces, like natural language. In this section we explore more structured unbounded models, defined via recursion.
Since a program is finite, unbounded complexity requires that a single expression represent many computations. For instance, in the program:
(define coin (lambda () (flip 0.1))) (repeat 10 coin)
the lambda expression defining
coin describes the ten computations constructed by repeat. To do this more explicitly, we can write a function that calls itself inside its own body—a recursive function. For instance, we can re-write the above program using a recursive function:
tosses is called ten times, decreasing the counter
K each time until it reaches its base case, 0. As another example, consider computing the length of a list:
There are a number of things to note about this example. First we have used the null? procedure for the base case of the recursion—null? tests to see if a list is empty. This recursion loops though the list, one element at a time, adding one to the result of calling itself on the rest of the list. When it gets to the end of the list it returns 0.
We have already seen several uses of the repeat procedure. This procedure can be used to build up a list of (independent) samples from some thunk. How can this procedure be implemented in Church? Repeat is a higher-order recursive procedure:
One of the most powerful aspects of the LISP family of programming languages is its set of high-order list manipulating functions, such as repeat, map, and fold, which can be written via recursion. For a deeper understanding of these please see Deterministic List Manipulating Functions.
Example: Geometric Distributions
In the above functions the recursion deterministically halted when a condition was met. It is also possible to have a stochastic recursion that randomly decides whether to stop. Importantly, such recursion must be constructed to halt eventually (with probability 1). For example, an important probability distribution is the geometric distribution. The geometric distribution is a distribution over the non-negative integers that represents the probability of flipping a coin <math>N</math> times and getting exactly 1 head. This distribution can be written in Church with the following simple recursion.
Notice that the base case of the recursion is probabilistic. There is no upper bound on how long the computation can go on, although the probability of reaching some number declines quickly as we walk out on the number line.
A Syntactic Version of Bayes Occam's Razor
In previous sections, we discussed Bayesian Occam's razor, which can be seen as a preference for hypotheses which are intrinsically simpler: a hypothesis with less flexibility is preferred if it fits the data equally well. Because this depends on the distribution intrinsic to the hypothesis, we might call this a preference for "semantic" simplicity. When dealing with hypotheses of unbounded representational complexity there is also a "syntactic" simplicity preference: a preference for hypotheses with simpler representation. This emerges from a generative model for hypotheses, rather than from the hypotheses themselves, because of the need to generate more complex hypotheses using more choices. The probability of a particular computation in Church is the product of the probability of all the random choices during that computation. All else being equal, large, complex structures (generated by more random choices) will be less probable than small structures (generated by fewer random choices). Thus penalizing more syntactically complex hypotheses is a side effect of using stochastic recursion to build complex hypotheses, via the Bayesian Occam's razor.
Example: Inferring an Arithmetic Expression
To illustrate this effect, consider the following Church program, which induces an arithmetic function from examples. We generate an expression as a list, and then turn it into a value (in this case a procedure) by using eval—a function that invokes evaluation.
The query asks for an arithmetic expression on variable x such that it evaluates to 3 when x=1. In this example there are many extensionally equivalent ways to satisfy the condition, for instance the constant expressions 3 and (+ 1 2), but because the more complex expressions require more choices to generate, they are chosen less often.
This query has a very "strict" condition: the function must give 3 when applied to 1. As the amount of data increases this strictness will make inference increasingly hard. We can ease inference by relaxing the condition, only requiring equality with high probability. To do so we use a "noisy" equality in the condition:
Try adding in more data consistent with the (+ x 2) rule, e.g., (noisy= (my-proc 4) 6) , (noisy= (my-proc 9) 11) . How do the results of querying on the arithmetic expression change as more consistent data points are observed, and why?
This is an example of a very powerful technique in probabilistic programing: a difficult inference problem can often be relaxed into an easier problem by inserting a noisy operation. Such a relaxation will have a parameter (the noise parameter), and various "temperature" techniques can be used to get samples from the original problem, using samples from the relaxed problem. (Temperature techniques that have been implemented for Church include parallel tempering, tempered transitions, and annealed importance sampling.)
Example: Learning Compositional Concepts
How can we account for the productivity of human concepts (the fact that every child learns a remarkable number of different, complex concepts)? The "classical" theory of concepts formation accounted for this productivity by hypothesizing that concepts are represented compositionally, by logical combination of the features of objects (see for example Bruner, Goodnow, and Austin, 1951). That is, concepts could be thought of as rules for classifying objects (in or out of the concept) and concept learning was a process of deducing the correct rule.
While this theory was appealing for many reasons, it failed to account for a variety of categorization experiments. Here are the training examples, and one transfer example, from the classic experiment of Medin and Schaffer (1978). The bar graph above the stimuli shows the portion of human participants who said that bug was a "fep" in the test phase (the data comes from a replication by Nosofsky, Gluck, Palmeri, McKinley (1994); the bug stimuli are courtesy of Pat Shafto):
Notice three effects: there is a gradient of generalization (rather than all-or-nothing classification), some of the Feps are better (or more typical) than others (this is called "typicality"), and the transfer item is a better Fep than any of the Fep exemplars (this is called "prototype enhancement"). Effects like these were difficult to capture with classical rule-based models of category learning, which led to deterministic behavior. As a result of such difficulties, psychological models of category learning turned to more uncertain, prototype and exemplar based theories of concept representation. These models were able to predict behavioral data very well, but lacked compositional conceptual structure.
Is it possible to get graded effects from rule-based concepts? Perhaps these effects are driven by uncertainty in learning rather than uncertainty in the representations themselves? To explore these questions Goodman, Tenenbaum, Feldman, and Griffiths (2008) introduced the Rational Rules model, which learns deterministic rules by probabilistic inference. This model has an infinite hypothesis space of rules (represented in propositional logic), which are generated compositionally. Here is a slightly simplified version of the model, applied to the above experiment:
Goodman, et al, have used to this model to capture a variety of classic categorization effects . Thus probabilistic induction of (deterministic) rules can capture many of the graded effects previously taken as evidence against rule-based models.
This style of compositional concept induction model, can be naturally extended to more complex hypothesis spaces . It has been used to model theory acquisition, learning natural numbers concepts, etc. Further, there is no reason that the concepts need to be deterministic: in Church stochastic functions can be constructed compositionally and learned by induction .
One area in which recursively defined generative processes are used extensively is in models of natural language. The goal in these models is to capture the statistical structure in sequences of words. In the next few sections we will explore a series of increasingly complex models for such data.
N-gram and Hierarchical N-gram Models
Many systems in computational linguistics make use of n-gram models. An n-gram model is a drastically simplified model of sentence structure where the distribution over words in a particular position depends on the last <math>(N-1)</math> words. N-gram models are a simple example of an unbounded model where the structure of the computation itself depends on probabilistic choices. Here is the Church code for a simple bigram model, also called a Markov model, where the next word depends on the last.
Here we have used a case statement, which is another Church branching construct (a short form for nested ifs, see the case statement). sample-words is a recursion which builds up a list of words by sampling the next word conditional on the last. Each word is sampled from a multinomial distribution whose parameters are fixed. We start the recursion by sampling conditional on the special symbol start. When we sample the symbol stop we end the recursion. Like the geometric distribution defined above, this stochastic recursion can produce unbounded structures—in this case lists of words of arbitrary length.
The above code may seem unnecessarily complex because it explicitly lists every transition probability. Suppose that we put a prior distribution on the multinomial transitions in the n-gram model. Using mem this is very straightforward:
We now have a hierarchical n-gram model, which is both simpler to write and more powerful in the sense that it can learn the transition probabilities (rather than requiring them to be given).
Hidden Markov Models
Another popular model in computational linguistics is the hidden Markov model (HMM). The HMM extends the Markov model by assuming that the "actual" states aren't observable. Instead there is an observation model that generates an observation from each "hidden state". We use the same construction as above to generate an unknown observation model.
Infinite Hidden Markov Models
Just as when we considered the unknown latent categories, we may wish to have a hidden Markov model over an unknown number of latent symbols. We can do this by again using a reusable source of state symbols: This model is known as the infinite hidden Markov model. Notice how the transition model uses a separate DPmemoized function for each latent state: with some probability it will reuse a transition from this state, otherwise it will transition to a new state drawn from the globally shared source or state symbols—a DPmemoized gensym.
Probabilistic Context-free Grammars
The models above generate sequences of words, but lack constituent structure (or "hierarchical structure" in the linguistic sense).
Probabilistic context-free grammars (PCFGs) are a straightforward (and canonical) way to generate sequences of words with constituent structure. There are many ways to write a PCFG in Church. One especially direct way (inspired by Prolog programming) is to let each non-terminal be represented by a Church procedure; here constituency is embodied by one procedure calling another—that is by causal dependence.
We have used the procedure sample, which applies a thunk (to no arguments), resulting in a sample. sample is in fact trivial—it can be defined by:
(define (sample distribution) (apply distribution '()))
Now, let's look at one of the procedures defining our PCFG in detail.
(define VP (lambda () (map sample (multinomial (list (list V AP) (list V NP)) (list (/ 1 2) (/ 1 2))))))
When VP is called it maps sample across a list which is sampled from a multinomial distribution: in this case either (V AP) or (V NP). These two lists correspond to the right-hand sides (RHSs) of the rules <math>VP \longrightarrow V AP</math> and <math>VP \longrightarrow V NP</math> in the standard representation of PCFGs. These are lists that consist of symbols which are actually the names of other procedures. Therefore when sample is applied to them, they themselves recursively sample their RHSs until no more recursion can take place. Note that we have wrapped our terminal symbols up as thunks so that when they are sampled they deterministically return the terminal symbol.
While it is most common to use PCFGs as models of strings (for linguistic applications), they can be useful as components of any probabilistic model where constituent structure is required. For instance, if you examine the Rational Rules model above, you will note that the hypothesis space of rules is generated from a PCFG—here the constituent structure is used to ensure that the parts of rules combine together into meaningful rules, that is, to provide a compositional semantics.
LISP is famous for its higher-order list manipulating functions (which you can read about here: Deterministic List Manipulating Functions.) These functions extract common patterns of recursion, resulting in clearer more parsimonious code.
One particular function, called fold is especially powerful: it can be used to do any list-based recursion. In the probabilistic setting there exists an important related procedure called unfold which recursively builds up lists. unfold takes three arguments. The first is some object, the second is a transition function, which returns the next object given the last one. The last argument is a predicate that stops the recursion.
(define (unfold current transition stop?) (if (stop? current) '() (pair current (unfold (transition current) transition stop?))))
With unfold defined we can now refactor our bigram model.
One thing to note about the PCFG example is that the each of the procedures is nearly identical. This suggests that we could write the grammar more succinctly. It turns out that there is a generalization of unfold called tree-unfold which will do the trick. Using tree-unfold we can rewrite our PCFG in a way that abstracts out the recursive structure, and looks much more like the standard notation for PCFGs:
Note that this samples a hierarchical (or "parenthesized") sequence of terminals. How would you "flatten" this to return a sequence without parentheses?
Exercise: write a version of the preceding PCFG that draws the RHS distributions from a Dirichlet distribution (as in the hierarchical n-gram model).
- ↑ A rational analysis of rule-based concept learning. N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths (2008). Cognitive Science. 32:1, 108-154.
- ↑ For example: Compositionality in rational analysis: Grammar-based induction for concept learning. N. D. Goodman, J. B. Tenenbaum, T. L. Griffiths, and J. Feldman (2008). In M. Oaksford and N. Chater (Eds.). The probabilistic mind: Prospects for Bayesian cognitive science. A Bayesian Model of the Acquisition of Compositional Semantics. S. T. Piantadosi, N. D. Goodman, B. A. Ellis, and J. B. Tenenbaum (2008). Proceedings of the Thirtieth Annual Conference of the Cognitive Science Society.
- ↑ Learning Structured Generative Concepts. A. Stuhlmueller, J. B. Tenenbaum, and N. D. Goodman (2010). Proceedings of the Thirty-Second Annual Conference of the Cognitive Science Society.