Hierarchical Models (ESSLLI)
Human knowledge is organized hierarchically into levels of abstraction. For instance, the most common or basic-level categories (e.g. dog, car) can be thought of as abstractions across individuals, or more often across subordinate categories (e.g., poodle, Dalmatian, Labrador, and so on). Multiple basic-level categories in turn can be organized under superordinate categories: e.g., dog, cat, horse are all animals; car, truck, bus are all vehicles. Some of the deepest questions of cognitive development are: How does abstract knowledge influence learning of specific knowledge? How can abstract knowledge be learned? In this section we will see how such hierarchical knowledge can be modeled with hierarchical generative models: generative models with uncertainty at several levels, where lower levels depend on choices at higher levels.
Hierarchical models allow us to capture the shared latent structure underlying observations of multiple related concepts, processes, or systems -- to abstract out the elements in common to the different sub-concepts, and to filter away uninteresting or irrelevant differences. Perhaps the most familiar example of this problem occurs in learning about categories. Consider a child learning about a basic-level kind, such as dog or car. Each of these kinds has a prototype or set of characteristic features, and our question here is simply how that prototype is acquired.
The task is challenging because real-world categories are not homogeneous. A basic-level category like dog or car actually spans many different subtypes: e.g., poodle, Dalmatian, Labrador, and such, or sedan, coupe, convertible, wagon, and so on. The child observes examples of these sub-kinds or subordinate-level categories: a few poodles, one Dalmatian, three Labradors, etc. From this data she must infer what it means to be a dog in general, in addition to what each of these different kinds of dog is like. Knowledge about the prototype level includes understanding what it means to be a prototypical dog and what it means to be non-prototypical, but still a dog. This will involve understanding that dogs come in different breeds which share features between them, but also differ systematically.
As a simplification of this situation consider the following generative process. We will draw marbles out of several different bags. There are five marble colors. Each bag has a certain "prototypical" mixture of colors. This generative process is represented in the following Church example using the Dirichlet distribution (the Dirichlet is the higher-dimensional analogue of the Beta distribution).
Note that we are using the operator mem that we introduced in the first part of the tutorial. mem is particularly useful when writing hierarchical models because it allows us to associate arbitrary random draws with categories across entire runs of the program. In this case it allows us to associate a particular mixture of marble colors with each bag. The mixture is drawn once, and then remains the same thereafter for that bag.
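The Church code for this example appears in a ChurchServ box. As a rough stand-in, here is a sketch of the same generative process in Python, using functools.cache to play the role of mem; the function and bag names are transliterations for illustration, not part of any library:

```python
import random
from functools import cache

COLORS = ['blue', 'green', 'red', 'black', 'orange']

def dirichlet(alphas):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

@cache  # plays the role of Church's mem: one mixture per bag, fixed thereafter
def bag_prototype(bag):
    return tuple(dirichlet([1.0] * len(COLORS)))

def draw_marbles(bag, n):
    """Draw n marbles from the bag's characteristic color mixture."""
    return random.choices(COLORS, weights=bag_prototype(bag), k=n)

# Four samples of 20 marbles from one bag look alike; different bags differ.
for _ in range(4):
    print(draw_marbles('bag', 20))
```

Because bag_prototype is memoized, every call for the same bag returns the identical mixture, just as mem guarantees in Church.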
Run the code above multiple times. Each run creates a single bag of marbles with its characteristic distribution of marble colors, and then draws four samples of 20 marbles each. Intuitively, you can see how each sample is sufficient to learn a lot about what that bag is like; there is typically a fair amount of similarity between the empirical color distributions in each of the four samples from a given bag. In contrast, you should see a lot more variation across different runs of the code -- samples from different bags.
Now let's add a few twists: we will generate three different bags, and try to learn about their respective color prototypes by conditioning on observations. We represent the results of learning in terms of the posterior predictive distribution for each bag: a single hypothetical draw from the bag, using an expression like (draw-marbles 'bag-1 1). We will also draw a sample from the posterior predictive distribution on a new bag, for which we have no observations.
This generative model describes the prototype mixtures in each bag, but it does not attempt to learn a common higher-order prototype. It is like learning separate prototypes for subordinate classes poodle, Dalmatian, and Labrador, without learning a prototype for the higher-level kind dog -- or learning about any features that are shared across the different lower-level classes or bags. Specifically, inference suggests that each bag is predominantly blue, but with a fair amount of residual uncertainty about what other colors might be seen. There is no information shared across bags, and nothing significant is learned about bag-n as it has no observations and no structure shared with the bags that have been observed.
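Since each bag's mixture gets its own independent Dirichlet prior, the posterior predictive for any bag depends only on that bag's own counts. Assuming a uniform Dirichlet(1,...,1) prior (an illustrative choice), the predictive has a closed form, which this hypothetical Python sketch computes directly:

```python
from fractions import Fraction

COLORS = ['blue', 'green', 'red', 'black', 'orange']

def posterior_predictive(observed):
    """Posterior predictive for one bag under an independent, uniform
    Dirichlet(1,...,1) prior: probability of color c is (count + 1) / (n + 5)."""
    n = len(observed)
    return {c: Fraction(observed.count(c) + 1, n + len(COLORS)) for c in COLORS}

# A bag observed as five blues and one orange is predominantly blue, with
# residual uncertainty spread over the unseen colors.
print(posterior_predictive(['blue'] * 5 + ['orange']))  # blue: 6/11

# An unobserved bag-n stays completely flat: nothing is shared across bags.
print(posterior_predictive([]))  # 1/5 for every color
```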
Now let us introduce another level of abstraction: a global prototype that provides a prior on the specific prototype mixtures of each bag.
Compared with inferences in the previous example, this extra level of abstraction enables faster learning: more confidence in what each bag is like based on the same observed sample. This is because all of the observed samples suggest a common prototype structure, with most of its weight on blue and the rest of the weight spread uniformly among the remaining colors. Statisticians sometimes refer to this phenomenon of inference in hierarchical models as "sharing of statistical strength": it is as if the sample we observe for each bag also provides a weaker indirect sample relevant to the other bags. In machine learning and cognitive science this phenomenon is often called learning to learn or transfer learning. Intuitively, knowing something about bags in general allows the learner to transfer knowledge gained from draws from one bag to other bags. This example is analogous to seeing several examples of different subtypes of dogs and learning what features are in common to the more abstract basic-level dog prototype, independent of the more idiosyncratic features of particular dog subtypes.
Learning about shared structure at a higher level of abstraction also supports inferences about new bags without observing any examples from that bag: a hypothetical new bag could produce any color, but is likely to have more blue marbles than any other color. Analogously, we can imagine hypothetical, previously unseen, new subtypes of dogs that share the basic features of dogs with more familiar kinds but differ in other idiosyncratic ways.
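One crude way to see this effect numerically, without running the Church query, is an empirical-Bayes-style approximation in Python: estimate the shared prototype from the pooled counts across bags, then use it as pseudo-counts for each individual bag. The observations and the strength constant below are illustrative, not the exact ones from the model above:

```python
from collections import Counter

COLORS = ['blue', 'green', 'red', 'black', 'orange']

def shared_predictive(bags, strength=2.0):
    """Crude empirical-Bayes stand-in for the hierarchical posterior:
    estimate the shared prototype phi from the pooled counts, then use it
    (scaled by `strength`) as pseudo-counts for each bag's predictive."""
    pooled = Counter(c for obs in bags.values() for c in obs)
    total = sum(pooled.values())
    phi = {c: (pooled[c] + 1) / (total + len(COLORS)) for c in COLORS}
    predictive = {}
    for bag, obs in bags.items():
        n = len(obs)
        predictive[bag] = {c: (obs.count(c) + strength * phi[c]) / (n + strength)
                           for c in COLORS}
    # A brand-new bag is predicted from the shared prototype alone.
    predictive['bag-n'] = phi
    return predictive

bags = {'bag-1': ['blue'] * 5 + ['red'],
        'bag-2': ['blue'] * 5 + ['green'],
        'bag-3': ['blue'] * 5 + ['orange']}
pred = shared_predictive(bags)
print(pred['bag-n'])  # mostly blue, even with no draws from bag-n
```

Each bag's predictive now borrows strength from every other bag through phi, which is what "sharing of statistical strength" means in this setting.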
A particularly striking example of "sharing statistical strength" or "learning to learn" can be seen if we change the observed sample for bag 3 to have only two examples, one blue and one orange. Replace the line (equal? (draw-marbles 'bag-3 6) '(blue blue blue blue blue orange)) with:
(equal? (draw-marbles 'bag-3 2) '(blue orange))
in both programs above. In a situation where we have no shared higher-order prototype structure, inference for bag-3 from these observations suggests that blue and orange are equally likely. However, when we have inferred a shared higher-order prototype, then the inferences we make for bag 3 look much more like those we made before, with six observations (five blue, one orange), because the learned higher-order prototype tells us that blue is most likely to be highly represented in any bag regardless of which other colors (here, orange) may be seen with lower probability.
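A back-of-the-envelope version of this contrast in Python (the shared prototype phi and its strength below are illustrative values, not inferred ones):

```python
COLORS = ['blue', 'green', 'red', 'black', 'orange']

# Without sharing: bag-3's predictive uses only its own two draws, under a
# uniform Dirichlet(1,...,1) prior, so blue and orange come out equal.
obs = ['blue', 'orange']
independent = {c: (obs.count(c) + 1) / (len(obs) + len(COLORS)) for c in COLORS}
assert independent['blue'] == independent['orange']  # 2/7 each: no preference

# With a shared prototype already weighted toward blue (illustrative values),
# the same two draws still yield a strongly blue predictive.
phi = {'blue': 0.7, 'green': 0.075, 'red': 0.075, 'black': 0.075, 'orange': 0.075}
strength = 5.0
shared = {c: (obs.count(c) + strength * phi[c]) / (len(obs) + strength)
          for c in COLORS}
print(shared['blue'], shared['orange'])  # blue dominates orange
```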
Learning Overhypotheses: Abstraction at the Superordinate Level
Hierarchical models allow us to capture a more abstract and even more important "learning to learn" phenomenon, sometimes called learning overhypotheses. Consider how a child learns about living creatures (an example we adapt from the psychologists Liz Shipley and Rob Goldstone). We learn about specific kinds of animals -- dogs, cats, horses, and more exotic creatures like elephants, ants, spiders, sparrows, eagles, dolphins, goldfish, snakes, worms, centipedes -- from examples of each kind. These examples tell us what each kind is like: Dogs bark, have four legs, a tail. Cats meow, have four legs and a tail. Horses neigh, have four legs and a tail. Ants make no sound, have six legs, no tail. Robins and eagles both have two legs, wings, and a tail; robins sing while eagles cry. Dolphins have fins, a tail, and no legs; likewise for goldfish. Centipedes have a hundred legs, no tail and make no sound. And so on. Each of these generalizations or prototypes may be inferred from seeing several examples of the species.
But we also learn about what kinds of creatures are like in general. It seems that certain kinds of properties of animals are characteristic of a particular kind: either every individual of a kind has this property, or none of them have it. Characteristic properties include number of legs, having a tail or not, and making some kind of sound. If one individual in a species has four legs, or six or two or eight or a hundred legs, essentially all individuals in that species have that same number of legs (barring injury, birth defect or some other catastrophe). Other kinds of properties don't pattern in such a characteristic way. Consider external color. Some kinds of animals are homogeneous in coloration, such as dolphins, elephants, sparrows. Others are quite heterogeneous in coloration: dogs, cats, goldfish, snakes. Still others are intermediate, with one or a few typical color patterns: horses, ants, eagles, worms.
This abstract knowledge about what animal kinds are like can be extremely useful in learning about new kinds of animals. Just one example of a new kind may suffice to infer the prototype or characteristic features of that kind: seeing a spider for the first time, and observing that it has eight legs, no tail and makes no sound, it is a good bet that other spiders will also have eight legs, no tail and make no sound. The specific coloration of the spider, however, is not necessarily going to generalize to other spiders. Although a basic statistics class might tell you that only by seeing many instances of a kind can we learn with confidence what features are constant or variable across that kind, both intuitively and empirically in children's cognitive development it seems that this "one-shot learning" is more the norm. How can this work? Hierarchical models show us how to formalize the abstract knowledge that enables one-shot learning, and the means by which that abstract knowledge is itself acquired (Kemp, Perfors and Tenenbaum, Developmental Science 2007).
We can study a simple version of this phenomenon by modifying our bags of marbles example, adding more structure to the hierarchical model as follows. We now have two higher-level parameters: phi describes the expected proportions of marble colors across bags of marbles, while alpha, a real number, describes the strength of the learned prior -- how strongly we expect any newly encountered bag to conform to the distribution for the population prototype phi. For instance, suppose that we observe that bag-1 consists of all blue marbles, bag-2 consists of all green marbles, bag-3 all red, and so on. This doesn't tell us to expect a particular color in future bags, but it does suggest that bags are very regular -- that all bags consist of marbles of only one color.
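The generative side of this model can be sketched in Python as follows; the particular priors on phi and alpha are illustrative choices, not necessarily the ones used in the Church code:

```python
import random

COLORS = ['blue', 'green', 'red', 'black', 'orange']

def dirichlet(alphas):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

# Population prototype phi over colors, and regularity parameter alpha.
phi = dirichlet([1.0] * len(COLORS))
alpha = random.gammavariate(2.0, 2.0)  # illustrative vague prior on alpha

# Each bag's mixture is Dirichlet with pseudo-counts alpha * phi:
# small alpha -> each bag concentrates on one or two colors;
# large alpha -> each bag closely mirrors the population prototype phi.
bag_mixture = dirichlet([alpha * p for p in phi])
print(alpha, bag_mixture)
```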
This model uses the gamma distribution as a prior on the regularity parameter. Gamma is a useful continuous distribution on the non-negative numbers; here are some examples of Gamma with different parameter values:
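As a quick check on what those parameters mean, the following Python snippet estimates the mean of Gamma(shape, scale) by sampling; the mean is shape × scale:

```python
import random

# Empirical means of Gamma(shape, scale) for a few parameter settings.
for shape, scale in [(0.5, 1.0), (1.0, 1.0), (2.0, 2.0), (9.0, 0.5)]:
    samples = [random.gammavariate(shape, scale) for _ in range(20000)]
    print(shape, scale, sum(samples) / len(samples))  # close to shape * scale
```

Small shape values put substantial mass near zero, which matters below when the model infers whether alpha is less than or greater than 1.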
We have queried on the mixture of colors in a fourth bag, for which only one marble has been observed (orange). What we see is a very strong posterior predictive distribution focused on orange. This posterior is much stronger than the single observation for that bag can justify on its own. Instead, it reflects the learned overhypothesis that bags tend to be uniform in color.
To see that this is real one-shot learning, contrast with the "prior" predictive distribution for bag-n, where we have made no observations. You should see the predictive for bag-n collapse to a mostly flat distribution, since little has been learned in the hierarchical model about the specific colors represented in the overall population. Rather we have learned the abstract property that bags of marbles tend to be uniform in color, such that a single observation from a new bag is enough to make strong predictions about that bag even though little could be said prior to seeing the first observation.
The above code shows a histogram of the inferred values of alpha (actually, its log value), representing how strongly the learned distribution in phi constrains each individual bag as a learned prior -- how much each individual bag is expected to look like the prototype of the population. You should see that the inferred values of alpha are typically significantly less than 1 (log alpha less than 0). This means roughly that the learned prototype in phi should exert less influence on prototype estimation for a new bag than a single observation would. Hence the first observation we make for a new bag largely determines a strong inference about what that bag is like.
Now change the conditioning statement (the data) in the above code example as follows:
(and (equal? (draw-marbles 'bag-1 6) '(blue red green black red blue))
     (equal? (draw-marbles 'bag-2 6) '(green red black black blue green))
     (equal? (draw-marbles 'bag-3 6) '(red green blue blue black green))
     (equal? (draw-marbles 'bag-4 1) '(orange)) )))
Intuitively, the observations for bags one, two and three should now suggest a very different overhypothesis: that marble color, instead of being homogeneous within bags but variable across bags, is instead variable within bags to about the same degree that it varies in the population as a whole. We can see this inference represented via two coupled effects. First, the inferred value of alpha is now significantly greater than 1 (log value greater than 0), asserting that the population distribution as a whole, phi, now exerts a strong constraint on what any individual bag looks like. Second, for a new 'bag-4 which has been observed only once, with a single orange marble, that draw is now no longer very influential on the color distribution we expect to see from that bag; the broad distribution in phi exerts a much stronger influence than the single observation.
Hierarchical Abstraction versus Lambda Abstraction
The Blessing of Abstraction
(This section is undergoing "renovation".)
Now let's investigate the relative learning speeds at different levels of abstraction. Suppose that we have a number of bags that are all identical. They mix black, blue and green balls in proportion 5:4:3, with no orange or red balls. But the learner doesn't know this. She observes only two balls from each of several bags. What can she learn about each individual bag, versus the population as a whole?
You should see that the overall prototype learned is reasonably close to the true distribution in the world (black, blue, green in proportions 5:4:3), as reflected in the distribution of colors across all the bags. This is also what the prototype for each bag should be, yet what the learner can infer about each individual bag's prototype may be far from this, because it is based mostly on the two draws she happens to see from that particular bag.
Explore the learning curve, adding more bags with just two marbles observed in the same overall proportions, or adding more draws from each bag. When there are many bags drawn from the same underlying distribution but each is observed sparsely, you should find that the overall prototype can be learned quite well even while the specific prototypes still have a fair amount of uncertainty or distance from their true predictive distribution.
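A small Python simulation (with illustrative numbers of bags and draws) makes the asymmetry concrete: the pooled counts pin down the population proportions well before any single bag's two draws do:

```python
import random
from collections import Counter

random.seed(1)
COLORS = ['black', 'blue', 'green']
TRUE = [5/12, 4/12, 3/12]   # every bag mixes black:blue:green as 5:4:3

n_bags, draws_per_bag = 50, 2
bags = {i: random.choices(COLORS, weights=TRUE, k=draws_per_bag)
        for i in range(n_bags)}

# Pooled view (the abstract level): 100 draws, close to the true proportions.
pooled = Counter(c for obs in bags.values() for c in obs)
print({c: pooled[c] / (n_bags * draws_per_bag) for c in COLORS})

# Specific view: any single bag is seen only twice, so its estimate is poor --
# two draws cannot distinguish 5:4:3 from many alternative mixtures.
print(bags[0])
```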
Going back to our familiar categorization example, this suggests that a child could be quite confident in the prototype of "dog" while having little idea of the prototype for any specific kind of dog -- learning more quickly at the abstract level than at the specific level, but then using this abstract knowledge to constrain expectations about the specific level. To see an example of this with marbles, try replacing the conditioning statements in the box above with the following, a version of the basic-level prototype abstraction example from above but with sparser observations for each bag.
(and (equal? (draw-marbles 'bag-1 2) '(blue black))
     (equal? (draw-marbles 'bag-2 2) '(blue green))
     (equal? (draw-marbles 'bag-3 2) '(blue orange))
     (equal? (draw-marbles 'bag-4 2) '(blue red)) )))
Here the learned overall prototype shows a reasonably clear tendency towards blue over other colors, but there is significantly more variability in the distribution inferred for each individual bag's prototype, reflecting the particular colors that happened to be observed in two random draws.
You can see the same learning curve with the example of an overhypothesis about a superordinate class. Try the following conditioning data in the above box:
(and (equal? (draw-marbles 'bag-1 1) '(blue))
     (equal? (draw-marbles 'bag-2 2) '(green green))
     (equal? (draw-marbles 'bag-3 2) '(orange orange))
     (equal? (draw-marbles 'bag-4 2) '(red red))
     (equal? (draw-marbles 'bag-5 2) '(black black)) )))
Here the learner can infer that bags are homogeneous and apply that abstract knowledge to constrain generalization for a bag that has only been observed once. That is, the learned prototype for bag-1 is strongly blue, even though we have just one example of a blue draw from that bag. The abstract constraint is inferred across several bags that have themselves only been observed twice each: each bag on its own provides only weak evidence that bags are homogeneous but across bags there is strong evidence for the overhypothesis.
In machine learning one often talks of the curse of dimensionality. The curse of dimensionality refers to the fact that as the number of parameters of a model increases (i.e. the dimensionality of the model increases), the size of the hypothesis space increases exponentially. This increase in the size of the hypothesis space leads to two related problems. The first is that the amount of data required to estimate model parameters (i.e. the sample complexity) increases rapidly as the dimensionality of the hypothesis space increases. The second is that the amount of computational work needed to search the hypothesis space also rapidly increases. Thus, increasing model complexity by adding parameters can result in serious problems for inference.
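A one-line illustration of this exponential growth: with k possible values per parameter and d parameters, the hypothesis space has k to the power d points (the numbers here are arbitrary):

```python
# If each of d parameters takes one of k values, the hypothesis space has
# k ** d points; both the data and the search effort needed to distinguish
# hypotheses grow with this quantity.
k = 10
for d in [1, 2, 5, 10, 20]:
    print(d, k ** d)
```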
On the other hand, we have seen that adding additional levels of structure in a Bayesian model can make it possible to learn more with fewer observations. This happens because learning at the abstract level can be quicker than learning at the specific level. Because this ameliorates the curse of dimensionality, we refer to these effects as the blessing of abstraction.
In general, the blessing of abstraction can be surprising because our intuitions often suggest that adding more hierarchical levels to a model increases the model's complexity. More complex models should be harder to learn, rather than easier. On the other hand, it has long been understood in cognitive science that learning is made easier by the addition of constraints on possible hypotheses. For instance, proponents of universal grammar have long argued for a highly constrained linguistic system on the basis of learnability. Their theories often have an explicitly hierarchical flavor. Hierarchical Bayes can be seen as a way of introducing soft, probabilistic constraints on hypotheses that allow for the transfer of knowledge between different kinds of observations.
Example: The Shape Bias
Example: Causal Reasoning
Example: X-Bar Theory
(This example comes from an unpublished manuscript by O'Donnell, Goodman, and Katzir)
One of the central problems in generative linguistics has been to account for the ease and rapidity with which children are able to acquire their language from noisy, incomplete and sparse data. One suggestion as to how this can happen is that the space of possible natural languages varies parametrically. The idea is that there are a number of higher-order constraints on structure that massively reduce the complexity of the learning problem. Each constraint is the result of a parameter taking on one of a small set of values. (Thus the phrase "principles and parameters".) The child need only see enough data to set these parameters, and the details of construction-specific structure will then generalize across the rest of the constructions of the language.
One example comes from the realm of X-bar theory and headedness. X-bar theory provides a hierarchical model for phrase structure. All phrases are built out of the same basic template:
- <math> XP \longrightarrow Spec X'</math>
- <math> X' \longrightarrow X Comp</math>
Where <math>X</math> is a lexical category such as <math>N</math> (noun), <math>V</math> (verb), etc. The proposal is that all phrase types have the same basic "internal geometry." They have a head -- the word of type <math>X</math>. They also have a specifier (<math>Spec</math>) and a complement (<math>Comp</math>); the complement is more closely associated with the head than the specifier is. The set of categories that can appear as complements and specifiers for a particular category of head is usually thought to be specified by universal grammar (but may also vary parametrically).
An important way in which languages vary is the order in which heads appear with respect to their complements (and specifiers). Within a language there tends to be a dominant order, often with exceptions for some category types. For instance, English is primarily a head-initial language. In verb phrases, for example, the direct object (complement noun phrase) of a verb appears to the right of the head. However, there are exceptional cases such as the order of (simple) adjective and nouns: adjectives appear before the noun rather than after it (although more complex complement types such as relative clauses appear after the noun).
The fact that languages show consistency in head directionality could be of great advantage to the learner: after encountering a relatively small number of phrase types and instances, the learner of a consistent language could learn the dominant head direction in their language, transferring this knowledge to new phrase types. The fact that many languages contain exceptions suggests the need for a flexible way of inferring headedness.
The following ChurchServ window shows a highly simplified model of X-Bar structure.
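The ChurchServ code is not reproduced here; as a loose Python sketch of the same idea (the categories, names, and strength constant are invented for illustration), a global head-direction bias constrains how each phrase category orders its head and complement:

```python
import random

CATEGORIES = ['N', 'V', 'A', 'P']

def sample_language():
    """Sketch of an X-bar-style headedness model: a global head-initial
    probability theta, with each category's own head-first probability
    drawn around it via a Beta distribution."""
    theta = random.random()  # global probability of head-initial order
    strength = 10.0          # how tightly categories follow the global bias
    cat_bias = {c: random.betavariate(strength * theta + 1e-6,
                                      strength * (1 - theta) + 1e-6)
                for c in CATEGORIES}

    def phrase(head, comp):
        # Order head and complement according to the head category's bias.
        return (head, comp) if random.random() < cat_bias[head] else (comp, head)

    return theta, phrase

theta, phrase = sample_language()
print([phrase('V', 'N') for _ in range(5)])
```

Conditioning such a model on observed phrase orders would pull theta toward the dominant direction, which is how the headedness of a rare phrase type can be predicted from other phrase types.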
First try increasing the number of copies of (D N) observed. What happens? Now try changing the data to '((D N) (T V) (V Adv)). What happens if you condition on additional instances of (V Adv)? How about (Adv V)?
What we see in this example is a simple probabilistic model capturing a version of "principles and parameters" (in this case the principle is the X-bar schema and the parameter is headedness of phrases). Because it is probabilistic, systematic inferences will be drawn despite exceptional sentences or even phrase types. More importantly, due to the blessing of abstraction, the overall headedness of the language can be inferred from very little data—before the learner is very confident in the headedness of individual phrase types.