Inference in Structured Models

To return to the top level: Probabilistic Models of Cognition Tutorial.

By Timothy J. O'Donnell, Noah D. Goodman, Andreas Stuhlmueller, and the Church Working Group.

So far we have seen how we can build probabilistic generative models using Church and talked about some of the kinds of probability distributions we can define with this approach. We have also seen how we can do hypothetical reasoning with these models—including hypothetical reasoning over real-world data—using conditioning. In this section we will explore some interesting and unintuitive phenomena that arise when doing this kind of probabilistic inference.

Independence in the Generative Model and Conditional Model: Explaining Away

We saw earlier how our generative models explicitly encode our conditional independence assumptions about the causal processes that generate data. The generative model is a model of our knowledge about how the world works; since many things in the world are causally unrelated, it necessarily involves many independencies. In cognitive science models generally, and in linguistics in particular, this kind of independence is often reflected in a certain modularity in the system. For instance, in generative models of language it is often assumed that phonology and semantics don't interact directly, but only through the syntax.

There is an important phenomenon in probabilistic inference known as explaining away.[1] The gist of explaining away is that variables that are independent in a generative model often become dependent when conditioning. The textbook example of the phenomenon arises in the sprinkler problem, which we saw earlier in the tutorial. For convenience, the Bayes' net corresponding to the sprinkler problem is repeated here.

Church-sprinkler2.png

Note that we have changed some of the probabilities from the earlier example. Let's look at the joint prior distribution over rain and sprinkler.

{{#churchserv:

(define samples 
  (mh-query
   100 100
   
   (define rain (flip .2))
   (define sprinkler (flip .2))
   (define wet (if (or rain sprinkler) true (flip .05)))
   
   (list rain sprinkler)
  
   true
   )
  )
(hist samples "(Rain, Sprinkler)")

}}

Now suppose that you observe one morning that the lawn is wet. What happens to this joint distribution?

{{#churchserv:

(define samples 
  (mh-query 
   100 100
   
   (define rain (flip .2))
   (define sprinkler (flip .2))
   (define wet (if (or rain sprinkler) true (flip .05)))
   
   (list rain sprinkler)
  
   wet
   )
  )
(hist samples "(Rain, Sprinkler)")

}}


As can be seen from the output of this query, the posterior probability of both rain and sprinkler increases when we condition on the grass being wet—in accord with intuition. Now assume furthermore that someone tells you the sprinkler system is broken (so it cannot have run). What is the posterior distribution on rain now?

{{#churchserv:

(define samples 
  (mh-query 
   100 100
   
   (define rain (flip .2))
   (define sprinkler (flip .2))
   (define wet (if (or rain sprinkler) true (flip .05)))
   
   rain
   
   (and wet (not sprinkler))
   )
  )
  (hist samples "Rain")

}}

The probability of rain is now much higher than before, again as we expect. What happens if instead we condition on the sprinkler having run for sure?

{{#churchserv:

(define samples 
  (mh-query 
   100 100
   
   (define rain (flip .2))
   (define sprinkler (flip .2))
   (define wet (if (or rain sprinkler) true (flip .05)))
   
   rain
   
   (and wet sprinkler)
   )
  )
  (hist samples "Rain")

}}

The probability of rain returns to its prior probability. This latter effect is what gives "explaining away" its name. If we are sure that the sprinkler has run, then it explains the grass being wet. The probability of it having rained in addition returns to its prior probability. This is a non-monotonic inference: adding additional information ("sprinkler") has decreased our degree of belief in "rain"—such inferences are impossible in classical logic.

Another example of such a phenomenon comes from vision. In vision, the luminance of a surface depends on two factors: the illumination of the surface (how much light is hitting it) and its reflectance. The actual luminance is the product of the two factors, so luminance is inherently ambiguous. The visual system has to determine what proportion of the luminance is due to reflectance and what proportion is due to the illumination of the scene. This ambiguity underlies a famous illusion, the checker shadow illusion, discovered by Ted Adelson.

Checkershadow illusion small.png

The illusion results from the fact that in the image above both the square labeled A and the square labeled B are actually the same shade of gray. This can be seen in the figure below where they are connected by solid gray bars on either side.

Checkershadow proof small.png

What is happening here is that the presence of the cylinder provides evidence that the illumination of square B is actually less than that of square A. We therefore perceive square B as having a higher reflectance, since its luminance is identical to that of square A even though we believe less light is hitting it. The following program implements a simple version of this scenario "before" we see the shadow cast by the cylinder.

{{#churchserv:

;; noisy=: condition on target and value being equal, up to Gaussian noise with the given variance
(define (noisy= target value variance)
  (= 0 (gaussian (- target value) variance)))

(define samples 
  (mh-query 
   100 100
   
   (define reflectance (gaussian 1 1))
   (define illumination (gaussian 3 0.5))
   (define luminance (* reflectance illumination))
   
   reflectance
   
   (noisy= 3.0 luminance 0.1)
  ))

(truehist samples "Reflectance")

}}

Now let's condition on the presence of the cylinder.

{{#churchserv:

(define (noisy= target value variance)
  (= 0 (gaussian (- target value) variance)))

(define samples 
  (mh-query 
   100 100
   
   (define reflectance (gaussian 1 1))
   (define illumination (gaussian 3 0.5))
   (define luminance (* reflectance illumination))
   
   reflectance
 
   (and (noisy= 3.0 luminance 0.1) (noisy= 0.5 illumination 0.1))
  ))

(truehist samples "Reflectance")

}}

Conditional inference takes into account all the different paths through the generative process that could have generated the data. Two variables on different causal paths can thus become dependent once we condition on the way the data came out. The important point is that the variables reflectance and illumination are independent in the generative model, but after we condition on luminance they become dependent: changing one of them affects the probability of the other. This phenomenon has important consequences for cognitive science modeling. Although our models of our knowledge of the world and of language have a certain kind of modularity implied by conditional independence, as soon as we start using a model to do conditional inference on some data (e.g., parsing, or learning the language), formerly modularly isolated variables can become dependent.

We can express the general phenomenon with the following schematic Church query.

(query 
  '(define a (...))               
  '(define b (...))
  ...
  '(define data (... a... b...))
  'b                               
  '(and (equal? data some-value) (equal? a some-other-value) ))

Here we have defined two independent variables a and b, both of which are used to define the value of the data. If we condition on the data and on a, the posterior distribution on b will now depend on the value of a.
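
As a concrete, if artificial, instantiation of this schema (the particular flips and the or-based data variable are our choices, not part of the original example), we can condition on the data coming out true while a came out false. Since a did not produce the data, b must have:

{{#churchserv:

(define samples 
  (mh-query 
   100 100
   
   (define a (flip 0.5))
   (define b (flip 0.5))
   (define data (or a b))
   
   b
   
   (and (equal? data true) (equal? a false))
   )
  )
(hist samples "b, given data and (not a)")

}}

Before conditioning, b is true half the time; after conditioning on the data and on a being false, b is true in every sample, so learning the value of a has changed our beliefs about b.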

Bayesian Occam's Razor

A second phenomenon that arises in the context of Bayesian inference is known as Bayesian Occam's razor. Bayesian Occam's razor refers to the fact that "more complex" hypotheses about the data are penalized automatically. The notion of complexity here is "flexibility": a hypothesis which is flexible enough to generate many different observations is more complex, and will be less likely. Because more complex hypotheses describe a greater variety of data sets, they have to assign a lower probability to each one—a consequence of the law of conservation of belief. When we condition on some data, all else being equal, the posterior distribution over the hypotheses will favor the simpler ones because they have the "tightest" fit to the observations.

A simple case of Bayes Occam's razor comes from the size principle: of the many hypotheses consistent with the data, the smallest rapidly become the most probable as more data points are observed. The following ChurchServ box demonstrates this with a very simple model. Here we have two hypothesized sets of symbols: A has 6 elements and B has 3 elements. We choose one of the hypotheses at random and sample some number of symbols from it uniformly.


{{#churchserv:

(define samples
  (mh-query
   100 100
   (define (hypothesis->set  hyp)
      (cond ((equal? hyp  'A) '(a b c d e f))
            ((equal? hyp  'B) '(a b c))))
   
   (define hypothesis (if (flip) 'A 'B))
   (define (observations N) (repeat N (lambda () (uniform-draw (hypothesis->set hypothesis)))))
   
   hypothesis
   
   (equal? (observations 1) '(a))
   )
  )
(hist samples "Size Principle")

}}

With a single observed a we already favor hypothesis B. What happens when we increase the amount of data we condition on?

{{#churchserv:

(define samples
  (mh-query
   100 100
   (define (hypothesis->set  hyp)
      (cond ((equal? hyp  'A) '(a b c d e f))
            ((equal? hyp  'B) '(a b c))))
   
   (define hypothesis (if (flip) 'A 'B))
   (define (observations N) (repeat N (lambda () (uniform-draw (hypothesis->set hypothesis)))))
   
   hypothesis
   
   (equal? (observations 3) '(a b a))
   )
  )
(hist samples "Size Principle")

}}

As can be seen from this example, as the number of data points increases, hypothesis-B rapidly comes to dominate the posterior distribution. Why does this happen? We sample observations uniformly from hypotheses; the law of conservation of belief tells us that under these circumstances the probability of a draw from hypothesis-A is <math>1/6</math>, while the probability of a draw from hypothesis-B is <math>1/3</math>. Thus the probability of drawing a set of observations from hypothesis-A is <math>1/6</math> raised to the number of observations, while the probability of drawing a set of observations from hypothesis-B is <math>1/3</math> raised to the number of observations. Clearly the latter number decreases much more slowly with the number of observations than the former. The posterior distribution over hypotheses is given by:

<math>P(hypothesis | observations) \propto P(observations|hypothesis)P(hypothesis)</math>

Because our hypotheses are equally likely, this simplifies to

<math>P(hypothesis | observations) \propto P(observations|hypothesis)</math>

So we see that the posterior distribution over hypotheses in this case is just the normalized likelihood of the observations given the hypotheses. Because hypothesis-B is "simpler" it quickly dominates the posterior.
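
Concretely, for the three observations (a b a) above: since the two hypotheses are equally likely a priori, the posterior odds equal the likelihood ratio, <math>(1/3)^3 / (1/6)^3 = 8</math>, so hypothesis-B receives posterior probability <math>8/9 \approx 0.89</math> and hypothesis-A only <math>1/9</math>.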

We note that the size principle is related to an influential idea in linguistics known as the subset principle. Intuitively, the subset principle suggests that when two grammars both account for the same data, the grammar that generates the smaller language should be preferred.[2]

One way to understand Bayes Occam's razor is that due to the law of conservation of belief probabilistic inference takes into account implicit negative evidence. More complex hypotheses put probability mass on more observations. The absence of some of these possible observations from the data set can be taken as evidence against the more complex hypothesis. The important point about Bayes Occam's razor is that the prior distribution on hypotheses does not have to penalize complexity. The complexity of the hypothesis itself will lead to its being disfavored in the posterior distribution.

In the example above we illustrated Bayes Occam's razor with examples based strictly on the sizes of the hypotheses involved; however, the principle is more general. Bayes Occam's razor says that, all else being equal, the hypothesis that assigns the highest likelihood to the data will dominate the posterior. Consider the following example.

{{#churchserv:

(define samples 
  (mh-query 
   100 100
   
   (define (hypothesis->parameters  hyp)
      (cond ((equal? hyp  'A) (list '(a b c d) '(0.125 0.125 0.375 0.375)))
            ((equal? hyp  'B) (list '(a b c d) '(0.375 0.375 0.125 0.125)))))	   
   (define hypothesis  (if (flip) 'A 'B))   
   (define observations (repeat 5 (lambda () (apply multinomial (hypothesis->parameters hypothesis))))) 
      
   hypothesis
   
   (equal? observations '(a b b c d))
   )
  )
(hist samples "Bayes-Occam-Razor")

}}

In this example, both hypotheses have the same support; however, they differ in which elements of the support they assign higher weight to. The data set we conditioned on contains exemplars of every element in the support of the two hypotheses. However, because more of the observations are elements to which hypothesis B assigns high probability (three of the five observations are a or b), hypothesis B is favored in the posterior. In the following example we reverse the conditioner so that hypothesis A is favored.


{{#churchserv:

(define samples 
  (mh-query 
   100 100
      
   (define (hypothesis->parameters  hyp)
      (cond ((equal? hyp  'A) (list '(a b c d) '(0.125 0.125 0.375 0.375)))
            ((equal? hyp  'B) (list '(a b c d) '(0.375 0.375 0.125 0.125)))))	   
   (define hypothesis  (if (flip) 'A 'B))   
   (define observations (repeat 5 (lambda () (apply multinomial (hypothesis->parameters hypothesis)))))
      
   hypothesis
   
   (equal? observations '(d c d a b))
   )
  )
(hist samples "Bayes-Occam-Razor")

}}

These last examples suggest another way to understand Bayes Occam's razor: the posterior distribution will favor hypotheses for which the data set is simpler, in the sense that it is more "easily generated." Here, more "easily" generated means generated with higher probability. We will see a more striking example of this for compositional models at the end of this section of the tutorial.

Hierarchical Models and the Blessing of Abstraction

Human knowledge is organized hierarchically into levels of abstraction. For instance, a basic-level category ("dog") is more abstract than a specific-level category ("poodle" or "dalmatian"). Some of the deepest questions of cognitive development are: How does abstract knowledge influence the learning of specific knowledge? And (how) can abstract knowledge itself be learned? In this section we explore the dynamics of learning in hierarchically structured models.

The phenomena discussed in this section often go under a number of different headings: learning to learn, learning inductive biases, or transfer learning. The intuitive idea is that higher levels of structure can constrain lower levels: a small amount of data can tell the learner which part of the higher-level space they are in, which can in turn lead to faster learning of lower-level structure.

For instance, imagine that you are learning, de novo, about dogs. What you actually observe are exemplars of specific kinds of dogs—a few poodles, a few dalmatians, etc. From this you must infer the prototypical dog, and also the prototypes of each specific kind of dog. These are related: the prototype of dog tells you what to expect for the prototypes of poodle and dalmatian. Does it matter whether one learns the prototype of dog, or can one just learn the prototypes of poodle, dalmatian, etc., directly?

As a simplification of this situation consider the following generative process. We will draw marbles out of several different bags. There are two colors of marble: red and blue. Each bag has a certain "prototypical" mixture of red and blue marbles, and 10 marbles are sampled from each bag according to its color mixture. This generative process is represented in the following Church example (this example uses the Dirichlet distribution).

{{#churchserv:

(define bag->prototype
  (mem
   (lambda (bag)
     (dirichlet '(1 1)))))

(define (draw-marbles bag num-draws)
  (repeat num-draws
          (lambda () (multinomial '(red blue) (bag->prototype bag)))))

(list (draw-marbles 'bag-1 10) (draw-marbles 'bag-2 10))

}}

This generative model describes the "prototype" mixtures in each bag, but without a common higher-order prototype. (I.e. it represents "poodle" and "dalmatian" without "dog".) What happens if you condition this model on the observation that several draws from the first bag result in 'red? Do several draws from the second bag also result in 'red? How does this change with the number of observations? And what happens if you ask about the prototype of a new bag, bag-3, given these observations?
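
One way to explore these questions is with a query along the following lines (the particular number of observations and the choice to query a single draw from bag-2 are our choices). Because each bag has its own independently drawn prototype in this model, the draws from bag-1 should tell you essentially nothing about bag-2:

{{#churchserv:

(define samples 
  (mh-query 
   100 100
   
   (define bag->prototype
     (mem
      (lambda (bag)
        (dirichlet '(1 1)))))
   
   (define (draw-marbles bag num-draws)
     (repeat num-draws
             (lambda () (multinomial '(red blue) (bag->prototype bag)))))
   
   (first (draw-marbles 'bag-2 1))
   
   (equal? (draw-marbles 'bag-1 5) '(red red red red red))
   )
  )
(hist samples "Draw from bag-2, given 5 red draws from bag-1")

}}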

Now let us introduce another level of abstraction: a global prototype that provides a prior on the specific prototype mixtures of each bag.

{{#churchserv:

(define prototype (dirichlet '(1 1)))

(define bag->prototype
  (mem
   (lambda (bag)
     (dirichlet prototype))))

(define (draw-marbles bag num-draws)
  (repeat num-draws
          (lambda () (multinomial '(red blue) (bag->prototype bag)))))

(list (draw-marbles 'bag-1 10) (draw-marbles 'bag-2 10))

}}

What happens if you condition this model on the same examples as above? What does it infer about the prototype of a new bag?
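
For instance, here is one way to ask about a new bag, bag-3, after observing five red draws from bag-1 (again, the specific data and query are our choices). With the shared higher-level prototype, evidence about bag-1 now transfers to the unobserved bag:

{{#churchserv:

(define samples 
  (mh-query 
   100 100
   
   (define prototype (dirichlet '(1 1)))
   
   (define bag->prototype
     (mem
      (lambda (bag)
        (dirichlet prototype))))
   
   (define (draw-marbles bag num-draws)
     (repeat num-draws
             (lambda () (multinomial '(red blue) (bag->prototype bag)))))
   
   (first (draw-marbles 'bag-3 1))
   
   (equal? (draw-marbles 'bag-1 5) '(red red red red red))
   )
  )
(hist samples "Draw from a new bag-3, given 5 red draws from bag-1")

}}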

You should have found that adding an additional level of abstraction has resulted in faster learning, and in the ability to make guesses about new bags without observing any example from that bag. This is often called learning to learn or transfer learning.

Now let's investigate the relative learning speeds at different levels of abstraction.

{{#churchserv:

(mh-query
 100 100

 (define global-prototype (dirichlet '(1 1)))
 (define bag->prototype
   (mem
    (lambda (bag)
      (dirichlet global-prototype))))
 (define (draw-marbles bag num-draws)
   (repeat num-draws
           (lambda () (multinomial '(red blue) (bag->prototype bag)))))

 (list global-prototype (bag->prototype 'bag-1))

 (and (equal? (draw-marbles 'bag-1 3) '(red red red))
      (equal? (draw-marbles 'bag-2 2) '(red red)))
 )

}}

There are several things to vary in this setup: How many samples are observed overall? How many samples are observed from each bag? In particular, what happens when there are many different bags, but only one sample is observed from each bag? You should find that the overall prototype is learned while the specific prototypes still have a fair amount of uncertainty. Going back to our example of kinds of dogs, this suggests that a child could be quite confident in the prototype of "dog" while having little idea of the prototype for any specific kind of dog—learning more quickly at the abstract level than at the specific level, but then using this abstract knowledge to constrain expectations about the specific level.
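
Here is a sketch of the many-bags case just described (five bags with a single red draw each; the particular numbers are our choice):

{{#churchserv:

(mh-query
 100 100

 (define global-prototype (dirichlet '(1 1)))
 (define bag->prototype
   (mem
    (lambda (bag)
      (dirichlet global-prototype))))
 (define (draw-marbles bag num-draws)
   (repeat num-draws
           (lambda () (multinomial '(red blue) (bag->prototype bag)))))

 (list global-prototype (bag->prototype 'bag-1))

 (and (equal? (draw-marbles 'bag-1 1) '(red))
      (equal? (draw-marbles 'bag-2 1) '(red))
      (equal? (draw-marbles 'bag-3 1) '(red))
      (equal? (draw-marbles 'bag-4 1) '(red))
      (equal? (draw-marbles 'bag-5 1) '(red)))
 )

}}

Comparing the sampled values of global-prototype with those of (bag->prototype 'bag-1) should show the global prototype shifted toward red while the individual bag prototypes remain comparatively uncertain.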

In machine learning one often talks of the curse of dimensionality. The curse of dimensionality refers to the fact that as the number of parameters of a model increases (i.e. the dimensionality of the model increases), the size of the hypothesis space increases exponentially. This increase in the size of the hypothesis space leads to two related problems. The first is that the amount of data required to estimate model parameters (i.e. the sample complexity) increases rapidly as the dimensionality of the hypothesis space increases. The second is that the amount of computational work needed to search the hypothesis space also rapidly increases. Thus increasing model complexity by adding parameters can result in serious problems for inference. We have seen that adding additional levels of structure in a Bayesian model can make it possible to learn more with fewer observations. This happens because learning at the abstract level can be quicker than learning at the specific level. Because this ameliorates the curse of dimensionality, we refer to these effects as the blessing of abstraction.

So far we have looked at learning abstract knowledge describing the expected ratio of colors in bags of marbles. We can, in the same way, learn other abstract properties, such as the degree of regularity. For instance, suppose we observe that bag-1 consists of all red marbles and bag-2 consists of all blue marbles. This doesn't tell us much about whether to expect red or blue marbles in a future bag, but it does suggest that bags are very regular—that each bag consists of marbles of only one color. To capture this knowledge we extend the prototype with a degree of regularity, using the gamma distribution.

{{#churchserv:

(mh-query
 100 100

 (define prototype-mixture (dirichlet '(1 1)))
 (define prototype-regularity (gamma 2 2))
 (define prototype (map (lambda (w) (* prototype-regularity w)) prototype-mixture))
 (define bag->prototype
   (mem
    (lambda (bag)
      (dirichlet prototype))))
 (define (draw-marbles bag num-draws)
   (repeat num-draws
           (lambda () (multinomial '(red blue) (bag->prototype bag)))))

 (bag->prototype 'bag-3)

 (and (equal? (draw-marbles 'bag-1 3) '(red red red))
      (equal? (draw-marbles 'bag-2 4) '(blue blue blue blue)))
 )

}}

Here we have queried the mixture of colors in a third bag. What we see is that most sampled prototypes are heavily weighted toward either red or blue. Thus we have learned the abstract property that bags of marbles tend to be uniform in color.

Example: X-Bar Theory

One of the central problems in generative linguistics has been to account for the ease and rapidity with which children are able to acquire their language from noisy, incomplete, and sparse data. One suggestion as to how this can happen is that the space of possible natural languages varies parametrically. The idea is that there are a number of higher-order constraints on structure that massively reduce the complexity of the learning problem. Each constraint is the result of a parameter taking on one of a small set of values. The child need only see enough data to set these parameters, and the details of construction-specific structure then generalize across the rest of the constructions of their language.

One example of this sort of idea comes from the realm of X-bar theory and headedness. X-bar theory provides a hierarchical model for phrase structure. All phrases are built out of the same basic template:

<math>XP \longrightarrow Spec\ X'</math>
<math>X' \longrightarrow X\ Comp</math>

Where <math>X</math> is a lexical category such as <math>N</math> (noun), <math>V</math> (verb), etc. The proposal is that all phrase types have the same basic "internal geometry." They have a head—the word dominated by <math>X</math>. They also have a specifier (<math>Spec</math>) and a complement (<math>Comp</math>); the complement is more closely associated with the head than the specifier is. The set of categories that can appear as complements and specifiers for a particular category of head is usually thought to be specified by universal grammar (but may also vary parametrically).

An important way in which languages vary is the order in which heads appear with respect to their complements (and specifiers). Within a language there tends to be a dominant order, often with exceptions for some category types. For instance, English is primarily a head-initial language. In verb phrases, for example, the direct object (complement noun phrase) of a verb appears to the right of the head. However, there are exceptional cases, such as the order of (simple) adjectives and nouns: adjectives appear before the noun rather than after it (although more complex complement types, such as relative clauses, appear after the noun).

The fact that languages show consistency in head directionality could be of great advantage to the learner; after encountering a relatively small number of phrase types and instances, the learner of a consistent language could learn the dominant head direction in their language, transferring this knowledge to new phrase types. The fact that within many languages the system is not exception-free also suggests that a probabilistic approach might be profitable.

The following ChurchServ window shows a highly simplified model of X-Bar structure.

{{#churchserv:

(define (in-list? elem list)
  (if (null? list)
      false
      (if (equal? (first list) elem)
          true
          (in-list? elem (rest list)))))

(define (set-equal? set1 set2)
  (apply and (map (lambda (elem) (in-list? elem set2)) set1)))

(define targets '((D N)))

(define samples
  (mh-query
   100 100
   (define categories '(D N T V A Adv))
   
   (define right-pseudocount (beta 1 1))
   (define left-pseudocount (- 1 right-pseudocount))
   (define head->direction-bias
      (mem
       (lambda (head)
         (beta right-pseudocount left-pseudocount))))
   
   (define head->comp
      (lambda (head)
        (case head
          (('D) 'N)
          (('T) 'V)
          (('N) 'A)
          (('V) 'Adv)
          (('A) 'none)
          (('Adv) 'none)
          (else 'error))))
   
   (define generate-phrase
      (lambda (head)
        (let* ((comp (head->comp head)))
          (if (equal? comp 'none)
              (list head)
              (if (flip (head->direction-bias head))
                  (list comp head) 
                  (list head comp))))))
   
   (define (generate-random-phrase) (generate-phrase (uniform-draw categories)))
   
   (define observations (repeat (length targets) generate-random-phrase))
   
   (generate-phrase 'N)
   
    (set-equal? observations targets)
   )
  )

(hist samples "Xbar") }}

First try increasing the number of copies of (D N) observed. What happens? Now try conditioning on the data set '((D N) (D N) (D N) (D N) (D N) (T V) (T V) (A) (Adv)), as shown below. What happens if you condition on an additional instance of (V Adv)? How about (Adv V)?
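
For instance, one way to try the mixed data set is to change only the definition of targets in the program above and re-run the query; since the condition is (set-equal? observations targets) and the number of observed phrases is (length targets), no other change is needed:

(define targets '((D N) (D N) (D N) (D N) (D N) (T V) (T V) (A) (Adv)))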

Inference in Models with Unbounded Structure

We have seen a number of examples of the effects of probabilistic inference in models with fixed structure. It is also possible to do inference in complex models with unbounded structure. All of the effects described above will once again appear, and there are a few additional principles to consider.

We discussed Bayesian Occam's razor above. It can be seen as a preference for semantically simpler hypotheses: a hypothesis with less flexibility is preferred if it fits the data equally well. When dealing with hypotheses of unbounded representational complexity there is also a syntactic simplicity preference: hypotheses with simpler representations are preferred. This emerges in a generative model because more complex hypotheses require more random choices to generate. The probability of a particular computation in Church is the product of the probabilities of all the ERP evaluations made during that computation. All else being equal, large, complex structures involving more probabilistic choices (more ERP evaluations) will be less probable than small structures with few random choices.

To illustrate this effect, consider the following Church program, which induces an arithmetic function from examples.

{{#churchserv:

(define (random-arithmetic-expression)
  (if (flip 0.6)
      (if (flip) 'x (sample-integer 10))
      (list (uniform-draw '(+ -)) (random-arithmetic-expression) (random-arithmetic-expression))))

(define (procedure-from-expression expr)
  (eval (list 'lambda '(x) expr) (get-current-environment)))

(mh-query
 100 100

 (define my-expr (random-arithmetic-expression))
 (define my-proc (procedure-from-expression my-expr))

 my-expr

 (= (my-proc 1) 3)
 )

}}

In this example there are many extensionally equivalent ways to satisfy the condition, for instance 3 and (+ 1 2), but because the more complex expressions require more choices to generate, they are chosen less often.
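
To see why, we can compute the probability of each expression under random-arithmetic-expression directly. Generating the bare constant 3 requires taking the terminal branch, choosing an integer rather than x, and picking 3 out of ten integers: <math>0.6 \times 0.5 \times 0.1 = 0.03</math>. Generating (+ 1 2) requires the compound branch, an operator choice, and two constant subexpressions: <math>0.4 \times 0.5 \times 0.03 \times 0.03 = 1.8 \times 10^{-4}</math>. Among expressions satisfying the condition, the simpler one thus carries over a hundred times more prior probability and dominates the posterior samples.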

This query has a very "strict" condition: the function must give 3 when applied to 1. As the amount of data increases this strictness will make inference increasingly hard. We can ease inference by relaxing the condition, only requiring equality with high probability.

{{#churchserv:

(define (noisy= x y) (flip (expt 0.1 (abs (- x y)))))

(define (random-arithmetic-expression)
  (if (flip 0.6)
      (if (flip) 'x (sample-integer 10))
      (list (uniform-draw '(+ -)) (random-arithmetic-expression) (random-arithmetic-expression))))

(define (procedure-from-expression expr)
  (eval (list 'lambda '(x) expr) (get-current-environment)))

(mh-query
 100 100

 (define my-expr (random-arithmetic-expression))
 (define my-proc (procedure-from-expression my-expr))

 my-expr

 (and (noisy= (my-proc 1) 3)
      (noisy= (my-proc 3) 5))
 )

}}

This is an example of a very powerful technique in probabilistic programming: a difficult inference problem can often be relaxed into an easier problem by inserting a noisy operation. Such a relaxation has a parameter (the noise parameter), and various "temperature" techniques can be used to obtain samples from the original problem using samples from the relaxed problem. (The temperature techniques implemented in MIT-Church include parallel tempering, tempered transitions, and annealed importance sampling.)





  1. Pearl, J. Probabilistic Reasoning in Intelligent Systems, San Mateo: Morgan Kaufmann, 1988.
  2. The original formulation of the subset principle, named by Bob Berwick and based on a result by Dana Angluin, gives necessary and sufficient conditions for Gold style learnability of a class of languages. Essentially it states that a class of languages is learnable in the limit using this principle if every language in the class has a characteristic subset.