Hypothetical Reasoning and Observation: Conditioning
By Timothy J. O'Donnell, Noah D. Goodman, Andreas Stuhlmueller, and the Church Working Group.
We have built up a tool set for constructing probabilistic generative models. These can represent knowledge about causal processes in the world: running one of these programs samples a particular outcome by sampling a "causal history" for that outcome. However, the power of a causal model lies in the flexible ways it can be used to reason about the world. For instance, if we have a generative model in which X depends on Y, we may ask: "assuming I have observed X, what must Y have been?" In this section we describe how a wide variety of inferences can be made from a single generative model by conditioning the model on an assumed or observed fact.
Some of the inference problems of particular interest to linguistics are: given beliefs about universal grammar and some data from a particular language, what should we believe about that language's structure? This is simultaneously the problem facing the linguist and the problem facing the child learning a language. Another inference problem of interest to linguists is: given some beliefs about the structure of my language, and a sentence, what should I believe about the structure of that sentence? This is just the parsing problem. A final example of a linguistic inference problem is: given some beliefs about the syntactic structure of my language (and beliefs about others' beliefs about that), and a particular thought I want to express, how do I encode the thought syntactically? This is the production problem.
Hypothetical Reasoning with query
Suppose that we know some fixed fact, and we wish to hypothesize about how a generative model could give rise to it. In Church we can use a function called query with the following interface.
(query generative-model expression-we-are-interested-in conditions-that-must-be-met)
query takes three arguments. The first is some generative model expressed as a series of define statements. The second is an expression, called the query expression, which represents the aspects of the computation that we are interested in. The last argument is the condition that must be met; this may encode observations, data, or more general assumptions. It is called the conditioner.
Consider the following simple generative process.
{{#churchserv: (define A (if (flip) 1 0)) (define B (if (flip) 1 0)) (define C (+ A B)) C }}
This process samples two binary digits and adds them; the value of the final expression is either 0, 1, or 2. A priori, each of the variables A and B has probability .5 of being 1 and probability .5 of being 0. However, suppose that we have observed that C is equal to 2. How does this change the space of possible values that the variable A can take on? Clearly A must be equal to 1 for this result to occur. We can see this in the following Church query.
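Since the model is finite, we can check this reasoning by exact enumeration. The following is a small sketch in Python (not Church, which is used elsewhere in this tutorial) that tabulates the distribution over C implied by the model above; the variable names mirror the Church defines.

```python
from itertools import product
from collections import Counter

# Enumerate the four equally likely (A, B) settings and tally C = A + B.
dist = Counter()
for a, b in product([0, 1], repeat=2):
    dist[a + b] += 0.25  # each joint outcome has probability 1/2 * 1/2

# P(C=0) = 0.25, P(C=1) = 0.5, P(C=2) = 0.25.
# Conditioning on C = 2 leaves only the outcome (A=1, B=1),
# so A must equal 1 with probability 1.
```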
{{#churchserv: (define samples
  (repeated-mh-def-query 100 100
    '(define A (if (flip) 1 0))
    '(define B (if (flip) 1 0))
    '(define C (+ A B))
    'A
    '(= C 2)
    (get-current-environment)))
(hist samples "A") }}
Church has several implementations of query; here we have used repeated-mh-def-query, an implementation based on an efficient Markov chain Monte Carlo algorithm. We have passed it several arguments. The first two and the last we will ignore for now (they are control parameters for the underlying algorithm). The rest of the arguments are: the series of defines for the variables A, B, and C; after that the query-expression, an expression whose value we are interested in, in this case A; and after that the conditioner. The conditioner is the constraint that must be satisfied, i.e. a boolean expression that must evaluate to true. You can see from the histogram that the distribution over A, subject to the constraint that C is 2, puts probability 1 on A being equal to 1, as desired.
In order to understand query it is useful to look at a simple, inefficient, but correct implementation known as rejection sampling. Rejection sampling is an instance of the method of generate-and-test. The idea is simple. We just run our generative model in the forward direction. After we have sampled a value for every random variable in the program (including the query-expression) we compute the conditioner. If it is true we keep the sample, if it is false we throw it away. This can be represented by the following schematic Church code (for runnable code see Rejection Query).
(define (rejection-query defines query-expression conditioner environment)
  (eval defines environment)
  (if (equal? (eval conditioner environment) true)
      (eval query-expression environment)
      (rejection-query defines query-expression conditioner environment)))
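The same generate-and-test idea can be mirrored in Python for the A, B, C model above. This is an illustrative analogue, not Church's actual implementation; the function names `rejection_query` and `model` are our own.

```python
import random

def rejection_query(model, query_fn, condition_fn):
    # Run the model forward until the conditioner holds,
    # then return the value of the query expression.
    while True:
        trace = model()          # a dict of sampled variables
        if condition_fn(trace):
            return query_fn(trace)

def model():
    a = 1 if random.random() < 0.5 else 0
    b = 1 if random.random() < 0.5 else 0
    return {"A": a, "B": b, "C": a + b}

samples = [rejection_query(model,
                           lambda t: t["A"],       # query expression
                           lambda t: t["C"] == 2)  # conditioner
           for _ in range(100)]
# Every accepted sample has A = 1, matching the Church query above.
```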
Recall that a Church procedure implicitly denotes a distribution, which it samples from. The meaning of a query expression, as specified by rejection-query, can thus be thought of as the distribution over the value returned by the query-expression conditioned on the conditioner being true. We will see below that this is just the formal definition of conditional probability in probability theory.
rejection-query is sound but not efficient: even if we are sure that our model can satisfy the condition, it will often take an exponentially large number of samples to find computations that do so. The AI literature is replete with algorithms and techniques for searching non-deterministic spaces more efficiently, and several of these have been adapted into Church to give implementations of query that may be more efficient on various cases. Let us emphasize that, though they vary in efficiency, each of these implementations is universal: it samples from the appropriate conditional distribution for a Church query over any Church model.
query and Conditioning in Probability Theory
In probability theory if we have a joint distribution <math>P(A,B)</math>, we can define the conditional distribution of <math>A</math> given that <math>B</math> has the value <math>b_1</math> by the following formula.
- <math>P(A|B=b_1)=\frac{P(A,B=b_1)}{P(B=b_1)}</math>
Note that the denominator is equal to the probability of <math>B=b_1</math> in the marginal distribution <math>P(B)</math>.
- <math>P(A|B=b_1)=\frac{P(A,B=b_1)}{\sum_A P(A, B=b_1)}</math>
To understand what this means it is useful to refer back to our earlier example of a joint probability distribution.
| | <math>B=</math>true | <math>B=</math>false | <math>P(A)</math> |
|---|---|---|---|
| <math>A=</math>true | .1 | .3 | .4 |
| <math>A=</math>false | .5 | .1 | .6 |
| <math>P(B)</math> | .6 | .4 | |
When we condition on <math>B=b_1</math> we select the first column in the table above, which corresponds to all of the combinations in the joint distribution where the constraint holds. However, we are interested in the probability of <math>A</math> in this set of circumstances. This means that we need to renormalize so that the probabilities sum to one again. The normalization constant that makes these numbers sum to one is just the total probability mass present in that column, in other words the marginal probability <math>P(B=b_1)</math>.
| | <math>B=</math>true | <math>B=</math>false |
|---|---|---|
| <math>A=</math>true | .167 | .75 |
| <math>A=</math>false | .833 | .25 |
| <math>P(B)</math> | .6 | .4 |
The important thing to notice about the process of conditioning is that it simply renormalizes the joint probabilities in the part of the space where the condition is true. In other words, the relative strength of belief in the outcomes of <math>A</math> that are consistent with the conditioner stays the same: conditioning is just rescaling by a normalizing constant. This is an example of the law of conservation of belief: other than the changes that are implied by our conditioning statement, keep all of our other beliefs the same.
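The renormalization can be checked numerically against the tables above. The following Python sketch (the helper name `condition_on_b` is our own) conditions the joint distribution on a value of <math>B</math>.

```python
# The joint distribution P(A, B) from the table above.
joint = {(True, True): 0.1, (True, False): 0.3,
         (False, True): 0.5, (False, False): 0.1}

def condition_on_b(joint, b):
    # Select the entries consistent with B = b, then renormalize.
    selected = {a: p for (a, bb), p in joint.items() if bb == b}
    z = sum(selected.values())  # the marginal probability P(B = b)
    return {a: p / z for a, p in selected.items()}

post = condition_on_b(joint, True)
# P(A=true | B=true) = .1/.6 ≈ .167 and P(A=false | B=true) = .5/.6 ≈ .833,
# as in the conditional table above.
```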
Now consider rejection-query as described in the last section. Remember that it works by sampling from the generative model and throwing away those samples that don't satisfy the conditioner. The samples from the generative model are samples distributed according to the joint distribution over the variables specified by our Church program. By throwing away the samples that are inconsistent with the conditioner we are implicitly renormalizing using just that part of the space that matches the conditioner. In other words, rejection-query is an implementation of conditioning.
Bayes' Rule and query
Often presentations of Bayesian methods begin with a statement of Bayes' rule.
- <math>P(H|D=d) = \frac{P(D=d|H) P(H)}{\sum_{H} P(D=d|H) P(H)}</math>
Here <math>H</math> is a random variable ranging over hypotheses, <math>D</math> is a random variable ranging over data, and we condition on some particular instantiation of a data set. The left-hand side of this equation, <math>P(H|D=d)</math>, is called the posterior distribution over hypotheses given the data. <math>P(D=d|H)</math> is called the likelihood of the hypothesis, <math>P(H)</math> is called the prior distribution over hypotheses, and
- <math>P(D=d) = \sum_{H} P(D=d|H) P(H)</math>
is called the evidence or marginal likelihood of the data.
The importance of Bayes' rule lies in the fact that while generative models express the distribution <math>P(D|H) P(H)</math>, the induction, learning, or inference problem is usually expressed by the distribution <math>P(H|D)</math>. Bayes' rule gives us a formula linking these distributions with one another. In terms of query, Bayes' rule can be thought of in the following way.
(query
  '(define a ...)           ;marginalizing over many variables
  '(define b ... a ...)
  ...
  '(define o ... a ... b ...)
  'b                        ;hypothesis
  '(equal? o data))         ;data
The hypothesis <math>H</math> in Bayes' rule corresponds to whatever random variable is expressed in our query-expression. The data <math>D</math> is encoded in our conditioner. As discussed above a Church program may sample an unbounded number of random variables. Ignoring sampled variables is the Church equivalent of marginalization. Thus a Church query will marginalize away all the variables besides the one corresponding to the query expression. The statement of Bayes' rule above is deceptively simple. The description of Bayes' rule in terms of query shows that the hypothesis corresponds to whatever random variable you are interested in, and the data corresponds to whatever condition you wish to reason hypothetically about. The model that you use to do so can be arbitrarily complex.
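Bayes' rule can be worked through on a tiny example. The following Python sketch uses a hypothetical two-hypothesis model (not taken from this tutorial): a coin is either fair (<math>p=0.5</math>) or a trick coin (<math>p=0.9</math>), with a uniform prior, and the datum is a single "heads".

```python
prior = {"fair": 0.5, "trick": 0.5}       # P(H)
likelihood = {"fair": 0.5, "trick": 0.9}  # P(D=heads | H)

# The evidence (marginal likelihood): P(D=d) = sum_H P(D=d|H) P(H).
evidence = sum(prior[h] * likelihood[h] for h in prior)

# Bayes' rule: P(H | D=d) = P(D=d|H) P(H) / P(D=d).
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
# posterior["fair"] = 0.25 / 0.7 ≈ 0.357: a single heads shifts belief
# toward the trick coin.
```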
Bayes' rule is really just about conditioning, and follows from the definition of a conditional distribution together with the chain rule. The chain rule says that the joint distribution over a set of random variables <math>{R_1, ..., R_n}</math> can always be rewritten in the following way.
- <math>P(R_1, ..., R_n) = P(R_1)P(R_2 | R_1)P(R_3 | R_1, R_2) ... P(R_n|R_1,...R_{n-1}) </math>
In fact, the joint can be rewritten in this way using any ordering of the variables. To see that this is true, consider Church marginalization (forgetting) alongside rejection-query. Suppose we have access to a generative model defining the joint distribution <math>P(R_1, ..., R_n)</math>. First, sample a tuple of values for the random variables and forget everything besides the value <math>R_1=r_1</math>. Next, use rejection-query to take a sample from the joint, conditional on <math>R_1=r_1</math>. Notice that <math>P(R_1=r_1)P(R_2=r_2|R_1=r_1)</math> is equal to the original probability <math>P(R_1=r_1,R_2=r_2)</math>, marginalizing over the other variables. This process can be carried on for any number of variables.
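The chain rule can be verified cell by cell on the small joint distribution from the earlier table. This Python sketch (the helper names are our own) checks that <math>P(A,B) = P(A)P(B|A)</math> for every entry.

```python
joint = {(True, True): 0.1, (True, False): 0.3,
         (False, True): 0.5, (False, False): 0.1}

def marginal_a(a):
    # P(A = a), summing out B.
    return sum(p for (aa, b), p in joint.items() if aa == a)

def cond_b_given_a(b, a):
    # P(B = b | A = a), by the definition of conditional probability.
    return joint[(a, b)] / marginal_a(a)

# Chain rule: P(A, B) = P(A) P(B | A) holds for every cell of the joint.
for (a, b), p in joint.items():
    assert abs(marginal_a(a) * cond_b_given_a(b, a) - p) < 1e-12
```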
The definition of conditional probability tells us that the following equation holds.
- <math>P(H|D=d) = \frac{P(D=d, H)}{\sum_{H} P(D=d|H) P(H)}</math>
The chain rule gives us Bayes' rule by rewriting the numerator.
- <math>P(H|D=d) = \frac{P(D=d|H)P(H)}{\sum_{H} P(D=d|H) P(H)}</math>
The general situation expressed by query is the following.
- <math>P(R_i |R_j=r) = \sum_{R_{l \neq i,j}} \left [ \frac{P(R_1, ..., R_j=r,..., R_n)}{\sum_{R_{k \neq j}} P( R_1, ..., R_j=r,..., R_n)} \right ]</math>
Propositional Assumptions: Complex Queries
It is natural to condition a generative model on a value for one of the variables declared in this model. However, one may also wish to ask for more complex hypotheticals: "what if P," where P is a complex proposition composed out of variables declared in the model.
Consider the following Church query
{{#churchserv: (define samples
  (repeated-mh-def-query 100 100
    '(define A (if (flip) 1 0))
    '(define B (if (flip) 1 0))
    '(define C (if (flip) 1 0))
    '(define D (if (flip) 1 0))
    'A
    '(< (+ A B C D) 3)
    (get-current-environment)))
(hist samples "A") }}
In this query we have defined a generative model that samples four binary digits. However, we have conditioned on the complex assumption that the sum of these random variables is less than 3. This involves a new random variable, (< (+ A B C D) 3), which did not appear anywhere in the generative model. In the traditional presentation of conditional probability we usually think of conditioning as observation: it explicitly forces random variables to take on certain values. For example, <math>P(A|B=b)</math> explicitly requires that <math>B = b</math>. In order to express the above query in this way, we could add the complex variable to the generative model and then condition on it. However, this intertwines the hypothetical assumption with the generative model, which is not what we want: we want a simple model that supports many queries, rather than a complex model in which only a prescribed set of queries is allowed.
Writing models in Church allows the flexibility to build complex random expressions like this as needed, making assumptions that are phrased as complex propositions, rather than simple observations. Using query allows us to do correct inference in these cases.
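The complex conditioner above is small enough to check by exact enumeration. This Python sketch computes the conditional probability that A is 1 given the propositional condition, without ever adding the proposition to the model.

```python
from itertools import product

# Enumerate all 16 settings of the four binary digits and keep those
# consistent with the complex conditioner A + B + C + D < 3.
consistent = [bits for bits in product([0, 1], repeat=4) if sum(bits) < 3]

# Renormalize within the consistent set (all outcomes are equally likely).
p_a1 = sum(1 for bits in consistent if bits[0] == 1) / len(consistent)
# 11 of the 16 outcomes satisfy the condition; 4 of those have A = 1,
# so P(A=1 | A+B+C+D < 3) = 4/11 ≈ 0.364 — less than the prior 0.5,
# as the histogram from the Church query suggests.
```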