# Conditioning (ESSLLI)

*To return to the top level: ESSLLI Tutorial.*

We have built up a tool set for constructing probabilistic generative models. These can represent knowledge about causal processes in the world: running one of these programs samples a particular outcome by sampling a "causal history" for that outcome. However, the power of a causal model lies in the flexible ways it can be used to reason about the world. For instance, if we have a generative model in which X depends on Y, we may ask: "assuming I have observed X, how must Y have been?" In this section we describe how a wide variety of inferences can be made from a single generative model by *conditioning* the model on an assumed or observed fact.

Much of cognition can be understood in terms of conditional inference. In its most basic form, *causal reasoning* is conditional inference: given some observed effects, what were the likely causes? *Predictions* are conditional inferences in the opposite direction: given that I have observed some known cause, what are its likely effects? These inferences come from conditioning a probabilistic program expressing a "causal model", or an understanding of how effects depend on causes. The acquisition of that causal model, or *learning*, is also conditional inference at a higher level of abstraction: given our general knowledge of how causal relations operate in the world, and some observed events in which candidate causes and effects co-occur in various ways, what specific causal relations are likely to hold between these observed variables?

To see how these concepts apply in a domain that is not usually thought of as causal, consider language. The core questions of interest in the study of natural language are all at heart conditional inference problems. Given beliefs about the syntactic structure of my language, and an observed sentence, what should I believe about the syntactic structure of that sentence? This is just the *parsing* problem. The complementary problem of *speech production* is as follows: given some beliefs about the syntactic structure of my language (and beliefs about others' beliefs about that), and a particular thought I want to express, how should I encode the thought syntactically? Finally, the discovery or *acquisition* problem: given general knowledge about universal grammar and some data from a particular language, what should we believe about that language's structure? This problem is simultaneously the problem facing the linguist and the child trying to learn a language.

Parallel problems of conditional inference arise in visual perception, theory of mind (or intuitive psychology), and virtually every other domain of cognition. In visual perception, we observe an image or image sequence that is the result of rendering a three-dimensional physical scene onto our two-dimensional retinas. A probabilistic program can model both the physical processes at work in the world that produce natural scenes, and the imaging processes (the "graphics") that render images from scenes. *Perception* is then conditioning this program on some observed output image and inferring the scenes most likely to have given rise to it. In interacting with other people, we observe the actions of intentional agents that are the outputs of planning processes: programs that take as input an agent's mental states (beliefs and desires) and produce action sequences -- typically, for a rational agent, actions that are likely to produce the agent's desired states as reliably or efficiently as possible, given the agent's beliefs. A rational agent can *plan* their actions by conditional inference to infer what steps would be most likely to achieve their desired state. *Action understanding*, or interpreting an agent's observed behavior, can be expressed as conditioning a planning program (a "theory of mind") on the observed actions to infer the mental states that most likely gave rise to those actions, and to predict how the agent is likely to act in the future.

## Contents |

### Hypothetical Reasoning with `query`

Suppose that we know some fixed fact, and we wish to consider hypotheses about how a generative model could have given rise to that fact. In Church we can use a function called `query` with the following interface.

(query generative-model what-we-want-to-know what-we-know)

`query` takes three arguments. The first is some generative model expressed as a series of `define` statements. The second is an expression, called the *query expression*, which represents the aspects of the computation that we are interested in. The last argument is the condition that must be met; this may encode observations, data, or more general assumptions. It is called the *conditioner.*

Consider the following simple generative process.

This process samples three digits 0/1 and adds the result. The value of the final expression here is either 0, 1, 2 or 3. A priori, each of the variables `A`, `B`, `C` has .5 probability of being `1` or `0`. However, suppose that we have observed that the sum `D` is equal to 3. How does this change the space of possible values that variable `A` can take on? It is obvious that `A` must be equal to 1 for this result to happen. We can see this in the following Church query.

The output of `rejection-query` is a "guess" about the likely value of `A`, conditioned on `D` being equal to 3. We use `repeat` to take 100 guesses, which are then turned into a bar graph representing relative probabilities using the `hist` function. Because `A` must necessarily equal 1, the histogram shows 100% of the samples on that value.

Now suppose that we condition on `D` being greater than or equal to 2. Then `A` need not be 1, but it is more likely than not to be. (Why?) The corresponding histogram shows the appropriate distribution of "guesses" for `A` conditioned on this new fact.

In order to understand `query` it is useful to look at the simple, inefficient, but correct implementation known as *rejection sampling* (which we have used above). Rejection sampling is an instance of the method of *generate-and-test*. The idea is simple. We just run our generative model in the forward direction. After we have sampled a value for every random variable in the program (including the query-expression) we compute the conditioner. If it is true we keep the sample, if it is false we throw it away. This can be represented by the following schematic Church code:

(define (rejection-query ..defines.. query-expression conditioner) ..defines.. (define query-value query-expression) (define condition-value conditioner) (if (equal? condition-value true) query-value (rejection-query defines query-expression conditioner)))

(This is only schematic because we must avoid evaluating the query-expression and conditioner until *inside* the rejection-query function. The Church implementation does this by *de-sugaring*: transforming the code before evaluation.)

A Church procedure implicitly denotes a distribution, which it samples from. The meaning of a query expression, as specified by `rejection-query`, can thus be thought of as the distribution over the value returned by the query-expression conditioned on the conditioner being true. This is equivalent to the formal definition of *conditional probability* in probability theory. Conditional probabilities are often written <math>P(A=a|B=b)</math> for the probability that `A` has value `a` given that `B` has value `b` (but the meaning of `A` and `B` must be given elsewhere, unlike a Church program, which contains the full model specification). (For more on the relationship between Church query and conditional probability see The Meaning of Probabilistic Programs).

Unfortunately `rejection-query` is not efficient: even if we are sure that our model can satisfy the condition, it will often take a very large number of samples to find computations that do so. To do this, try making the probability of `A`, `B`, and `C` very low in the above example (eventually the query will time out and be killed):

Even for this very simple program, lowering the baserate by just one order of magnitude to 0.01 will make `rejection-query` impractical.

The AI literature is replete with algorithms and techniques for searching non-deterministic spaces more efficiently, and several of these have been adapted into Church to give implementations of `query` that may be more efficient in various cases. (Though they vary in efficiency, each of these implementations is universal: it samples from the appropriate conditional distribution for a Church query over any Church model given enough time.) One such implementation that we will often use is based on the *Metropolis Hastings* algorithm, a form of *Markov chain Monte Carlo* inference. For background on MH and MCMC, see these excellent introductions by David MacKay (Chapters 29 [1] and 30 [2] of Information Theory, Inference, and Learning Algorithms) or Radford Neal [3].

The `mh-query` implementation takes two extra arguments and returns a list of many samples. The first extra argument is the number of samples to take, and the second controls the "lag", the number of internal random choices that the algorithm makes in a sequence of iterative steps between samples. The total number of MH iterations, and hence the running time of the query, is the product of these two parameters. The workings of MH are beyond the scope of this tutorial, but very roughly, in three sentences: The algorithm implements a random walk or diffusion process (a *Markov chain*) in the space of possible program evaluations consistent with the conditioner. Each MH iteration is one step of this random walk, and the process is specially designed to visit each program evaluation with a long-run frequency proportional to its probability conditioned on the evidence (or probability of having produced the conditioner). With sufficiently long lags between samples, they will be approximately independent samples from the Church program's conditional distribution.

See what happens in the above query as you lower the baserate. Inference should not slow down appreciably, but it will become less stable and less accurate. It becomes increasingly difficult for MH to draw independent conditional samples by taking small random steps, so for a fixed lag (100 in the code above), the 100 samples returned will tend to be less representative of the true conditional inference. In this case, stable and accurate conditional inferences can still be achieved in reasonable time by increasing the number of samples to 500 (while holding the lag at 100).

### Example: Causal Inference in Medical Diagnosis, and the Nature of Human Probabilistic Reasoning

This classic Bayesian inference task is a special case of conditioning. Kahneman and Tversky, and Gigerenzer and colleagues, have studied how people make simple judgments like the following:

- "The probability of breast cancer is 1% for a woman at 40 who participates in a routine screening. If a woman has breast cancer, the probability is 80% that she will have a positive mammography. If a woman does not have breast cancer, the probability is 9.6% that she will also have a positive mammography. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?"

What is your intuition? Many people without training in statistical inference judge the probability to be rather high, typically between 0.7 and 0.9. The correct answer is much lower, less than 0.1, as we can see by evaluating this simple Church query.

Kahneman and Tversky named this kind of judgment error *base rate neglect*, because in order to make the correct judgment, one must realize that the key contrast is between the *base rate* of the disease, 0.01 in this case, and the *false alarm rate *, 0.096. The false alarm rate (or FAR for short) seems low compared to the likelihood (probability of a positive mammogram given lung cancer), but what matters is that it is almost ten times higher than the base rate of the disease. All three of these quantities are needed to compute the posterior probability of having breast cancer given a positive mammogram using the standard version of Bayes' rule (see The Meaning of Probabilistic Programs, subsection "Bayes rule and query"):

- <math>Posterior = \frac{Likelihood \times Prior}{Likelihood \times Prior + FAR \times (1-Prior)} = \frac{0.8 \times 0.01}{0.8 \times 0.01 + 0.096 \times 0.99} = 0.078.</math>

Gigerenzer and colleagues showed that this kind of judgment can be made much more intuitive to untrained reasoners if the relevant probabilities are presented as "natural frequencies", or the sizes of subsets of relevant possible outcomes:

- "On average, ten out of every 1000 women at age 40 who come in for a routine screen have breast cancer. Eight out of those ten women will get a positive mammography. Of the 990 women without breast cancer, 95 will also get a positive mammography. We assembled a sample of 1000 women at age 40 who participated in a routine screening. How many of those who got a positive mammography do you expect to actually have breast cancer?"

Now one can practically read off the answer from the problem formulation: 8 out of 103 (95+8) women in this situation will have breast cancer.

Gigerenzer (along with Cosmides, Tooby and other colleagues) has argued that this formulation is easier because of evolutionary and computational considerations: human minds have evolved to count and compare natural frequencies of discrete events in the world, not to add, multiply and divide decimal probabilities. But this argument alone cannot account for the very broad human capacity for causal reasoning. We routinely make inferences for which we haven't stored up sufficient frequencies of events observed *in the world.*(And often for which no one has told us the relevant frequencies, although perhaps we have been told about degrees of causal strength or base rates in the form of probabilities or other linguistic encoding). The frequency information given above essentially enumerates a table of all possible situations, which is simply not practical for complex real-world reasoning.

However, the basic idea that the mind is good at manipulating frequencies of situations, but bad at arithmetic on continuous probability values, can be extended to cope with novel situations if the frequencies that are manipulated can be frequencies of *imagined* situations. Recall that Church programs explicitly give instructions for sampling imagined situations, and only implicitly specify probability distributions. If human inference is similar to a Church query then it would readily create and manipulate imagined situations, and this could explain both why the frequency framing of Bayesian probability judgment is natural to people and how people cope with rarer and more novel situations. The numbers given in the frequency formulation (or close approximations thereof) can be read off a tree of evaluation histories for 1000 calls of the Church program that specifies the causal model for this problem:

Each path from root to leaf of this tree represents a sequence of random choices made in evaluating the above program (the first flip for breast-cancer, the second for positive-mammogram), with the number of traversals and the sampled value labeling each edge. (Because this is 1000 *random* samples, the number are close (but not exactly) those in the Gigerenzer, et al, story.) Selecting just the 105 hypothetical cases of women with a positive mammogram, and computing the fraction of those who also have breast cancer (7/105), corresponds exactly to `rejection-query`. Thus, we have used the causal representation in the above church program to manufacture frequencies which can be used to arrive at the inference that relatively few women with positive mammograms actually have breast cancer.

Yet unlike the rejection-query people are quite bad at reasoning in this scenario. Why? One answer is that people don't represent their knowledge in quite the form of this simple church program. Indeed, Krynski and Tenenbaum (JEP General, 2007) argued that human statistical judgment is fundamentally based on conditioning more explicit causal models: they suggested that "base rate neglect" and other judgment errors may occur when people are given statistical information that cannot be easily mapped to the parameters of the causal models they intuitively adopt to describe the situation. In the above example, they suggested that the notion of a false alarm rate is not intuitive to many people -- particularly when the false alarm of a medical test is ten times higher than the base rate of the disease that the test is intended to diagnose! They showed that "base rate neglect" could be eliminated, and correct response rates more than doubled, by reformulating the breast cancer problem (and other classics from the judgment literature) in terms of more intuitive causal models. For example, consider their version of the breast cancer problem (the exact numbers and wording differed slightly):

- "1% of women at age 40 who participate in a routine screening will have breast cancer. Of those with breast cancer, 80% will receive a positive mammogram. 20% of women at age 40 who participate in a routine screening will have a benign cyst. Of those with a benign cyst, 50% will receive a positive mammogram due to unusually dense tissue of the cyst. All others will receive a negative mammogram. Suppose that a woman in this age group has a positive mammography in a routine screening. What is the probability that she actually has breast cancer?"

This question is easy for people to answer -- empirically, just as easy as the frequency-based formulation given above. We may conjecture this is because the relevant frequencies can be computed from a simple Church query on the following intuitive causal model:

Because this causal model -- this Church program -- is more intuitive to people, they can imagine the appropriate situations, despite having been given percentages rather than frequencies.
What makes this causal model more intuitive than the one above with an explicitly specified false alarm rate? Essentially we have replaced probabilistic dependencies on the "non-occurrence" of events (e.g., the dependence of a positive mammogram on *not* having breast cancer) with dependencies on explicitly specified alternative causes for observed effects (e.g., the dependence of a positive mammogram on having a benign cyst).

A causal model framed in this way can scale up to significantly more complex situations. Recall our more elaborate medical diagnosis network from the previous section, which was also framed in this way using noisy-logical functions to describe the dependence of symptoms on disease:

You can use this model to infer conditional probabilities for any subset of diseases conditioned on any pattern of symptoms. Try varying the symptoms in the conditioning set or the diseases in the query, and see how the model's inferences compare with your intuitions. For example, what happens to inferences about lung cancer and TB in the above model if you remove chest pain and shortness of breath as symptoms? (Why? Consider the alternative explanations.) More generally, we can condition on any set of events -- any combination of symptoms and diseases -- and query any others. We can also condition on the negation of an event using `(not ...)`: e.g., how does the probability of lung cancer (versus TB) change if we observe that the patient does *not* have a fever, does *not* have a cough, or does not have either symptom?

A Church program thus effectively encodes the answers to a very large number of possible questions in a very compact form, where each question has the form, "Suppose we observe X, what can we infer about Y?". In the program above, there are <math>3^9=19683</math> possible simple conditioning sets (possible X's) corresponding to conjunctions of events or their negations (because the program has 9 stochastic Boolean-valued functions, each of which can be observed true, observed false, or not observed). Then for each of those X's there are a roughly comparable number of Y's, corresponding to all the possible conjunctions of variables that can be in the query set Y, making the total number of simple questions encoded on the order of 100 million. In fact, as we will see below when we describe complex queries, the true number of possibe questions encoded in just a short Church program like this one is very much larger than that; usually the set is infinite. With `query` we can in principle compute the answer to every one of these questions. We are beginning to see the sense in which probabilistic programming provides the foundations for constructing a true *language of thought*, as described in the Introduction: a finite system of knowledge that compactly and efficiently supports an infinite number of inference and decision tasks

Expressing our knowledge as a probabilistic program of this form also makes it easy to add in new relevant knowledge we may acquire, without altering or interfering with what we already know. For instance, suppose we decide to consider behavioral and demographic factors that might contribute causally to whether a patient has a given disease:

Under this model, a patient with coughing, chest pain and shortness of breath is likely to have either lung cancer or TB. Modify the above code to see how these conditional inferences shift if you also know that the patient smokes or works in a hospital (where they could be exposed to various infections, including many worse infections than the typical person encounters). More generally, the causal structure of knowledge representation in a probabilistic program allows us to model intuitive theories that can grow in complexity continually over a lifetime, adding new knowledge without bound.

### Application: Bayesian networks as expert systems in AI

In AI, causal models of this sort are often called *Bayesian networks*, *Belief networks*, or *Directed graphical models*. Bayesian networks have revolutionized the field of *expert systems*, which aims to automate the knowledge of human experts in forms that machines can reason about. Specifically in the context of medical diagnosis, the QMR (or "Quick medical reference") Bayesian network was an early landmark in the field. Just like our examples above, QMR has a two-layer diseases-cause-symptoms structure, with noisy-logical functions relating diseases to symptoms, but it is much bigger in order to capture all the diseases a typical physician has to deal with on a regular basis:

More recently, much more complex and temporally extended Bayesian networks are starting to play a role in systems biology and personalized genomic medicine, as in these examples from Kenneth Drake, Serologix:

## Reasoning with Arbitrary Propositions: Complex Queries

It is natural to condition a generative model on a value for one of the variables declared in this model. However, one may also wish to ask for more complex hypotheticals: "what if P," where P is a complex proposition composed out of variables declared in the model.

Consider the following Church `query`.

This query has the same content as an example above but the syntax is importantly different. We have defined a generative model that samples 3 instances of 0/1 digits. However, we have conditioned on the complex assumption that the sum of these random variables is greater than or equal to 2. This involves a new random variable, `(>= (+ A B C) 2)`. This latter random variable *did not appear anywhere in the generative model*. In the traditional presentation of conditional probabilities we usually think of conditioning as *observation*: it explicitly enforces random variables to take on certain values. For example, when we say <math>P(A|B=b)</math> it explicitly require <math>B = b</math>. In order to express the above query in this way, we could add the complex variable to the generative model, then condition on it. However this intertwines the hypothetical assumption with the generative model, and this is not what we want: we want a simple model which supports many queries, rather than a complex model in which only a prescribed set of queries is allowed.

Writing models in Church allows the flexibility to build complex random expressions like this as needed, making assumptions that are phrased as complex propositions, rather than simple observations. Hence the effective number of queries we can construct for most programs will not merely be a large number but countably infinite, much like the sentences in a natural language. The `query` function in principle supports correct conditional inference for this infinite array of situations.

### Example: Reasoning about the Tug of War

Returning to the earlier example of a series of tug-of-war matches, we can use query to ask a variety of different questions. For instance, how likely is it that bob is strong, given that he's been in a series of winning teams?

Try varying the number of different teams and teammates that bob plays with. How does this change the probability that bob is strong? Do these changes agree with your intuitions? Can you modify this example to draw `strength` from a uniform distribution, instead of having just two possible values?

We can form many complex queries from this simple model. We could ask how likely a team of bob and mary is to win over a team of jim and sue, given that mary is at least as strong as sue, and bob was on a team that won against jim previously:

As an exercise, let's go back to the tug-of-war tournament described under Generative Models and write a Church program to infer the probability that alice is stronger than bob, given a particular tournament outcome.