# From Causal Dependence to Statistical Dependence

Our probabilistic programs encode knowledge about the world in the form of causal models, and it is useful to understand how their function relates to their structure by thinking about some of the intuitive properties of causal relations. Causal relations are local, modular, and directed. Any two arbitrary events in the world are most likely causally unrelated, or independent. If they are related, or dependent, the relation is only very weak and liable to be ignored in our mental models. Many events that are related are not related directly: They are connected only through causal chains of several steps, a series of intermediate and more local dependencies. And the basic dependencies are asymmetric: when we say that A causes B, it means something different than saying that B causes A. Information flows both ways along a causal relation—we expect that learning about either event will give us information about the other—but causal influence flows only one way: we expect that manipulating A will change the state of B, but not vice versa.

Let's examine this notion of "causal dependence" a little more carefully. What does it mean to believe that A depends causally on B? Viewing cognition through the lens of probabilistic programs, the most basic notions of causal dependence are in terms of the structure of the program that expresses a causal model and the flow of evaluation (or "control") in its execution. We say that expression A causally depends on expression B if it is necessary to evaluate B in order to evaluate A. (More precisely, if it is not always possible to determine the value of expression A without determining the value of expression B.) Note that this causal dependence order is weaker than a notion of ordering in time, giving only a partial order on expressions. (This notion of causal dependence is roughly the same as the notion of control flow dependence in the functional programming language literature.) For example, consider a simpler variant of our medical diagnosis scenario:
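The interactive code box for this model is not reproduced here, but its structure can be sketched in Python. The causal structure (smoking causing lung disease, lung disease and cold both causing cough, and so on) follows the text; the particular probabilities are illustrative assumptions, not values from the original Church program.

```python
import random

def flip(p=0.5):
    """Bernoulli draw, like Church's (flip p)."""
    return random.random() < p

def sample_patient():
    # Root causes:
    smokes = flip(0.2)
    cold = flip(0.02)
    # lung-disease depends on smokes; all probabilities are illustrative.
    lung_disease = (smokes and flip(0.1)) or flip(0.001)
    # Symptoms depend on the diseases, each with a small background rate:
    cough = (cold and flip(0.5)) or (lung_disease and flip(0.5)) or flip(0.001)
    fever = (cold and flip(0.3)) or flip(0.01)
    chest_pain = (lung_disease and flip(0.2)) or flip(0.01)
    shortness_of_breath = (lung_disease and flip(0.2)) or flip(0.01)
    return dict(smokes=smokes, cold=cold, lung_disease=lung_disease,
                cough=cough, fever=fever, chest_pain=chest_pain,
                shortness_of_breath=shortness_of_breath)
```

In this sketch cough reaches smokes only through lung_disease, mirroring the indirect causal dependence the text describes.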

Here, cough depends causally on both lung-disease and cold, while fever depends causally on cold but not lung-disease. We can see that cough depends causally on smokes but only indirectly: although cough does not call smokes directly, in order to evaluate whether a patient coughs, we first have to evaluate the expression lung-disease that must itself evaluate smokes. (The notion of "direct" causal dependence is a bit vague: do we want to say that cough depends directly on cold, or only directly on the expression (or (and cold (flip 0.5)) ...)? This can be resolved in several ways that all result in similar intuitions.) The causal dependence structure is not always immediately clear from examining a program, particularly where there are complex function calls; a way of diagnosing causal dependence, which is sometimes even taken as its definition, is by manipulation—see the sidebar.

Sidebar: Another way to detect (or according to some philosophers, such as Jim Woodward, to define) causal dependence is more functional, in terms of "difference making": If we manipulate A, does B tend to change? The code above queries whether a patient is likely to have a cold or a cough a priori, conditioned on no observations. See what happens if we give our hypothetical patient a cold, e.g., by exposing him to a strong cocktail of cold viruses. Formally we implement this not by observing cold (conditioning on having a cold in the Church query) but by forcing it in the definition: (define cold true). You should see that conditional inferences about cough change: coughing becomes more likely if we know that a patient has been given a cold by external intervention. But the reverse is not true. If we force the patient to have a cough, e.g., with some unusual drug or by exposure to some cough-inducing dust -- or formally, by writing (define cough true), conditional inferences about cold are unaffected. Here is a more familiar version of this point: we know that treating the symptoms of a disease directly doesn't cure the disease (taking cough medicine doesn't make your cold go away), but treating the disease does relieve the symptoms. Verify in the Church query above that the same notion of causation as asymmetric difference-making holds for functions that are only indirectly dependent: e.g., force a patient to smoke and show that it increases their probability of coughing, but not vice versa. If we are given the program code for a causal model, and the model is simple enough, it is straightforward to read off causal dependencies from the structure of function calls in the program. However, the notion of causation as difference-making may be easier to compute in much larger, more complex models -- and it does not require an analysis of the program code. 
As long as we can modify (or imagine modifying) the definitions of functions in the program and query the resulting model, we can compute whether two events or functions are causally related by the difference-making criterion. This sort of inference is known in the causal Bayesian network literature as the "do operator" or graph surgery (Pearl). It is also the basis for interesting theories of counterfactual reasoning by Pearl and colleagues (Halpern, Hitchcock and others).
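The difference between observing and intervening can be sketched in Python. Intervening on a variable replaces its definition (graph surgery), rather than conditioning on its value; the parameters below are illustrative assumptions.

```python
import random

def flip(p=0.5):
    return random.random() < p

def sample(do_cold=None, do_cough=None):
    # Each do_ argument, if given, overwrites the corresponding definition,
    # implementing an intervention rather than an observation.
    cold = flip(0.02) if do_cold is None else do_cold
    lung_disease = flip(0.001)
    cough = do_cough if do_cough is not None else (
        (cold and flip(0.5)) or (lung_disease and flip(0.5)) or flip(0.01))
    return dict(cold=cold, cough=cough)

def rate(var, n=50000, **do):
    return sum(sample(**do)[var] for _ in range(n)) / n

# Forcing a cold makes coughing more likely, but forcing a cough
# leaves the probability of a cold unchanged:
cough_forced_cold = rate("cough", do_cold=True)   # roughly 0.5 here
cough_baseline = rate("cough")                    # a few percent
cold_forced_cough = rate("cold", do_cough=True)   # still about 0.02
```

The asymmetry is exactly the difference-making criterion: intervening on the cause changes the effect, but not vice versa.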

There are several special situations that are worth mentioning. In some cases, whether expression A requires expression B will depend on the value of some third expression C. For example, here is a particular way of writing a noisy-logical AND gate:

(define C (flip))
(define B (flip))
(define A (if C (if B (flip 0.85) false) false))


A always requires C, but (under a lazy or "by need" approach to evaluation) only evaluates B if C returns true. Under the above definition of causal dependence A depends on B (as well as C). However, one could imagine a more fine-grained notion of causal dependence that would be useful here: we could say that A depends causally on B only in certain contexts (just those where C happens to return true and thus A calls B).

When an expression occurs inside a function body it may get evaluated several times in a program execution. In such cases it is useful to speak of causal dependence between specific evaluations of two expressions. (However, note that if a specific evaluation of A depends on a specific evaluation of B, then any other specific evaluation of A will depend on some specific evaluation of B.)

One often hears the warning, "correlation does not imply causation". By "correlation" we mean a different kind of dependence between events or functions, sometimes called "statistical dependence" to distinguish it from "causal dependence". The fact that we need to be warned against confusing these notions suggests they are related, and indeed, they are. One might even say they are "causally related", in the sense that causal dependencies cause, or give rise to, statistical dependencies. In general, if A causes B, then A and B will be statistically dependent. Statistical dependence is a symmetric relation between events referring to how information flows between them when we observe or reason about them. If A and B are statistically dependent, then learning about A tells us something about B, and vice versa. In the language of Church: using query to assert something about A changes the value expected for B, and vice versa. Using the above code box (press Reset if necessary to undo your last changes), verify that this sort of informational dependence is symmetric, unlike causal dependence. For instance, conditioning on cold should affect the conditional distribution on cough, and also vice versa. Verify also that statistical dependence holds symmetrically for events that are connected by an indirect causal chain, such as smokes and cough.
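This symmetry can be checked by rejection sampling. The Python sketch below uses a pared-down cold/lung-disease/cough fragment of the model with illustrative parameters: conditioning on cold raises the probability of cough, and conditioning on cough raises the probability of cold.

```python
import random

def flip(p=0.5):
    return random.random() < p

def sample():
    cold = flip(0.02)
    lung_disease = flip(0.001)
    cough = (cold and flip(0.5)) or (lung_disease and flip(0.5)) or flip(0.001)
    return dict(cold=cold, cough=cough)

def conditional(query, condition, n=200000):
    # Rejection sampling: keep only samples satisfying the condition.
    hits = [query(s) for s in (sample() for _ in range(n)) if condition(s)]
    return sum(hits) / len(hits)

# Information flows both ways across the causal link: both conditionals
# are far above the corresponding marginal rates.
p_cough_given_cold = conditional(lambda s: s["cough"], lambda s: s["cold"])
p_cold_given_cough = conditional(lambda s: s["cold"], lambda s: s["cough"])
```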

Correlation is not just a symmetrized version of causality, of course. Two events may be statistically dependent even if there is no causal chain running between them, as long as they have a common cause (direct or indirect). In the language of Church models, two expressions will be statistically dependent if one calls the other, directly or indirectly, or if they both at some point in their evaluation histories refer to some other expression. In the example above, cough and fever are not causally dependent but they are statistically dependent, because they both call cold; likewise for chest-pain and shortness-of-breath, which both call lung-disease. Here we can read off these facts from the program definitions, but more generally all of these relations can be diagnosed by reasoning using query. (Exercise: Show this in the example above by modifying the queries.)

Successful learning and reasoning with causal models typically depends on exploiting the close coupling between causation and correlation. This is because causal relations are typically unobservable, while correlations are observable from data. Noticing patterns of correlation is thus often the beginning of causal learning, or discovering what causes what. With a causal model already in place, reasoning about the statistical dependencies implied by the model allows us to predict many aspects of the world not directly observed from those aspects we do observe. When the time comes to act, though -- to manipulate the objects in our world -- then we must be careful not to confuse statistical and causal dependence.

# From A Priori Dependence to Conditional Dependence

The relationships between causal structure and correlational structure, or statistical dependence, become particularly interesting and subtle when we look at the effects of observations. Events that are statistically dependent a priori (sometimes called marginally dependent) may become independent when we condition on some other observations; this is often called screening off, or sometimes context-specific independence. Also, events that are statistically independent a priori (marginally independent) may become dependent when we condition on other observations; this is known as explaining away. The dynamics of screening off and explaining away are extremely important for understanding patterns of inference -- reasoning and learning -- in causal models.

## Screening off

"Screening off" refers to a pattern of statistical inference that is quite common in both scientific and intuitive reasoning. If the statistical dependence between two events A and B is only indirect, mediated strictly by one or more other events C, then conditioning on (observing) C should render A and B statistically independent. This can occur if A and B are connected by one or more causal chains, and all such chains run through the set of events C, or if C comprises one or more common causes of A and B. As an example in our simple medical scenario, smokes is correlated with (statistically dependent on) several symptoms, cough, chest-pain, and shortness-of-breath, due to the causal chain between them mediated by lung-disease. We can see this easily by conditioning on these symptoms and querying smokes:

The conditional probability of smokes is much higher than the base rate, 0.2, because observing all these symptoms gives strong evidence for smoking. See how much evidence the different symptoms contribute by dropping them out of the conditioning set. For instance, condition the above query on (and cough chest-pain), or just cough; you should observe the probability of smokes decrease as fewer symptoms are observed.

Now, suppose we condition also on knowledge about the function that mediates these causal links: lung-disease. Is there still an informational dependence between these various symptoms and smokes, conditioned on knowing that lung-disease is true or false? In the query below, try adding and removing various symptoms (cough, chest-pain, shortness-of-breath) but maintaining the observation lung-disease (note: it can be tricky to diagnose statistical independence using query, since natural variation due to random sampling can look like differences between conditions):

Try the same experiment of varying the symptom patterns for cough, chest-pain, shortness-of-breath while maintaining the observation (not lung-disease):

You should now see an effect of whether or not the patient has lung disease on conditional inferences about smoking -- a person is judged substantially more likely to be a smoker if they have lung disease than otherwise -- but there are no separate effects of chest pain, shortness of breath, or cough, over and above the evidence provided by knowing whether the patient has lung-disease. We say that the intermediate variable lung disease screens off the root cause of smoking from the more distant effects of coughing, chest pain and shortness of breath.

Screening off as defined here is a purely statistical phenomenon. When we observe C, the event(s) that mediate an indirect causal relation between A and B, A and B are still causally dependent in our model of the world: it is just our states of knowledge about A and B that become uncorrelated. There is also an analogous causal phenomenon. If we can actually manipulate or intervene on the causal system, and set the value of C to some known value, then A and B become both statistically and causally independent. Try this out on the above Church program using (define lung-disease true).
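Screening off can also be checked numerically. In the Python sketch below (illustrative parameters, omitting cold for brevity), once lung-disease is observed, additionally observing cough should leave the posterior probability of smokes essentially unchanged.

```python
import random

def flip(p=0.5):
    return random.random() < p

def sample():
    smokes = flip(0.2)
    lung_disease = (smokes and flip(0.1)) or flip(0.001)
    cough = (lung_disease and flip(0.5)) or flip(0.01)
    return dict(smokes=smokes, lung_disease=lung_disease, cough=cough)

def conditional(query, condition, n=300000):
    hits = [query(s) for s in (sample() for _ in range(n)) if condition(s)]
    return sum(hits) / len(hits)

# Given lung-disease, cough carries no extra information about smokes;
# the two estimates should agree up to sampling noise.
p_smokes_ld = conditional(lambda s: s["smokes"],
                          lambda s: s["lung_disease"])
p_smokes_ld_cough = conditional(lambda s: s["smokes"],
                                lambda s: s["lung_disease"] and s["cough"])
```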

## Explaining away

"Explaining away" [1] refers to a complementary pattern of statistical inference which is somewhat more subtle than screening off. If two events A and B are statistically (and hence causally) independent, but they are both causes of one or more other events C, then conditioning on (observing) C will in general render A and B statistically dependent. As with screening off, we are only talking about inducing statistical dependence here, not causal dependence: when we observe C, A and B remain causally independent in our model of the world; it is just our states of knowledge about A and B that become correlated.

Here is a concrete example in our medical scenario. A priori, having a cold and having lung disease are independent, both causally and statistically. But because they are both causes of some of the same symptoms, such as coughing, if we observe cough then the functions cold and lung-disease become statistically dependent. That is, learning something about whether a patient has cold or lung-disease will, in the presence of their common effect cough, convey information about the other condition. We say that cold and lung-disease are marginally (or unconditionally) independent, but conditionally dependent given cough.

To illustrate, observe how the probabilities of cold and lung-disease change when we observe cough is true.

Both cold and lung disease are now far more likely than their baseline probability: the probability of having a cold increases from 2% to around 40%; the probability of having lung disease increases from 1 in 1000 to a few percent. Given a cough, cold is the much more likely explanation simply because it was much more probable a priori.
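These numbers can be reproduced approximately by rejection sampling. In the Python sketch below the parameters are illustrative assumptions tuned so that the marginal rates roughly match those quoted in the text (cold about 2%, lung disease about 1 in 1000); the exact values in the original program may differ.

```python
import random

def flip(p=0.5):
    return random.random() < p

def sample():
    cold = flip(0.02)
    lung_disease = flip(0.001)
    cough = (cold and flip(0.5)) or (lung_disease and flip(0.5)) or flip(0.015)
    return dict(cold=cold, lung_disease=lung_disease, cough=cough)

def conditional(query, condition, n=400000):
    hits = [query(s) for s in (sample() for _ in range(n)) if condition(s)]
    return sum(hits) / len(hits)

# Observing cough raises both causes above their base rates...
p_cold_given_cough = conditional(lambda s: s["cold"], lambda s: s["cough"])
p_ld_given_cough = conditional(lambda s: s["lung_disease"], lambda s: s["cough"])
# ...and ruling out cold pushes lung disease up further (explaining away):
p_ld_cough_no_cold = conditional(lambda s: s["lung_disease"],
                                 lambda s: s["cough"] and not s["cold"])
```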

Now suppose we learn that the patient does not have a cold.

Now the probability of having lung disease increases dramatically. If instead we had observed that the patient does have a cold, the probability of lung disease would return to its very low base rate of 1 in 1000.

This is the conditional informational dependence between lung disease and cold, given cough: learning that the patient does in fact have a cold "explains away" the observed cough, so the probability of lung disease drops back to roughly its 1-in-1000 rate in the general population. If, on the other hand, we learn that the patient does not have a cold, then the most likely alternative explanation is not available to "explain away" the observed cough, and the conditional probability of lung disease rises dramatically. As an exercise, check that if we remove the observation of coughing, the observation of having a cold or not having a cold has no influence on our belief about lung disease; this effect is purely conditional on the observation of a common effect of these two causes.

If we consider a patient with a cough who also smokes, then cold and lung disease are roughly equally likely explanations.

Or add the observation that the patient has chest pain, so lung disease becomes an even more probable condition than having a cold. These are the settings where explaining away effects will be strongest. Modify the above program to observe that the patient either has a cold or does not have a cold, in addition to having a cough, smoking, and perhaps having chest-pain. E.g., compare these conditioners:

 (and smokes cough) with (and smokes cough cold) or
(and smokes cough (not cold))

 (and smokes chest-pain cough) with (and smokes chest-pain cough cold) or
(and smokes chest-pain cough (not cold))


Notice how far up or down knowledge about whether the patient has a cold can push the conditional belief in having lung disease.

Explaining away effects can be more indirect. Instead of observing the truth value of cold, an alternative cause of cough besides lung-disease, we might simply observe another symptom that provides evidence for cold, such as fever. Compare these conditioners with the above Church program to see an "explaining away" conditional dependence in belief between fever and lung-disease.

 (and smokes chest-pain cough) with (and smokes chest-pain cough fever) or
(and smokes chest-pain cough (not fever))


In this case, finding out that the patient either does or does not have a fever makes a crucial difference in whether we think that the patient has lung disease... even though fever itself is not at all diagnostic of lung disease, and there is no causal connection between them.

We can express the general phenomenon of explaining away with the following schematic Church query:

(query
  (define a ...)
  (define b ...)
  ...
  (define data (... a ... b ...))

  b

  (and (equal? data some-value) (equal? a some-other-value)))


We have defined two independent variables a and b, both of which are used to define the value of our data. If we condition on the data and on a, the posterior distribution on b will now depend on a: observing additional information about a changes our conclusions about b.

The most typical pattern of explaining away we see in causal reasoning is a kind of anti-correlation: the probabilities of two possible causes for the same effect increase when the effect is observed, but they are conditionally anti-correlated, so that observing additional evidence in favor of one cause should lower our degree of belief in the other cause. However, the coupling in belief states induced by conditioning on common effects is not always an anti-correlation. That depends on the nature of the interaction between the causes.

Explaining away takes the form of an anti-correlation when the causes interact in a roughly disjunctive or additive form: e.g., the effect tends to happen if cause A or cause B produce it; or the effect happens if the sum of cause A's and cause B's continuous influences exceeds some threshold. The following simple mathematical examples show some of the other possibilities. Suppose we condition on observing the sum of two integers drawn uniformly from 0 to 9.

This gives perfect anti-correlation in conditional inferences for A and B. But suppose we instead condition on observing that A and B are equal.

Now, of course, A and B go from being independent a priori to being perfectly correlated in the conditional distribution. Try out these other conditioners to see other possible patterns of conditional dependence for a priori independent functions.

(< (abs (- A B)) 2)

(and (>= (+ A B) 9) (<= (+ A B) 11))

(equal? 3 (abs (- A B)))

(equal? 3 (modulo (- A B) 10))

(equal? (modulo A 2) (modulo B 2))

(equal? (modulo A 5) (modulo B 5))
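These conditioners can be explored numerically. The Python sketch below draws pairs of independent uniform integers, filters by a conditioner, and measures the Pearson correlation of the surviving pairs; the helper functions are our own, not part of the original program.

```python
import random

def conditioned_pairs(condition, n=100000):
    # Draw independent A, B uniform on 0..9; keep pairs satisfying the condition.
    pairs = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(n)]
    return [(a, b) for a, b in pairs if condition(a, b)]

def correlation(pairs):
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    cov = sum((a - ma) * (b - mb) for a, b in pairs) / n
    va = sum((a - ma) ** 2 for a, _ in pairs) / n
    vb = sum((b - mb) ** 2 for _, b in pairs) / n
    return cov / (va * vb) ** 0.5

# Conditioning on the sum gives perfect anti-correlation,
# conditioning on equality gives perfect positive correlation:
r_sum = correlation(conditioned_pairs(lambda a, b: a + b == 9))
r_eq = correlation(conditioned_pairs(lambda a, b: a == b))
```

Swapping in the other conditioners (absolute difference, modular equality) shows the intermediate patterns of conditional dependence mentioned in the text.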


# Non-monotonic Reasoning

One reason explaining away is an important phenomenon in probabilistic inference is that it is an example of non-monotonic reasoning. In formal logic, a theory is said to be monotonic if adding an assumption (or formula) to the theory never reduces the set of consequences of the previous assumptions. Most traditional logics (e.g., first-order logic) are monotonic, but human reasoning does not seem to be. For instance, if I tell you that Tweety is a bird, you conclude that he can fly; if I now tell you that Tweety is an ostrich, you retract the conclusion that he can fly. Over the years many non-monotonic logics have been introduced to model aspects of human reasoning. One of the first ways in which probabilistic reasoning with Bayesian networks was recognized as important for AI was that it could perspicuously capture these patterns of reasoning.

Another way to think about monotonicity is by considering the trajectory of our belief in a specific proposition, as we gain additional relevant information. In traditional logic, there are only three states of belief: true, false, and unknown (when neither a proposition nor its negation can be proven). As we learn more about the world, maintaining logical consistency requires that our belief in any proposition move only from unknown to true or false. That is, our "confidence" in any conclusion only increases (and does so in one giant leap from unknown to true or false).

In a probabilistic approach, by contrast, there is a spectrum of degrees of belief. We can think of confidence as a measure of how far our beliefs are from a uniform distribution, or how close to the extremes of 0 or 1. In probabilistic inference, unlike in traditional logic, our confidence in a proposition can both increase and decrease. As we will see in the next example, even fairly simple probabilistic models can induce complex explaining-away dynamics that lead our degree of belief in a proposition to reverse directions multiple times as the conditioning set expands.

Often we have to make inferences about different types of entities and their interactions, and a highly interactive set of relations between the entities leads to very challenging explaining away problems. Inference is computationally difficult in these situations, but the inferences come very naturally to people, suggesting that these are important problems that our brains have specialized somewhat to solve.

A familiar example comes from reasoning about the causes of students' patterns of success and failure in the classroom. Imagine yourself in the position of an interested outside observer -- a parent, another teacher, a guidance counselor or college admissions officer -- in thinking about these conditional inferences. If a student doesn't pass an exam, what can you say about why he failed? Maybe he doesn't do his homework, maybe the exam was unfair, or maybe he was just unlucky?

Now what if you have evidence from several students and several exams?

Initially we observe that Bill failed exam 1. A priori, we assume that most students do their homework and most exams are fair, but given this one observation it becomes somewhat likely that either the student didn't study or the exam was unfair.

See how conditional inferences about Bill and exam 1 change as you add in more data about this student or this exam, or additional students and exams. Paste each of the data sets below into the above Church box, as a substitute for the current conditioner (not (pass? 'bill 'exam1)). Try to explain the dynamics of inference that result at each stage. What does each new piece of the larger data set contribute to your intuition about Bill and exam 1?

 (and (not (pass? 'bill 'exam1)) (not (pass? 'bill 'exam2)))

 (and (not (pass? 'bill 'exam1))
(not (pass? 'mary 'exam1))
(not (pass? 'tim 'exam1)))

(and (not (pass? 'bill 'exam1)) (not (pass? 'bill 'exam2))
(not (pass? 'mary 'exam1))
(not (pass? 'tim 'exam1)))

 (and (not (pass? 'bill 'exam1))
(not (pass? 'mary 'exam1)) (pass? 'mary 'exam2) (pass? 'mary 'exam3) (pass? 'mary 'exam4) (pass? 'mary 'exam5)
(not (pass? 'tim 'exam1)) (pass? 'tim 'exam2) (pass? 'tim 'exam3) (pass? 'tim 'exam4) (pass? 'tim 'exam5))

 (and (not (pass? 'bill 'exam1))
(pass? 'mary 'exam1)
(pass? 'tim 'exam1))

 (and (not (pass? 'bill 'exam1))
(pass? 'mary 'exam1) (pass? 'mary 'exam2) (pass? 'mary 'exam3) (pass? 'mary 'exam4) (pass? 'mary 'exam5)
(pass? 'tim 'exam1) (pass? 'tim 'exam2) (pass? 'tim 'exam3) (pass? 'tim 'exam4) (pass? 'tim 'exam5))

 (and (not (pass? 'bill 'exam1)) (not (pass? 'bill 'exam2))
(pass? 'mary 'exam1) (pass? 'mary 'exam2) (pass? 'mary 'exam3) (pass? 'mary 'exam4) (pass? 'mary 'exam5)
(pass? 'tim 'exam1) (pass? 'tim 'exam2) (pass? 'tim 'exam3) (pass? 'tim 'exam4) (pass? 'tim 'exam5))

 (and (not (pass? 'bill 'exam1)) (not (pass? 'bill 'exam2)) (pass? 'bill 'exam3) (pass? 'bill 'exam4) (pass? 'bill 'exam5)
(not (pass? 'mary 'exam1)) (not (pass? 'mary 'exam2)) (not (pass? 'mary 'exam3)) (not (pass? 'mary 'exam4)) (not (pass? 'mary 'exam5))
(not (pass? 'tim 'exam1)) (not (pass? 'tim 'exam2)) (not (pass? 'tim 'exam3)) (not (pass? 'tim 'exam4)) (not (pass? 'tim 'exam5)))
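The interactive model for this example is not shown above; here is a Python sketch of the kind of generative model the text assumes. The structure (a per-student does-homework variable, a per-exam fair variable, and a pass? that depends on both) follows the text, while all the probabilities are illustrative assumptions.

```python
import random
from itertools import product

def flip(p=0.5):
    return random.random() < p

STUDENTS = ["bill", "mary", "tim"]
EXAMS = ["exam1", "exam2"]

def sample_world():
    does_homework = {s: flip(0.8) for s in STUDENTS}
    exam_fair = {e: flip(0.8) for e in EXAMS}
    def pass_prob(s, e):
        if does_homework[s] and exam_fair[e]:
            return 0.9
        if does_homework[s] or exam_fair[e]:
            return 0.4
        return 0.1
    passed = {(s, e): flip(pass_prob(s, e)) for s, e in product(STUDENTS, EXAMS)}
    return does_homework, exam_fair, passed

def query(condition, n=50000):
    """Estimate P(bill does homework) and P(exam1 fair) given the condition."""
    hw_hits = fair_hits = total = 0
    for _ in range(n):
        hw, fair, passed = sample_world()
        if condition(passed):
            total += 1
            hw_hits += hw["bill"]
            fair_hits += fair["exam1"]
    return hw_hits / total, fair_hits / total

# Observing that Bill failed exam 1 lowers belief in both explanations:
p_hw, p_fair = query(lambda passed: not passed[("bill", "exam1")])
```

The larger data sets above can be explored by swapping in the corresponding condition on passed (extending EXAMS to five exams where needed).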


## A Case Study in Modularity: Visual Perception of Surface Lightness and Color

Visual perception is full of rich conditional inference phenomena, including both screening off and explaining away. Some very impressive demonstrations have been constructed using the perception of surface structure by mid-level vision researchers; see the work of Dan Kersten, David Knill, Ted Adelson, Bart Anderson, Ken Nakayama, among others. Most striking is when conditional inference appears to violate or alter the apparently "modular" structure of visual processing. Neuroscientists have developed an understanding of the primate visual system in which processing for different aspects of visual stimuli -- color, shape, motion, stereo -- appears to be at least somewhat localized in different brain regions. This view is consistent with findings by cognitive psychologists that at least in early vision, these different stimulus dimensions are not integrated but processed in a somewhat modular fashion. Yet vision is at heart about constructing a unified and coherent percept of a three-dimensional scene from the patterns of light falling on our retinas. That is, vision is causal inference on a grand scale. Its output is a rich description of the objects, surface properties and relations in the world that are not themselves directly grasped by the brain but that are the true causes of the retinal stimulus. Solving this problem requires integration of many appearance features across an image, and this results in the potential for massive effects of explaining away and screening off.

In vision, the luminance of a surface depends on two factors, the illumination of the surface (how much light is hitting it) and its reflectance. The actual luminance is the product of the two factors. Thus luminance is inherently ambiguous. The visual system has to determine what proportion of the luminance is due to reflectance and what proportion is due to the illumination of the scene. This has led to a famous illusion known as the checker shadow illusion discovered by Ted Adelson.

The illusion results from the fact that in the image above both the square labeled A and the square labeled B are actually the same shade of gray. This can be seen in the figure below where they are connected by solid gray bars on either side.

What is happening here is that the presence of the cylinder is providing evidence that the illumination of square B is actually less than that of square A. Thus we perceive square B as having higher reflectance, since its luminance is identical to square A despite the fact that we believe there is less light hitting it. The following program implements a simple version of this scenario "before" we see the shadow cast by the cylinder.

Here we have introduced a third kind of primitive random procedure, gaussian which outputs real numbers, in addition to flip (which outputs binary truth values) and beta (which outputs numbers in the interval [0,1]). gaussian implements the well-known Gaussian or normal distribution. It takes two parameters: a mean, $\mu$, and a variance, $\sigma^2$.

Also, note that here we have defined the helper function noisy= using the abstraction operator lambda. We will discuss this operator in more detail later in the course.

Now let's condition on the presence of the cylinder.

Conditional inference takes into account all the different paths through the generative process that could have generated the data. Two variables on different causal paths can thus become dependent once we condition on the way the data came out. The important point is that the variables reflectance and illumination are conditionally independent in the generative model, but after we condition on luminance they become dependent: changing one of them affects the probability of the other. This phenomenon has important consequences for cognitive science modeling. Although our models of knowledge of the world and of language have a certain kind of modularity implied by conditional independence, as soon as we start using a model to do conditional inference on some data (e.g., parsing, or learning the language), variables that were modularly isolated can become dependent.
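The reflectance/illumination interaction can be sketched with likelihood weighting in Python (note that Python's random.gauss takes a mean and a standard deviation rather than a variance). The priors, the observation-noise level, and the use of a lowered illumination prior to stand in for "seeing the cylinder's shadow" are all illustrative assumptions.

```python
import math
import random

def posterior_mean_reflectance(observed_luminance, illum_mean, n=100000):
    # Likelihood weighting: sample from the priors, weight each sample by the
    # probability of the observed luminance under Gaussian noise (sd = 0.2).
    total_w = 0.0
    total_wr = 0.0
    for _ in range(n):
        reflectance = random.gauss(1.0, 1.0)
        illumination = random.gauss(illum_mean, 0.5)
        luminance = reflectance * illumination
        w = math.exp(-0.5 * ((luminance - observed_luminance) / 0.2) ** 2)
        total_w += w
        total_wr += w * reflectance
    return total_wr / total_w

# Same observed luminance, but evidence of lower illumination (the shadow)
# raises the inferred reflectance -- illumination now "explains" less of it:
bright = posterior_mean_reflectance(3.0, illum_mean=3.0)
shadow = posterior_mean_reflectance(3.0, illum_mean=1.0)
```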

### Other vision examples (to be developed)

Kersten's "colored Mach card" illusion is a beautiful example of both explaining away and screening off in visual surface perception, as well as a switch between these two patterns of inference conditioned on an auxiliary variable.

Depending on how we perceive the geometry of a surface folded down the middle -- whether it is concave so that the two halves face each other or convex so that they face away -- the perceived colors of the faces will change as the visual system either discounts (explains away) or ignores (screens off) the effects of inter-reflections between the surfaces.

The two cylinders illusion of Kersten is another nice example of explaining away. The gray shading patterns are identical in the left and right images, but on the left the shading is perceived as reflectance difference, while on the right (the "two cylinders") the same shading is perceived as due to shape variation on surfaces with uniform reflectance.

(This image is from Kersten, Mamassian and Yuille, Annual Review of Psychology 2004)

1. Pearl, J. Probabilistic Reasoning in Intelligent Systems, San Mateo: Morgan Kaufmann, 1988.