Interlude: Probability Theory and The Meaning of Probabilistic Programs
By Timothy J. O'Donnell, Noah D. Goodman, Andreas Stuhlmueller, and the Church Working Group.
In order to gain a deeper understanding of Church and the relationship between computation and probabilities, it is useful to explore the Church eval function; more detail can be found in Computation in Church.
Probability Notation
Typically in probability theory we talk about joint distributions: distributions over sets of random variables. If we are interested in random variables <math>A</math> and <math>B</math>, we notate the joint distribution as:
- <math>P(A, B)</math>
Defining the complete joint distribution means specifying the probability of every combination of values that <math>A</math> and <math>B</math> can take. For instance, if <math>A</math> and <math>B</math> are each boolean variables, then we can define the joint distribution by means of a table such as the following.
|  | <math>B=</math>true | <math>B=</math>false |
| --- | --- | --- |
| <math>A=</math>true | .1 | .3 |
| <math>A=</math>false | .5 | .1 |
Marginalization and Independence
The marginal distributions <math>P(A)</math> and <math>P(B)</math> can be derived via marginalization: summing the probability of each value of one variable over all values of the other variable. When we represent the joint distribution as a table like this, this amounts to summing over rows and columns and noting the sums in the margins of the table, hence the name.
|  | <math>B=</math>true | <math>B=</math>false | <math>P(A)</math> |
| --- | --- | --- | --- |
| <math>A=</math>true | .1 | .3 | .4 |
| <math>A=</math>false | .5 | .1 | .6 |
| <math>P(B)</math> | .6 | .4 |  |
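The row and column sums above can be reproduced mechanically. Here is a minimal Python sketch (Python rather than Church, purely for illustration; the table values are taken from the text) that computes both marginals from the joint table:

```python
# Joint distribution over (A, B), using the values from the table above.
joint = {
    ("A=true",  "B=true"):  0.1,
    ("A=true",  "B=false"): 0.3,
    ("A=false", "B=true"):  0.5,
    ("A=false", "B=false"): 0.1,
}

def marginal(joint, index):
    """Sum the joint probability over every value of the other variable."""
    result = {}
    for assignment, p in joint.items():
        key = assignment[index]
        result[key] = result.get(key, 0.0) + p
    return result

p_a = marginal(joint, 0)  # P(A): sum across each row
p_b = marginal(joint, 1)  # P(B): sum down each column
```

Running this recovers the margins of the table: `P(A=true) = .4`, `P(A=false) = .6`, `P(B=true) = .6`, `P(B=false) = .4`.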
In general, the distribution on values of one random variable can depend on the values of many other random variables in complex ways. In the worst case, we might have to define the joint distribution for each and every combination of values one at a time, as we did above. However, the structure of most probabilistic models includes various conditional independence assumptions. For example, in a model where all the variables are independent, the joint distribution factorizes into the product of the marginals:
- <math>P(A, B)=P(A)P(B)</math>
|  | <math>B=</math>true | <math>B=</math>false | <math>P(A)</math> |
| --- | --- | --- | --- |
| <math>A=</math>true | <math>.1=P(A=true)P(B=true)</math> | <math>.4=P(A=true)P(B=false)</math> | <math>.5</math> |
| <math>A=</math>false | <math>.1=P(A=false)P(B=true)</math> | <math>.4=P(A=false)P(B=false)</math> | <math>.5</math> |
| <math>P(B)</math> | .2 | .8 |  |
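To see the factorization at work, a small Python check (again illustrative, not part of the tutorial's Church code) can rebuild the whole joint table from the two marginals alone:

```python
# Marginals from the independence table above.
p_a = {"true": 0.5, "false": 0.5}
p_b = {"true": 0.2, "false": 0.8}

# Under independence, every cell is just the product of the marginals:
# P(A, B) = P(A) P(B).
joint = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}
# e.g. P(A=true, B=true) = 0.5 * 0.2 = 0.1, matching the table.
```

Only the four marginal numbers are needed to specify the full joint, which is exactly the saving that independence buys.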
Marginalization and Independence in Church
What is the set of random variables defined by a Church program? In a Church program, every evaluation of an expression is a random variable. We have already seen examples where this leads to models that sample an unbounded number of random variables. How are conditional independencies expressed in Church? The conditional independence assumptions made by a Church program are partially expressed by the program's computation trace: a graph that represents the process of evaluating the program. A particular node in the trace, corresponding to the evaluation of an expression (or the application of a procedure), depends on all of the nodes that it dominates. Consider the simple probabilistic bigram model below.
{{#churchserv:
(debug-mode 'pretty-pictures true)

(define (unfold-N N start transition)
  (if (= N 0)
      '()
      (pair start (unfold-N (- N 1) (transition start) transition))))

(define (transition symbol)
  (case symbol
    (('start) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4)))
    (('a) (multinomial '(b c d) '(1/3 1/3 1/3)))
    (('b) (multinomial '(a c d) '(1/3 1/3 1/3)))
    (('c) (multinomial '(b a d) '(1/3 1/3 1/3)))
    (('d) (multinomial '(b c a) '(1/3 1/3 1/3)))
    (else 'a)))

(unfold-N 2 'start transition)
}}
The unfold-N recursion builds up a list as it goes. The values corresponding to these partial lists are represented at internal application nodes in the computation trace; they appear in the square boxes running up the left-hand side of the trace. Each partial result depends only on the computation beneath it. In particular, each partial list depends only on the evaluations of multinomial that it dominates: those that returned the symbols the partial list contains.
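For readers less familiar with Church, here is a rough Python transcription of the bigram model above (the names and data layout are our own; the Church version is the authoritative one, and since each multinomial above is uniform over its support, `random.choice` suffices):

```python
import random

# Allowed successors for each symbol, mirroring the Church transition function.
TRANSITIONS = {
    "start": ["a", "b", "c", "d"],
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["b", "a", "d"],
    "d": ["b", "c", "a"],
}

def transition(symbol):
    # Uniform choice among the allowed successors (the 'else branch returns "a").
    return random.choice(TRANSITIONS.get(symbol, ["a"]))

def unfold_n(n, start):
    # Build the list of visited symbols, like unfold-N.
    if n == 0:
        return []
    return [start] + unfold_n(n - 1, transition(start))

sample = unfold_n(2, "start")  # e.g. ['start', 'c']
```

Each call to `random.choice` here corresponds to one random variable, one `multinomial` evaluation node, in the Church trace.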
As mentioned, the set of random variables sampled in an execution of a Church program consists of the entire set of evaluated expressions—nodes in the computation trace. Because of this, the notion of marginalization becomes especially important in the Church setting. What does it mean to marginalize away a random variable in a sampling setting? It means simply to ignore it. For instance, in the following program we are only interested in a single random variable: the value sampled by the HMM in the expression (unfold-hmm-N 3 'start transition observation).
{{#churchserv:
(debug-mode 'pretty-pictures true)

(define (unfold-hmm-N N start transition observation)
  (if (= N 0)
      '()
      (pair (observation start)
            (unfold-hmm-N (- N 1) (transition start) transition observation))))

(define (transition state)
  (cond ((equal? state 'start) (multinomial '(1 2 3 4) '(1/4 1/4 1/4 1/4)))
        ((equal? state 1) (multinomial '(2 3 4) '(1/3 1/3 1/3)))
        ((equal? state 2) (multinomial '(1 3 4) '(1/3 1/3 1/3)))
        ((equal? state 3) (multinomial '(1 2 4) '(1/3 1/3 1/3)))
        ((equal? state 4) (multinomial '(1 2 3) '(1/3 1/3 1/3)))
        (else 1)))

(define (observation state)
  (cond ((equal? state 'start) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4)))
        ((equal? state 1) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4)))
        ((equal? state 2) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4)))
        ((equal? state 3) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4)))
        ((equal? state 4) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4)))
        (else 'a)))

(unfold-hmm-N 3 'start transition observation)
}}
Suppose that we sampled the result '(a b a). Notice that there are many possible ways of sampling this sequence of symbols from our transition matrix: we could have sampled the state sequence '(1 2 3), or '(1 3 1), or many others. The Church program defines the joint distribution over all of these possibilities. Each computation trace has a probability score associated with it; that number is the score of the specific way in which the value was computed in that particular evaluation. However, by ignoring all the other variables in the Church program and focusing only on the value of interest, the final result, we marginalize those variables away. Imagine running this Church program repeatedly and keeping a histogram of the number of times each possible value of the final expression came up. In the limit, this histogram would converge to the true marginal distribution of that expression, marginalizing away all the other random variables (expressions) sampled during execution, including the particular hidden states. By collapsing all the traces with the same final value into a single bin, we sum across all the variables we are not interested in.
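The histogram procedure just described can be sketched in Python (an illustrative simulation, not the Church implementation; note that in this particular model the observation distribution happens to be uniform for every state, so `random.choice` over the four symbols matches every branch of `observation`):

```python
import random
from collections import Counter

# Allowed successor states, mirroring the Church transition function.
STATES = {"start": [1, 2, 3, 4], 1: [2, 3, 4], 2: [1, 3, 4],
          3: [1, 2, 4], 4: [1, 2, 3]}

def transition(state):
    return random.choice(STATES.get(state, [1]))

def observation(state):
    # Uniform over the symbols for every state, as in the Church model.
    return random.choice(["a", "b", "c", "d"])

def unfold_hmm_n(n, start):
    if n == 0:
        return ()
    return (observation(start),) + unfold_hmm_n(n - 1, transition(start))

# Marginalize by ignoring the hidden states: bin runs only by the observed
# sequence. Traces with different state sequences but the same observations
# fall into the same bin, which is exactly the marginal sum.
hist = Counter(unfold_hmm_n(3, "start") for _ in range(10000))
```

In the limit of many runs, the normalized counts in `hist` converge to the marginal distribution over observation sequences.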
Degrees of Belief: Bayesian Interpretation of Probability
At this point it is convenient to emphasize an aspect of probabilistic modeling that seems deceptively trivial, but will come up repeatedly when we talk about inference. In Bayesian statistics we think of probabilities as being degrees of belief. Our generative models reflect world knowledge and the probabilities that we assign to the possible sampled values reflect how strongly we believe in that possibility. The laws of probability theory ensure that our beliefs remain consistent as we reason.
A consequence of coherent belief maintenance is known as the Law of Conservation of Belief (LCB). There are many ways to state the law of conservation of belief and, though simple, it has far-reaching effects. Here are some equivalent formulations.
- Sampling from a distribution selects exactly one possibility (in doing so it implicitly rejects all other possible values).
- The total probability mass of a distribution must sum to <math>1</math>. That is, we only have a single unit of belief to spread around.
- We can usefully think of belief as a "currency" that is "spent" by the probabilistic choices required to construct a sample. Since each choice that could have come out differently requires "spending" some currency, an outcome that requires more choices to construct will generally be more costly, i.e. less believable. Thus the law of conservation of belief acts as a sort of simplicity constraint.
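The "currency" intuition can be made concrete: the score of a trace is the product of the probabilities of its random choices, and since each factor is at most 1, every additional choice can only shrink, never grow, the total belief in the outcome. A small Python illustration (the choice probabilities here are taken from the bigram model above):

```python
def trace_probability(choice_probs):
    # The score of a trace is the product of the probabilities of the
    # random choices made along the way.
    p = 1.0
    for q in choice_probs:
        p *= q
    return p

short_trace = [1/4, 1/3]              # two choices from the bigram model
long_trace = [1/4, 1/3, 1/3, 1/3]     # the same prefix plus two more choices

# The longer trace "spends" more belief: 1/108 versus 1/12.
assert trace_probability(long_trace) < trace_probability(short_trace)
```

This is the sense in which the LCB acts as a simplicity constraint: outcomes built from fewer probabilistic choices retain more of the single unit of belief.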