The Meaning of Probabilistic Programs
- To return to the top level: ESSLLI Tutorial.
Contents |
Distributions and Probability Notation
A probability distribution is a map from a discrete space of "events" to <math>[0,1]</math> with integral 1. (Over a continuous space one considers probability densities instead.) The probability of a given event <math>a \in A</math> is often written <math>P(A=a)</math> and the distribution is often written <math>P(A)</math>. When used in a concrete situation a probability distribution <math>P(A)</math> is often called a random variable. (Technically a probability distribution is a push-forward measure, see [1].)
In Bayesian statistics we generally think of probabilities as being degrees of belief. Our generative models reflect world knowledge and the probabilities that we assign to the possible sampled values reflect how strongly we believe in that possibility. So <math>P(A=a)</math> represents our degree of belief in the random variable having value <math>a</math>.
The deceptively simple requirement that probabilities the total probability mass of a distribution must sum to <math>1</math> (that is, we only have a single unit of "belief" to spread around), has far reaching consequences, such as the Bayesian Occam's Razor. We sometimes call this unit belief requirement the Law of Conservation of Belief (LCB). From the point of view of sampling, this requirement follows naturally: Sampling from a distribution selects exactly one possibility (in doing so it implicitly rejects all other possible values).
We can usefully think of belief as a "currency" that is "spent" by the probabilistic choices required to construct a sample. Since each choice that could come out different requires "spending" some currency, an outcome that requires more choices to construct is will generally be more costly, i.e. less believable. Thus the law of conservation of belief implies a sort of simplicity constraint.
The Duality Between Probabilistic Programs and Computable Distributions
We have informally described Church programs as defining probability distributions on possible output values, and hinted that any probability distribution can be represented this way. This duality theorem is true, though with some restrictions on both sides: any Church program that halts almost always defines a distribution on its output values, and any computable probability distribution can be represented by a Church program. The details of this correspondence in the general setting are subtle (see Freer and Roy, AISTATS 2010, and references therein).
Distributions on <math>\mathbb N</math>
In the case of distributions over natural numbers the duality is more straightforward. We begin by defining a computable distribution using a computable (lazy) representation of real numbers to capture probabilities.
Definition: A lazy probability is a function <math>a: {\mathbb N} \rightarrow \{0,1\}</math>, where <math>a(i)</math> represents the ith binary digit of <math>a</math>.
Definition: A computable probability distribution on the natural numbers is a function <math>P: {\mathbb N} \rightarrow ( {\mathbb N} \rightarrow \{0,1\}) </math> from natural numbers to lazy probabilities.
Thus, as a Church function a computable probability distribution looks schematically like:
(define (dist n) (lambda (i) ....))
Definition: Let sampler
be a Church thunk and dist
be a computable probability distribution dist
on the natural numbers. We say that dist
is induced by sampler
if: <math>lim_{S\rightarrow\infty}</math>(mean (repeat S (lambda () (= n (sampler)))))
converges almost surely to (dist n)
for all n
. If dist
is induced by sampler
we say that sampler
samples from dist
.
Theorem: For any Church thunk sampler
that halts almost-always, and returns a natural number whenever it halts, there exists a computable probability distribution on the natural numbers induced by sampler
.
Proof (sketch): A non-constructive proof follows from the law of large numbers. (This doesn't show that the induced distribution is computable, which requires constructing a procedure from sampler
which computes the probability of a return value to a given precision; this can be done by lazily enumerating possible evaluation paths of sampler
.)
Theorem: For any computable probability distribution dist
on the natural numbers there exists a Church thunk sampler
that samples from dist
.
Proof: We first give a higher-order procedure that constructs a sampling procedure from a computable probability distribution. We then show that this procedure halts almost-always and that the constructed sampler does in fact sample from the distribution.
(define (sampler dist) (lambda () (find-x (lazy-uniform) dist 1))) (define (lazy-uniform) (mem (lambda (digit) (if (flip) 1 0)))) (define (find-x y dist min-x) (if (lazy-< (cumsum dist min-x) y) (find-x y dist (+ min-x 1)) min-x)) (define (cumsum dist n) (if (= n 0) lazy-0 (lazy-+ (dist n) (cumsum dist (- n 1))))) (define (lazy-< a b) (inner-lazy-< a b 1)) (define (inner-lazy-< a b start-digit) (if (eq? (a start-digit) (b start-digit)) (inner-lazy-< a b (+ start-digit 1)) (eq? 1 (b start-digit)))) (define (lazy-+ a b) (lambda (digit) (mod (+ (a digit) (b digit) (cary a b (- digit 1))) 2))) (define (cary a b digit) (if (= digit 0) false (if (< 1 (+ (a digit) (b digit) (cary a b (- digit 1)))) 1 0)))
(find-x y dist min-x) finds the greatest natural number x (greater than min-x) such that the total probability of {1..x} is less than y.
To show that constructed sampler function halts almost always, it is only necessary to show that each recursive function in the above definition does so. cary
and cumsum
recurse down an integer argument to base case 0, so they halt. find-x
will halt almost always if lazy-<
almost always halts and if for any y
(lazy-< (cumsum dist min-x) y)
is almost always true for some min-x
.
(lazy-< a b)
will halt unless (= (a i) (b i))
for all i
. Because one argument to lazy-<
is always a sample from lazy-uniform
the two arguments are equal with probability <math>lim_{i\rightarrow\infty}(0.5)^i = 0</math>.
(TODO: Show that the sampler has the right expectation.)
Probability Calculus
In statistics we are often interested in joint distributions, the distribution over sets of random variables. If we are interested in random variables <math>A</math> and <math>B</math> we notate the joint distribution as:
- <math>P(A, B)</math>
Defining the complete joint distribution means specifying the probability over every combination of values that <math>A</math> and <math>B</math> can take. For instance, if each variable <math>A</math> and <math>B</math> is a boolean variable then we can define the joint distribution by means of a table such as the following.
<math>B=</math>true | <math>B=</math>false | |
---|---|---|
<math>A=</math>true | .1 | .3 |
<math>A=</math>false | .5 | .1 |
Marginalization and Independence
The marginal distributions <math>P(A)</math> and <math>P(B)</math> can be derived via marginalization. This is the process of summing the values of the probabilities for each variable over the values of the other variable. When we represent the joint distribution as a table like this, this amounts to summing over rows and columns, noting the sums in the margins of the table—hence the name.
<math>B=</math>true | <math>B=</math>false | <math>P(A)</math> | |
---|---|---|---|
<math>A=</math>true | .1 | .3 | .4 |
<math>A=</math>false | .5 | .1 | .6 |
<math>P(B)</math> | .6 | .4 |
In general, the distribution on values of one random variable can depend on the distribution of values of many other random variables in complex ways. In the worst case, we might have to define the joint distribution for each and every combination of values one at a time as we did above. However, the structure of most probabilistic models includes various conditional independence assumptions. For example, if we are in a model where all the variable are independent then the joint distribution factorizes into the product of all the marginals.
- <math>P(A, B)=P(A)P(B)</math>
<math>B=</math>true | <math>B=</math>false | <math>P(A)</math> | |
---|---|---|---|
<math>A=</math>true | <math>.1=P(A=true)P(B=true)</math> | <math>.4=P(A=true)P(B=false)</math> | <math>.5</math> |
<math>A=</math>false | <math>.1=P(A=false)P(B=true)</math> | <math>.4=P(A=false)P(B=false)</math> | <math>.5</math> |
<math>P(B)</math> | .2 | .8 |
Marginalization and Independence in Church
What is the set of random variables defined by a Church program? In a Church program, every evaluation of an expression is a random variable. Some models even sample an unbounded number of random variables. How are conditional independencies expressed in Church?
- (FIXME below here is out of date.)
The conditional independence assumptions made by a Church program are partially expressed by the program's computation trace. The computation trace is a graph that represents the process of evaluating a Church program. A particular node in the trace, corresponding to the evaluation of an expression (or the application of a procedure) is dependent on all of the nodes which it dominates. Consider the simple probabilistic bi-gram model below.
{{#churchserv: (debug-mode 'pretty-pictures true) (define (unfold-N N start transition)
(if (= N 0) '() (pair start (unfold-N (- N 1) (transition start) transition))))
(define (transition symbol)
(case symbol (('start) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4))) (('a) (multinomial '(b c d) '(1/3 1/3 1/3))) (('b) (multinomial '(a c d) '(1/3 1/3 1/3))) (('c) (multinomial '(b a d) '(1/3 1/3 1/3))) (('d) (multinomial '(b c a) '(1/3 1/3 1/3))) (else 'a)))
(unfold-N 2 'start transition) }}
The unfold-N recursion build up a list as it goes. The values corresponding to these partial lists are represented at internal application nodes in the computation trace. These values are in the square boxes running up the left-hand side of the trace. These partial results only depend on the computation under them. In particular these values only depend on the evaluations of multinomial which returned symbols contained in the partial solutions and which they dominate.
As mentioned, the set of random variables sampled in an execution of a Church program consists of the entire set of evaluated expressions—nodes in the computation trace. Because of this, the notion of marginalization becomes especially important in the Church setting. What does it mean to marginalize away a random variable in a sampling setting? It means simply to ignore it. For instance, in the following program we are only interested in a single random variable, the value sampled by the HMM in expression (unfold-hmm-N 3 'start transition observation).
{{#churchserv: (debug-mode 'pretty-pictures true) (define (unfold-hmm-N N start transition observation)
(if (= N 0) '() (pair (observation start) (unfold-hmm-N (- N 1) (transition start) transition observation))))
(define (transition state)
(cond ((equal? state 'start) (multinomial '(1 2 3 4) '(1/4 1/4 1/4 1/4))) ((equal? state 1) (multinomial '(2 3 4) '(1/3 1/3 1/3))) ((equal? state 2) (multinomial '(1 3 4) '(1/3 1/3 1/3))) ((equal? state 3) (multinomial '(1 2 4) '(1/3 1/3 1/3))) ((equal? state 4) (multinomial '(1 2 3) '(1/3 1/3 1/3))) (else 1)))
(define (observation state)
(cond ((equal? state 'start) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4))) ((equal? state 1) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4))) ((equal? state 2) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4))) ((equal? state 3) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4))) ((equal? state 4) (multinomial '(a b c d) '(1/4 1/4 1/4 1/4))) (else 'a)))
(unfold-hmm-N 3 'start transition observation) }}
Suppose that we sampled the result '(a b a). Notice that there are many possible ways of sampling this sequences of symbols from our transition matrix. We could have sampled the state sequence '(1 2 3) to get this sequence of symbol or the state sequence '(1 3 1), or many others. The Church program defines the joint distribution over all of these possibilities. Each computation trace has a probability score associated with it. That number is the score of the specific way in which the value was computed in that particular evaluation. However, by ignoring all the other variable in the Church program and only focusing on the value of interest—the final result, we marginalize these variables away. Imagine running this Church program repeatedly and keeping a histogram of the number times each possible value of the final expression came up. In the limit this histogram would converge to the true marginal distribution of this expression, marginalizing away all the other random variables (expressions) sampled during execution including the particular states. By collapsing all the traces with the same final value into bin, we sum across all the variables we are not interested in.
query and Conditioning in Probability Theory
In probability theory if we have a joint distribution <math>P(A,B)</math>, we can define the conditional distribution of <math>A</math> given that <math>B</math> has the value <math>b_1</math> by the following formula.
- <math>P(A|B=b_1)=\frac{P(A,B=b_1)}{P(B=b_1)}</math>
Note that the denominator is equal to the probability of <math>B=b_1</math> in the marginal distribution <math>P(B)</math>.
- <math>P(A|B=b_1)=\frac{P(A,B=b_1)}{\sum_A P(A, B=b_1)}</math>
To understand what this means it is useful to refer back to our earlier example of a joint probability distribution.
<math>B=</math>true | <math>B=</math>false | <math>P(A)</math> | |
---|---|---|---|
<math>A=</math>true | .1 | .3 | .4 |
<math>A=</math>false | .5 | .1 | .6 |
<math>P(B)</math> | .6 | .4 |
When we condition on the <math>B=b_1</math> we select the first column in the table above, which corresponds to all of the combinations in the joint distribution where the constraint holds. However, we are interested in the probability of <math>A</math> in this set of circumstances. This means that we need to renormalize so that the probabilities sum to one again. The normalization constant which will make these numbers sum to one, is just the total probability mass present in that column, in other words the marginal probability <math>P(B=b_1)</math>.
<math>B=</math>true | <math>B=</math>false | <math>P(A)</math> | |
---|---|---|---|
<math>A=</math>true | .1666 | .75 | |
<math>A=</math>false | 0.8333333 | .25 | |
<math>P(B)</math> | .6 | .4 |
The important thing to notice about the process of conditionalization is just that it renormalizes the joint probabilities in the part of the space where the condition is true. In other words the relative strength of belief in the outcomes of <math>A</math> which are consistent with the conditioner stays the same. Conditioning is just multiplying by a normalizing constant. This is an example of the law of conservation belief: all else being equal, other than the changes that are implied by our conditioning statement, keep all of our other beliefs the same.
Now consider rejection-query as described in the last section. Remember that it works by sampling from the generative model and throwing away those samples that don't satisfy the conditioner. The samples from the generative model are samples distributed according to the joint distribution over the variables specified by our Church program. By throwing away the samples that are inconsistent with the conditioner we are implicitly renormalizing using just that part of the space that matches the conditioner. In other words, rejection-query is an implementation of conditioning.
Bayes' Rule and query
Often presentations of Bayesian methods begin with a statement of Bayes' rule.
- <math>P(H|D=d) = \frac{P(D=d|H) P(H)}{\sum_{H} P(D=d|H) P(H)}</math>
Where <math>H</math> is a random variable ranging over hypotheses and <math>D</math> is a random variable ranging over data, and we condition on some particular instantiation of a data set. The left-hand side of this equation, <math>P(H|D=d)</math> is called the posterior distribution over hypotheses over data. <math>P(D=d|H)</math> is called the likelihood of the hypothesis, <math>P(H)</math> is called the prior distribution over hypotheses and
- <math>P(D) = \sum_{H} P(D=d|H) P(H)</math>
is called the evidence or marginal likelihood of the data.
The importance of Bayes' rule lies in the fact that while generative models express the distribution <math>P(D|H) P(H)</math>, the induction, learning, or inference problem is usually expressed by the distribution <math>P(H|D)</math>. Bayes' rule gives us formula for linking these distributions with one another. In terms of query Bayes' rule can be thought in the following way.
(query '(define a ...) ;marginalizing over many variables '(define b ... a...) ... '(define o ... a... b...) 'b ;hypothesis '(equal? o data)) ;data
The hypothesis <math>H</math> in Bayes' rule corresponds to whatever random variable is expressed in our query-expression. The data <math>D</math> is encoded in our conditioner. As discussed above a Church program may sample an unbounded number of random variables. Ignoring sampled variables is the Church equivalent of marginalization. Thus a Church query will marginalize away all the variables besides the one corresponding to the query expression. The statement of Bayes' rule above is deceptively simple. The description of Bayes' rule in terms of query shows that the hypothesis corresponds to whatever random variable you are interested in, and the data corresponds to whatever condition you wish to reason hypothetically about. The model that you use to do so can be arbitrarily complex.
Bayes' rule is really just about conditioning, and follows from the definition of a conditional distribution together with the chain rule. The chain rule says that the joint distribution over a set of random variables <math>{R_1, ..., R_n}</math> can always be rewritten in the following way.
- <math>P(R_1, ..., R_n) = P(R_1)P(R_2 | R_1)P(R_3 | R_1, R_2) ... P(R_n|R_1,...R_{n-1}) </math>
In fact, the joint can be rewritten in this way using any ordering of the variables. To see that this is true consider Church marginalization (forgetting) alongside rejection-query. Suppose we have access to a generative model defining the joint distribution <math>P(R_1, ..., R_n) </math>. First, sample a tuple of values for each random variable and forget everything besides the value <math>R_1=r_1</math>. Next use rejection-query to take a sample from the joint, conditional on <math>R_1=r_1</math>. Notice that the probability of <math>P(R_1=r_1)P(R_2=r_2|R_1=r_1)</math> is equal to the original probability of getting <math>P(R_1=r_1,R_2=r_2)</math>, marginalizing over the other variables. This process can be carried on for any number of variables.
The definition of conditional probability tells us that that the following equation holds.
- <math>P(H|D=d) = \frac{P(D=d, H)}{\sum_{H} P(D=d|H) P(H)}</math>
The chain rule gives us Bayes' rule by rewriting the numerator.
- <math>P(H|D=d) = \frac{P(D=d|H)P(H)}{\sum_{H} P(D=d|H) P(H)}</math>
The general situation expressed by query is the following.
<math>P(R_i |R_j=r) = \sum_{R_{l \neq i,j}} \left [ \frac{P(R_1, ..., R_j=r,..., R_n)}{\sum_{R_{k \neq j}} P( R_1, ..., R_j=r,..., R_n)} \right ]</math>