# Occam's Razor, and the Law of Conservation of Belief (ESSLLI)

"Entities should not be multiplied without necessity." -- William of Ockham

In learning, perceiving or thinking about the world, we are fitting models to the data of experience. Typically our hypothesis space will span models varying greatly in complexity: some models will have many more free parameters or degrees of freedom than others. Under traditional approaches to model fitting where we adjust each model's parameters until it fits best, a model with strictly more free parameters or more adjustable knobs will tend to be preferred regardless of whether it actually comes closer to describing the true processes that generated the data. But this is not the way the mind works. We assess models with a natural eye for Occam's razor, balancing fit to the data with model complexity in subtle ways that does not inevitably prefer the most complex model, in order to avoid "over-fitting" our data and ensure successful generalizations and predictions.

Fitting curves (or smooth functions) to sparsely sampled, noisy data provides a familiar example of the problem.

The figure above shows three polynomial fits: a 1st-order (linear), a 2nd-order (quadratic), and a 12th-order polynomial. How do our minds decide which of these functions provides the best account of the data? The 1st-order model captures the rough trend of the data but seems too coarse; it attributes some of the variation that we see as "signal" to "noise". The 2nd-order model seems best; it seems to be just complex enough to fit the main shape of the data without over-fitting the noise. The 12th-order model seems ridiculously over-fit; with 13 data points, the parameters can be adjusted so as to make the curve pass exactly through every data point, thus taking as "signal" all of what we see as "noise". Again we can think of this as a causal inference problem: each function represents a hypothesis about the causal process that gave rise to the observed data, including the shape of the function from inputs to outputs, as well as the level of noise added to the output.

An elegant and powerful form of Occam's razor arises in the context of Bayesian inference, known as the Bayesian Occam's razor. Bayesian Occam's razor refers to the fact that "more complex" hypotheses about the data are penalized automatically in conditional inference. The notion of complexity here is "flexibility": a hypothesis which is flexible enough to generate many different observations is more complex, and will be less likely. Because more complex hypotheses describe a greater variety of data sets, they have to assign a lower probability to each one—a consequence of the law of conservation of belief. When we condition on some data, all else being equal, the posterior distribution over the hypotheses will favor the simpler ones because they have the "tightest" fit to the observations.

From the standpoint of probabilistic programming, the Bayesian Occam's razor is essentially inescapable. We don't judge models based on their best fitting behavior but rather on their average behavior. No fitting per se occurs during conditional inference. Instead, we draw conditional samples from each model representing the model's likely ways of generating the data. A model which tends to fit the data well on average -- to produce relatively many generative histories with that are consistent with the data -- will do better than a model that can fit better only for atypical parameter settings but worse on average. We sometimes call this the Law of Conservation of Belief. If beliefs are causal histories of how the observed data were generated, and degrees of belief are the relative frequencies with which certain histories are sampled, then a model has a fixed amount of belief mass it can spread around the space of possible histories. If you believe A very strongly, you can't believe B or C very strongly, because in order to believe A strongly you must be generating A samples most of the time.

# The Size Principle

A simple case of Bayes Occam's razor comes from the size principle (Tenenbaum and Griffiths, 2001): Of the many hypotheses consistent with the data the smallest rapidly become the most probable as more data points are observed. The following Church program demonstrates this with a very simple model. Here we have two hypothesized sets of symbols: Big has 10 elements and Small has 3 elements. We choose one of the hypotheses at random and sample some number of symbols from it uniformly.

With a single observed a, we already favor hypothesis Small. What happens when we increase the amount of data we condition on?

As can be seen by this example, as the number of data points increases, the hypothesis Small rapidly comes to dominate the posterior distribution. Why is this happening? We sample observations uniformly from hypotheses; the law of conservation of belief tells us that under these circumstances the probability of a draw from Big is [itex]1/10[/itex], while the probability of a draw from Small is [itex]1/3[/itex]. Thus the probability of drawing a set of observations from Big is [itex]1/10[/itex] raised to the number of observations, while the probability of drawing a set of observations from Small is [itex]1/3[/itex] raised to the number of observations. Clearly the later number decreases in the number of observations much more slowly than the former. The posterior distribution over hypotheses is given by:

[itex]P(hypothesis | observations) \propto P(observations|hypothesis)P(hypothesis)[/itex]

Because our hypotheses are equally likely, this simplifies to

[itex]P(hypothesis | observations) \propto P(observations|hypothesis)[/itex]

So we see that the the posterior distribution over hypotheses in this case is just the normalized likelihood of the observations given hypotheses. Because Small is "simpler" it quickly dominates the posterior.

We note that the size principle is related to an influential proposal of idea in linguistics known as the subset principle. Intuitively the subset principle suggests that when two grammars both account for the same data, the grammar that generates a smaller language should be preferred.[1]

## Example: Concept Learning

This rectangle learning game illustrates the role of the size principle in concept learning. The examples are assumed to be random samples from some unknown rectangle. Given the examples, what is the rectangle they came from? Explore how the concept learned varies as a function of the number and distribution of example points:

## The Size Principle and Implicit Negative Evidence

One way to understand the Size Principle is that probabilistic inference takes into account implicit negative evidence. More flexible hypotheses could have generated more observations. Thus if those hypotheses were the true hypotheses we would expect to see a greater variety of observations. If the data does not contain them, this is a form of negative evidence against those hypotheses. Importantly, the Size Principle tells us that the prior distribution on hypotheses does not have to penalize complexity. The complexity of the hypothesis itself will lead to its being disfavored in the posterior distribution.

# The Law of Conservation of Belief

At this point it is convenient to emphasize an aspect of probabilistic modeling that seems deceptively trivial, but comes up repeatedly when thinking about inference. In Bayesian statistics we think of probabilities as being degrees of belief. Our generative model reflects world knowledge and the probabilities that we assign to the possible sampled values reflect how strongly we believe in each possibility. The laws of probability theory ensure that our beliefs remain consistent as we reason.

A consequence of coherent belief maintenance is known as the Law of Conservation of Belief (LCB). Here are two equivalent formulations of this principle:

1. Sampling from a distribution selects exactly one possibility (in doing so it implicitly rejects all other possible values).
2. The total probability mass of a distribution must sum to [itex]1[/itex]. That is, we only have a single unit of belief to spread around.

The latter formulation leads to a common metaphor in discussing generative models: We can usefully think of belief as a "currency" that is "spent" by the probabilistic choices required to construct a sample. Since each choice requires "spending" some currency, an outcome that requires more choices to construct is will generally be more costly, i.e. less probable.

A consequence of the law of conservation of belief is that the relative belief in two propositions will be unchanged by conditioning "all other things being equal". More precisely, when conditioning on a proposition some things which were previously possible will become impossible (i.e. have probability 0); except where probabilities must change to make the proposition true, the ratio of probabilities of other propositions will stay stay the same. Understanding when probabilities "must change" can be subtle because it is a property of the complete dependency structure of the generative model.

# Generalizing the Size Principle: Bayes Occam's Razor

In our example above we have illustrated Bayes Occam's razor with examples based strictly on the size of the hypotheses involved, however, the principle is more general. Bayes' Occam razor says that all else being equal the hypothesis that assigns the highest likelihood to the data will dominate the posterior. Because of the law of conservation of belief, assigning higher likelihood to the observed data means assigning lower likelihood to other possible data. Consider the following example

In this example, unlike the size principle cases above, both hypotheses have the same support. However, hypothesis A is skewed toward examples a and b—while it can produce c or d, it is less likely to do so. In this sense hypothesis A is less flexible than hypothesis B. The data set we conditioned on also has exemplars of all the elements in the support of the two hypotheses. However, because there are more exemplars of elements favored by hypothesis A, this hypothesis is favored in the posterior. The Bayesian Occam's razor emerges naturally here, and is often described as favoring the less flexible hypothesis.

Similar effects emerge if hypothesis B is not uniform, but favors different examples that hypothesis A:

Try changing the observed letters to d c d a b. How does the inference change? Note that it takes less evidence here to favor hypothesis A that when B is uniform, but still more that in a size principle case (where A wouldn't be able to generate c or d at all)—but the size principle case would be unable to handle the "exceptional" observations of c and d. This examples suggest another way to understand Bayes Occam's razor: the posterior distribution will favor hypotheses for which the data set is simpler in the sense that it is more "easily generated." Here more "easily" generated, means generated with higher probability. We will see a more striking example of this for compositional models at the end of this section of the tutorial.

### Example: Blocking

Blocking is a basic phenomenon in language comprehension and production that naturally embodies the Bayesian Occam's razor.

# Model selection with the Bayesian Occam's Razor

The law of conservation of belief turns most clearly into Occam's Razor when we consider models with more internal structure: some continuous or discrete parameters that at different settings determine how likely the model is to produce data that look more or less like our observations. To select among models we simply need to describe each model as a probabilistic program, and also to write a higher-level program that generates these hypotheses. Church query will then automatically draw samples at both of these levels of abstraction, asking which models are most likely to have given rise to the observed data, as well as for each of those models, which internal parameter settings are most likely.

## Example: Fair or unfair coin?

In the previous section we considered learning about the weight of a coin, and noted that a simple prior on weights seemed unable to capture our more discrete intuition that we first decide if the coin if fair or not, and only then worry about its weight. This example shows how our inferences about coin flipping can be explained in terms of model selection guided by the Bayesian Occam's razor. Imagine a coin that you take out of a freshly unwrapped roll of quarters straight from the bank. Almost surely this coin is fair... But how does that sense change when you see more or less anomalous sequences of flips? We can simultaneously ask if the coin is fair, and what is its weight.

Try these cases and see if the inferences accord with your intuitions:

``` (h h t h t h h h t h) -> fair coin, probability of H = 0.5
```
``` (h h h h h h h h h h) -> ?? suspicious coincidence, probability of H = 0.5 ..?
```
``` (h h h h h h h h h h h h h h h) -> probably unfair coin, probability of H near 1
```
``` (h h h h h h h h h h h h h h h h h h h h) -> definitely unfair coin, probability of H near 1
```
``` (h h h h h h t h t h h h h h t h h t h h h h h t h t h h h h h t h h h h h
h h h h t h h h h t h h h h h h h h) -> unfair coin, probability of H = 0.85
```

## Example: Curve Fitting

This example shows how the Bayesian Occam's Razor can be used to select the right order of a polynomial fit. Choose one of two polynomials we have predefined, represented by true-coeffs-1 or true-coeffs-2. This program will sample data from it (at points given by x-vals) and consider a space of models comprising polynomials from degree 0 through 3.

# Example: Scene Inference

Imagine your are in a world of colored blocks that typically looks something like this:

And one day you see this 1x2 red patch... is it one 1x2 block or two 1x1 blocks?

We can model this inference by building a generative model of scenes. To do so we use simple models of geometry and of rendering geometric objects to an image (in this case by layering them):

There is a slight preference for one object over two.

Now let's see what happens if we see the two "halves" moving together. We use a simple random drift model of motion: an object randomly moves right or left or stays in place. Critically, the motion of each object is independent of other objects.

We see that there is a much stronger preference for the one object interpretation. It is a much bigger coincidence (as measured by Bayes Occam's razor) for the to halves to move together if they are unconnected, than if they are connected, or if they are merely seen statically next to each other.

We see from these examples that several of the gestalt principles for perceptual grouping emerge from this probabilistic scene inference setup (in this case "good continuation" and "common fate"). However, these are graded inferences, rather than hard rules: in this case we found that the information available in a static image was much weaker that the information in a moving image (and hence good continuation was weaker than common fate). This effect may have important developmental implications: the psychologist Elizabeth Spelke has found that young infants do not group objects by any static features (such as good continuation) but they do group them by common motion (see Spelke, 1990, Cognitive Science).

1. The term subset principle is usually used in linguistics to refer to the notion that a grammar that generates a smaller language should be preferred to one that generates a larger language. However, the name originally was introduced by Bob Berwick to refer to a result due to Dana Angluin giving necessary and sufficient conditions for Gold-style learnability of a class of languages. Essentially it states that a class of languages is learnable in the limit using this principle if every language in the class has a characteristic subset.