# Occam's Razor

"Entities should not be multiplied without necessity." -- William of Ockham

In learning, perceiving or thinking about the world, we are fitting models to the data of experience. Typically our hypothesis space will span models varying greatly in complexity: some models will have many more free parameters or degrees of freedom than others. Under traditional approaches to model fitting where we adjust each model's parameters until it fits best, then choose the best-fitting model — a model with strictly more free parameters will tend to be preferred regardless of whether it actually comes closer to describing the true processes that generated the data. But this is not the way the mind works. We assess models with a natural eye for complexity, balancing fit to the data with model complexity in subtle ways that will not inevitably prefer the most complex model. Instead we often seem to judge models using Occam's razor: we choose the least complex hypothesis that fits the data well. In doing so we avoid "over-fitting" our data in order to support successful generalizations and predictions.

Fitting curves (or smooth functions) to sparsely sampled, noisy data provides a familiar example of the problem.

The figure above shows three polynomial fits: a 1st-order (linear), a 2nd-order (quadratic), and a 12th-order polynomial. How do our minds decide which of these functions provides the best account of the data? The 1st-order model captures the rough trend of the data but seems too coarse; it attributes some of the variation that we see as "signal" to "noise". The 2nd-order model seems best; it seems to be just complex enough to fit the main shape of the data without over-fitting the noise. The 12th-order model seems ridiculously over-fit; with 13 data points, the parameters can be adjusted so as to make the curve pass exactly through every data point, thus taking as "signal" all of what we see as "noise". Again we can think of this as a causal inference problem: each function represents a hypothesis about the causal process that gave rise to the observed data, including the shape of the function from inputs to outputs, as well as the level of noise added to the output.

# Occam's Razor

An elegant and powerful form of Occam's razor arises in the context of Bayesian inference, known as the Bayesian Occam's razor. Bayesian Occam's razor refers to the fact that "more complex" hypotheses about the data are penalized automatically in conditional inference. In many formulations of Occam's razor, complexity is measured syntactically: for instance, it may be the description length of the hypothesis in some representation language, or a count of the number of free parameters needed to pick out the hypothesis from some larger model class. Syntactic forms of Occam's razor have difficulty justifying the complexity measure on principled, non-arbitrary grounds. They also leave unspecified exactly how the weight of a complexity penalty should trade off with an alternative measure of fit to the data. Fit is intrinsically a semantic notion, a matter of correspondence between the model's predictions and our observations of the world. When complexity is measured syntactically and fit is measured semantically, they are intrinsically incommensurable and the trade-off between them will always be to some extent arbitrary.

In the Bayesian Occam's razor, both complexity and fit are measured semantically. The semantic notion of complexity is a measure of flexibility: a hypothesis that is flexible enough to generate many different sets of observations is more complex, and will tend to receive lower posterior probability than a less flexible, simpler hypothesis that explains the same data. Because more complex hypotheses can generate a greater variety of data sets, they must necessarily assign a lower probability to each one. When we condition on some data, all else being equal, the posterior distribution over the hypotheses will favor the simpler ones because they have the tightest fit to the observations.

From the standpoint of a sampling-based probabilistic programming language such as Church, the Bayesian Occam's razor is essentially inescapable. We do not judge models based on their best fitting behavior but rather on their average behavior. No fitting per se occurs during conditional inference. Instead, we draw conditional samples from each model representing the model's likely ways of generating the data. A model that tends to fit the data well on average -- to produce relatively many generative histories with that are consistent with the data -- will do better than a model that can fit better only for atypical parameter settings but worse on average.

## The Law of Conservation of Belief

It is convenient to emphasize an aspect of probabilistic modeling that seems deceptively trivial, but comes up repeatedly when thinking about inference. In Bayesian statistics we think of probabilities as being degrees of belief. Our generative model reflects world knowledge and the probabilities that we assign to the possible sampled values reflect how strongly we believe in each possibility. The laws of probability theory ensure that our beliefs remain consistent as we reason.

A consequence of belief maintenance is known as the Law of Conservation of Belief (LCB). Here are two equivalent formulations of this principle:

1. Sampling from a distribution selects exactly one possibility (in doing so it implicitly rejects all other possible values).
2. The total probability mass of a distribution must sum to $1$. That is, we only have a single unit of belief to spread around.

The latter formulation leads to a common metaphor in discussing generative models: We can usefully think of belief as a "currency" that is "spent" by the probabilistic choices required to construct a sample. Since each choice requires "spending" some currency, an outcome that requires more choices to construct it will generally be more costly, i.e. less probable.

It is this conservation of belief that gives rise to the Bayesian Occam's razor. A hypothesis that spends its probability on many alternatives that don't explain the current data will have less probability for the alternatives that do, and will hence do less well overall than a hypothesis which only entertains options that fit the current data. We next examine a special case where this tradeoff plays out clearly, the size principle, then come back to the more general cases.

# The Size Principle

A simple case of Bayes Occam's razor comes from the size principle [1]: Of hypotheses which generate data uniformly, the one consistent with the data and with smallest extension is the most probable.

The following Church program demonstrates the size principle with a very simple model. Here we have two hypothesized sets: Big has 6 elements and Small has 3 elements. The generative model chooses one of the hypotheses at random and samples some number of symbols from it uniformly. We then wish to infer the hypothesis given observed elements.

With a single observed a, we already favor hypothesis Small. What happens when we increase the amount of observed data? Consider the learning trajectory: As the number of data points increases, the hypothesis Small rapidly comes to dominate the posterior distribution. Why is this happening? We sample observations uniformly from hypotheses, the law of conservation of belief and the symmetry between observations imply that the probability of a draw from Big is $\frac{1}{6}$, while the probability of a draw from Small is $\frac{1}{3}$. Thus, by the product rule of probabilities, the probability of drawing a set of N observations from Big is $(\frac{1}{6})^N$, while the probability of drawing a set of observations from Small is $(\frac{1}{3})^N$. The later probability decreases much more slowly than the former as the number of observations increases. Using Bayes' rule, the posterior distribution over hypotheses is given by:

$P(hypothesis | observations) \propto P(observations|hypothesis)P(hypothesis)$

Because our hypotheses are equally probable a priori, this simplifies to

$P(hypothesis | observations) \propto P(observations|hypothesis)$

So we see that the the posterior distribution over hypotheses in this case is just the normalized likelihood $P(observations|hypothesis)$. The likelihood ratio $P(observations|Big)/P(observations|Small) = (\frac{1}{2})^N$ determines how quickly the simpler hypothesis Small comes to dominate the posterior.

The size principle is related to an influential proposal in linguistics known as the subset principle. Intuitively the subset principle suggests that when two grammars both account for the same data, the grammar that generates a smaller language should be preferred.[2]

## Example: The Rectangle Game

To illustrate the power of the size principle in inferring the most parsimonious explanation of data, consider learning a rectangle "concept" [3]. The data are a set of point in the plane, that we assume to be randomly sampled from within some unknown rectangle. Given the examples, what is the rectangle they came from? We can model this learning as conditional of the rectangle given the points. We plot the sampled rectangles as well as the posterior mean:

Explore how the concept learned varies as a function of the number and distribution of example points. Try varying the observed data and seeing how the inferred rectangle changes:

(define obs-data '((0.2 0.6) (0.2 0.8) (0.4 0.8) (0.4 0.6) (0.3 0.7)))
(define obs-data '((0.4 0.7) (0.5 0.4) (0.45 0.5) (0.43 0.7) (0.47 0.6)))
(define obs-data '((0.4 0.7) (0.5 0.4)))
(define obs-data '((0.4 0.7) (0.5 0.4) (0.46 0.63) (0.43 0.51) (0.42 0.45) (0.48 0.66)))

Compare this to the results of a slightly different causal process for the same observations, known as weak sampling:

## The Size Principle and Implicit Negative Evidence

One way to understand the Size Principle is that probabilistic inference takes into account implicit negative evidence. More flexible hypotheses could have generated more observations. Thus if those hypotheses were the true hypotheses we would expect to see a greater variety of observations. If the data does not contain them, this is a form of negative evidence against those hypotheses. Importantly, the Size Principle tells us that the prior distribution on hypotheses does not have to penalize complexity. The complexity of the hypothesis itself will lead to its being disfavored in the posterior distribution.

# Generalizing the Size Principle: Bayes Occam's Razor

In our example above we have illustrated Bayes Occam's razor with examples based strictly on the "size" of the hypotheses involved, however, the principle is more general. Bayes' Occam razor says that all else being equal the hypothesis that assigns the highest likelihood to the data will dominate the posterior. Because of the law of conservation of belief, assigning higher likelihood to the observed data requires assigning lower likelihood to other possible data. Consider the following example

In this example, unlike the size principle cases above, both hypotheses lead to the same possible observed values. However, hypothesis A is skewed toward examples a and b—while it can produce c or d, it is less likely to do so. In this sense hypothesis A is less flexible than hypothesis B. The data set we conditioned on also has exemplars of all the elements in the support of the two hypotheses. However, because there are more exemplars of elements favored by hypothesis A, this hypothesis is favored in the posterior. The Bayesian Occam's razor emerges naturally here, and is often described as favoring the less flexible hypothesis.

Similar effects emerge if hypothesis B is not uniform, but favors different examples that hypothesis A:

Try changing the observed letters to d c d a b. How does the inference change? Note that it takes less evidence here to favor hypothesis A that when B is uniform, but still more that in a size principle case (where A wouldn't be able to generate c or d at all)—but the size principle case would be unable to handle the "exceptional" observations of c and d. This examples suggest another way to understand Bayes Occam's razor: the posterior distribution will favor hypotheses for which the data set is simpler in the sense that it is more "easily generated." Here more "easily" generated, means generated with higher probability. We will see a more striking example of this for compositional models at the end of this section of the tutorial.

# Model selection with the Bayesian Occam's Razor

The law of conservation of belief turns most clearly into Occam's Razor when we consider models with more internal structure: some continuous or discrete parameters that at different settings determine how likely the model is to produce data that look more or less like our observations. To select among models we simply need to describe each model as a probabilistic program, and also to write a higher-level program that generates these hypotheses. Church query will then automatically draw samples at both of these levels of abstraction, asking which models are most likely to have given rise to the observed data, as well as for each of those models, which internal parameter settings are most likely.

## Example: Fair or unfair coin?

In the previous section we considered learning about the weight of a coin, and noted that a simple prior on weights seemed unable to capture our more discrete intuition that we first decide if the coin if fair or not, and only then worry about its weight. This example shows how our inferences about coin flipping can be explained in terms of model selection guided by the Bayesian Occam's razor. Imagine a coin that you take out of a freshly unwrapped roll of quarters straight from the bank. Almost surely this coin is fair... But how does that sense change when you see more or less anomalous sequences of flips? We can simultaneously ask if the coin is fair, and what is its weight.

Try these cases and see if the inferences accord with your intuitions:

(h h t h t h h h t h) -> fair coin, probability of H = 0.5
(h h h h h h h h h h) -> ?? suspicious coincidence, probability of H = 0.5 ..?
(h h h h h h h h h h h h h h h) -> probably unfair coin, probability of H near 1
(h h h h h h h h h h h h h h h h h h h h) -> definitely unfair coin, probability of H near 1
(h h h h h h t h t h h h h h t h h t h h h h h t h t h h h h h t h h h h h h h h h t h h h h t h h h h h h h h) -> unfair coin, probability of H = 0.85

Now let's look at the learning trajectories for this model: In general (though not on every run) the learning trajectory stays near 0.5 initially—favoring the simpler hypothesis that the coin is fair—then switches fairly abruptly to near 0.9—as it infers that it is an unfair coin and likely has high weight. Here the Bayesian Occam's razor penalizes the hypothesis with the flexibility to learn any coin weight: we automatically get a notion of comparing the complexity of very differently-structure models.

## The Effect of Unused Parameters

When statisticians suggest methods for model selection, they often include a penalty for the number of parameters[4] This seems like a worrying policy from the point of view of a probabilistic program: we could always introduce parameters that are not used, and therefore have no effect on the program. For instance we could change the above coin flipping example so that it draws the potential unfair coin weight even in the model which gives a fair coin: The two models now have the same number of free parameters (the unfair coin weight), but we will still favor the simpler hypothesis, as we did above. Why? The Bayesian Occam's razor penalizes models not for having more parameters (or longer code) but for too much flexibility — being able to fit too many other potential observations. Unused parameters (or parameters with very little effect) don't increase this flexibility, and hence aren't penalized. The Bayesian Occam's razor only penalizes complexity that matters for prediction, and only to the extent that it matters.

## Example: Curve Fitting

This example shows how the Bayesian Occam's Razor can be used to select the right order of a polynomial fit. This Church program samples polynomials up to order 3 that are (noisily) consistent with the observed data, then graphs the mean polynomial of each order.

Try the above code using a different data set generated from the same function:

(define obs-y-vals '(0.66 -0.32 -0.41 -0.59 -0.87 -0.75 -0.23 0.47 1.31 1.97 3.25))


Try a simpler data set that should also suggest a 2nd order polynomial:

(define x-vals (range -3 3))
(define obs-y-vals '(2 0.1 -1 -1.5 -0.8 0 1.9))


These data should suggest a 1st order polynomial:

(define x-vals (range -3 3))
(define obs-y-vals '(2 1.6 0.9 0.3 -0.1 -0.45 -1.2))


# Example: Scene Inference

Imagine you are in a world of colored blocks that typically looks something like this:

And one day you see this 1x2 red patch... is it one 1x2 block or two 1x1 blocks?

We can model this inference by building a generative model of scenes. To do so we use simple models of geometry and of rendering geometric objects to an image (in this case by layering them):

There is a slight preference for one object over two.

Now let's see what happens if we see the two "halves" moving together. We use a simple random drift model of motion: an object randomly moves right or left or stays in place. Critically, the motion of each object is independent of other objects.

We see that there is a much stronger preference for the one object interpretation. It is a much bigger coincidence (as measured by Bayes Occam's razor) for the two halves to move together if they are unconnected, than if they are connected, or if they are merely seen statically next to each other.

We see from these examples that several of the gestalt principles for perceptual grouping emerge from this probabilistic scene inference setup (in this case "good continuation" and "common fate"). However, these are graded inferences, rather than hard rules: in this case we found that the information available in a static image was much weaker that the information in a moving image (and hence good continuation was weaker than common fate). This effect may have important developmental implications: the psychologist Elizabeth Spelke has found that young infants do not group objects by any static features (such as good continuation) but they do group them by common motion (see Spelke, 1990, Cognitive Science).

1. Tenenbaum and Griffiths, 2001
2. The term subset principle is usually used in linguistics to refer to the notion that a grammar that generates a smaller language should be preferred to one that generates a larger language. However, the name originally was introduced by Bob Berwick to refer to a result due to Dana Angluin giving necessary and sufficient conditions for Gold-style learnability of a class of languages. Essentially it states that a class of languages is learnable in the limit using this principle if every language in the class has a characteristic subset.
3. Tenenbaum, 2000
4. See for instance the AIC.