# Mixture and Non-Parametric Models

In the section on Hierarchical Models we saw the power of probabilistic inference in learning about the latent structure underlying different kinds of observations: the mixture of colors in different bags of marbles, or the prototypical features of categories of animals. In that discussion we always assumed that we knew what kind each observation belonged to—the bag that each marble came from, or the subordinate, basic, and superordinate category of each object. Knowing this allowed us to pool the information from each observation for the appropriate latent variables. What if we don't know a priori how to divide up our observations?

In this section we explore several answers to the problem of simultaneously discovering the kinds and their properties.

# Learning Categories

Imagine a child who enters the world and begins to see objects. She can't begin by learning the typical features of cats or mice, because she doesn't yet know that there are such kinds of objects as cats and mice. Yet she may quickly notice that some objects tend to purr and have claws, while others are small and run fast—she can cluster the objects on the basis of common features and thus form categories (such as cats and mice), whose typical features she can then learn.

To formalize this learning problem, we begin by adapting the bags-of-marbles examples from the Hierarchical Models section. However, we now assume that the bag that each marble is drawn from is unobserved and must be inferred.

We have queried on whether various marbles came out of the same bag, instead of directly querying on the bag number. This is because the bag number is only useful in its role of determining which objects have similar properties. Formally, the model we have defined above is symmetric in the bag labels (if you permute all the labels you get a new state with the same probability). We see that it is likely that obs1 and obs2 came from the same bag, but quite unlikely that obs3 did. Why? Notice that we have set alpha small, indicating a belief that the marbles in a bag will tend to all be the same color. How do the results change if you make alpha larger? Why?

Instead of assuming that a marble is equally likely to come from each bag, we could instead learn a distribution over bags where each bag has a different probability. This is called a mixture distribution over the bags:
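To make the generative story concrete, here is a small sketch in Python rather than Church; the number of bags, the color set, and the Dirichlet pseudo-counts are invented for illustration.

```python
import random

random.seed(0)

colors = ["blue", "green", "red"]
num_bags = 3

def sample_dirichlet(alphas):
    # A Dirichlet sample is a vector of normalized Gamma draws.
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(weights, outcomes):
    # Draw one outcome with probability proportional to its weight.
    r = random.random() * sum(weights)
    for w, outcome in zip(weights, outcomes):
        r -= w
        if r <= 0:
            return outcome
    return outcomes[-1]

# Mixture weights over bags: some bags are drawn from more often than others.
bag_weights = sample_dirichlet([1.0] * num_bags)

# Each bag has its own color distribution; small pseudo-counts make each
# bag tend to favor one color.
bag_color_dists = [sample_dirichlet([0.1] * len(colors)) for _ in range(num_bags)]

def sample_marble():
    bag = sample_categorical(bag_weights, list(range(num_bags)))
    color = sample_categorical(bag_color_dists[bag], colors)
    return bag, color

marbles = [sample_marble() for _ in range(10)]
```

With small per-bag pseudo-counts, most marbles from a given bag share a color, which is what makes the bag labels learnable in the first place.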

Models of this kind are called mixture models because the observations are a "mixture" of several categories. Mixture models are used extremely widely in modern probabilistic modeling because they allow systems to learn the unobservable categories that underlie observable properties in the world.

Importantly, the observation distribution associated with each mixture component (i.e., kind or category) can be any kind of distribution we like. For example, here is a mixture model with Gaussian components.
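As a sketch of what a Gaussian mixture looks like as a distribution (in Python, with invented weights and component parameters), the mixture density is just a weighted sum of the component densities:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# Two Gaussian components, with invented weights and parameters.
weights = [0.6, 0.4]
means = [-2.0, 3.0]
sds = [1.0, 0.5]

def mixture_pdf(x):
    # The mixture density is the weighted sum of the component densities.
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Sanity check: the mixture density integrates to (approximately) 1.
grid_mass = sum(mixture_pdf(-10.0 + 0.01 * i) * 0.01 for i in range(2001))
```

To sample from the mixture, one would first choose a component according to the weights and then sample from that component's Gaussian.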

## Example: Topic Models

One very popular class of mixture-based approaches are topic models, which are used for document classification, clustering, and retrieval. The simplest kind of topic models make the assumption that documents can be represented as bags of words — unordered collections of the words that the document contains. In topic models, each document is associated with a mixture over topics, each of which is itself a distribution over words.

One popular kind of bag-of-words topic model is known as Latent Dirichlet Allocation (LDA).[1] The generative process for this model can be described as follows. For each document, mixture weights over a set of $$K$$ topics are drawn from a Dirichlet prior. Then $$N$$ topics are sampled for the document—one for each word. Each topic itself is associated with a distribution over words, and this distribution is drawn from a Dirichlet prior. For each of the $$N$$ topics drawn for the document, a word is sampled from the corresponding multinomial distribution. This is shown in the Church code below.
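The following Python sketch mirrors this generative process. For simplicity the topic-word distributions are fixed by hand rather than drawn from their own Dirichlet prior, and the vocabulary and probabilities are invented:

```python
import random

random.seed(1)

vocabulary = ["DNA", "evolution", "parsing", "syntax"]

# Fixed topic-word distributions (in full LDA these are themselves drawn
# from a Dirichlet prior; they are fixed here for clarity).
topics = {
    "topic1": [0.45, 0.45, 0.05, 0.05],  # 'biology'
    "topic2": [0.05, 0.05, 0.45, 0.45],  # 'linguistics'
}

def sample_dirichlet(alphas):
    # A Dirichlet sample is a vector of normalized Gamma draws.
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(weights, outcomes):
    r = random.random() * sum(weights)
    for w, outcome in zip(weights, outcomes):
        r -= w
        if r <= 0:
            return outcome
    return outcomes[-1]

def sample_document(n_words, alpha=1.0):
    topic_names = sorted(topics)
    # Per-document mixture weights over topics, from a Dirichlet prior.
    topic_weights = sample_dirichlet([alpha] * len(topic_names))
    words = []
    for _ in range(n_words):
        topic = sample_categorical(topic_weights, topic_names)  # one topic per word
        words.append(sample_categorical(topics[topic], vocabulary))
    return words

doc = sample_document(10)
```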

In this simple example, there are two topics, topic1 and topic2, and four words. These words are deliberately chosen to represent one of two possible subjects that a document can be about: one can be thought of as 'biology' (i.e., DNA and evolution), and the other can be thought of as 'linguistics' (i.e., parsing and syntax).

The documents consist of lists of individual words from one or the other topic. Based on the co-occurrence of words within individual documents, the model is able to learn that one of the topics should put high probability on the biological words and the other topic should put high probability on the linguistic words. It is able to learn this because different kinds of documents represent stable mixtures of different kinds of topics, which in turn represent stable distributions over words.

## Example: Categorical Perception of Speech Sounds

### The Perceptual Magnet Effect

(Adapted from: Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). The influence of categories on perception: Explaining the perceptual magnet effect as optimal statistical inference. Psychological Review, 116(4):752–782.)

# Unknown Numbers of Categories

The models above describe how a learner can simultaneously learn which category each object belongs to, the typical properties of objects in that category, and even global parameters about kinds of objects in general. However, these models suffer from a serious flaw: the number of categories is fixed. This is as if a learner, after finding out there are cats, dogs, and mice, must force an elephant into one of these categories, for want of more categories to work with.

The simplest way to address this problem is to place uncertainty on the number of categories in the form of a hierarchical prior; we call the resulting models unbounded models.

We have here used the Poisson distribution, which is a distribution on the non-negative integers. We have also used the special function gensym, which returns a fresh symbol every time it is called. It can be used to generate an unbounded set of labels for things like classes, categories, and mixture components. Each evaluation of gensym results in a unique (although cryptic) symbol. Importantly, these symbols can be used as identifiers, because two different calls to gensym will never be equal.
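A Python sketch of these two ingredients: a Poisson draw for the number of categories (sampled here with Knuth's algorithm, since the Python standard library has no Poisson sampler) and a gensym that mints a fresh label on every call.

```python
import itertools
import math
import random

random.seed(2)

def sample_poisson(lam):
    # Knuth's method: multiply uniforms until the product drops below exp(-lam);
    # the number of factors used, minus one, is Poisson(lam)-distributed.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

_counter = itertools.count()

def gensym():
    # A fresh, never-repeated label on every call.
    return "g{}".format(next(_counter))

num_categories = sample_poisson(4.0)
labels = [gensym() for _ in range(num_categories + 1)]  # at least one label
```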

Unbounded models give a straightforward way to represent uncertainty over the number of categories in the world. However, inference in these models often presents difficulties. We next describe another method for allowing an unknown number of things. In an unbounded model, there are a finite number of categories whose number is drawn from an unbounded prior distribution, such as the Poisson prior that we just examined. In the next section, we will see how it is possible to construct distributions over truly infinite numbers of objects.

## Infinite Discrete Distributions: The Dirichlet Process

We have seen several examples of mixture models where the mixture components are chosen from a multinomial distribution and the weights of the mixture components are drawn from a Dirichlet prior. Both multinomial and Dirichlet distributions are defined for fixed numbers of categories—now imagine generalizing this combination to a multinomial over infinitely many categories or components. This would solve the problem of "running out of categories," because there would always be more categories that hadn't yet been used for any observation.

Just as the Dirichlet distribution defines a prior on parameters for a multinomial with $$K$$ possible outcomes, the Dirichlet process defines a prior on parameters for a multinomial with $$K = \infty$$—an infinite number of possible outcomes.

First, we imagine drawing an infinite sequence of samples from a beta distribution with parameters $$1,\ \alpha$$ (recall that a beta distribution defines a distribution on the interval $$[0,1]$$). We write this infinite set of draws as $$\left\{\beta'_k\right\}_{k=1}^{\infty}$$.

$\beta'_k \sim \mathrm{Beta}\left(1,\alpha\right)$

Ultimately we would like to define a distribution on an infinite set of discrete outcomes that will represent our categories or mixture components, but we start by defining a distribution on the natural numbers. The probability of the natural number $$k$$ is given by:

$\beta_k = \prod_{i=1}^{k-1}\left(1-\beta'_i\right)\cdot\beta'_k$

How can this be interpreted as a generative process? Imagine "walking" down the natural numbers in order, flipping a coin with weight $$\beta'_i$$ for each one; if the coin comes up false, we continue to the next natural number; if the coin comes up true, we stop and return the current natural number. Convince yourself that the probability of getting natural number $$k$$ is given by $$\beta_k$$ above.
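A quick numerical check of this formula, in Python: with a fixed (invented) sequence of stick weights, the stopping probabilities $$\beta_k$$ sum to one.

```python
def stick_probability(k, stick_weights):
    # beta_k: walk past sticks 1..k-1 (coin comes up false each time),
    # then stop at stick k (coin comes up true).
    p = 1.0
    for i in range(k - 1):
        p *= 1.0 - stick_weights[i]
    return p * stick_weights[k - 1]

# Fixed, invented stick weights beta'_k; the final weight of 1.0
# guarantees the walk stops by stick 4, so the probabilities sum to 1.
weights = [0.5, 0.25, 0.8, 1.0]
probs = [stick_probability(k, weights) for k in range(1, len(weights) + 1)]
```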

To formalize this as a Church program, we define a procedure, pick-a-stick, that walks down the list of $$\beta'_k$$s (called sticks in the statistics and machine learning literatures) and flips a coin at each one: if the coin comes up true, it returns the index associated with that stick; if the coin comes up false, the procedure recurses to the next natural number.

    (define (pick-a-stick sticks J)
      (if (flip (sticks J))
          J
          (pick-a-stick sticks (+ J 1))))

    (define sticks
      (mem (lambda (index) (beta 1 alpha))))


pick-a-stick is a higher-order procedure that takes another procedure called sticks, which returns the stick weight for each stick. (pick-a-stick is also a recursive function—one that calls itself—we will explore recursive functions more in the section Recursive Models.)

Notice that sticks uses mem to associate a particular draw from beta with each natural number. When we call it again with the same index we will get back the same stick weight. This (crucially) means that we construct the $$\beta'_k$$s only "lazily" when we need them—even though we started by imagining an infinite set of "sticks" we only ever construct a finite subset of them.

We can put these ideas together in a procedure called make-sticks which takes the $$\alpha$$ parameter as an input and returns a procedure which samples stick indices.

my-sticks samples from the natural numbers by walking down the list starting at 1 and flipping a coin weighted by a fixed $$\beta'_k$$ for each $$k$$. When the coin comes up true it returns $$k$$; otherwise it keeps going. It does this by drawing the individual $$\beta'_k$$s lazily, only generating new ones when we have walked out further than the furthest previous time.[2]
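Here is a sketch of the same construction in Python, using memoization (lru_cache) to play the role of mem, so each stick weight is drawn lazily and then remembered:

```python
import random
from functools import lru_cache

def make_sticks(alpha, rng):
    @lru_cache(maxsize=None)
    def stick(k):
        # Lazily draw (and remember) beta'_k ~ Beta(1, alpha) for stick k.
        return rng.betavariate(1.0, alpha)

    def pick_a_stick(j=1):
        # Stop at stick j with probability stick(j); otherwise move on to j+1.
        return j if rng.random() < stick(j) else pick_a_stick(j + 1)

    return pick_a_stick

my_sticks = make_sticks(1.0, random.Random(0))
draws = [my_sticks() for _ in range(20)]
```

Because earlier sticks tend to capture most of the probability mass, repeated calls cluster on small indices, and new sticks are only instantiated when the walk first reaches them.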

## Stochastic Memoization with DPmem

The above construction of the Dirichlet process defines a distribution over the infinite set of natural numbers. We quite often want a distribution not over the natural numbers themselves, but over an infinite set of samples from some other distribution (called the base distribution): we can generalize the Dirichlet process to this setting by using mem to associate to each natural number a draw from the base distribution.

We can do a similar transformation to any Church procedure: we associate to every argument and natural number pair a sample from the procedure, then use the Dirichlet process to define a new procedure with the same signature. In Church this useful higher-order distribution is called DPmem:

In a probabilistic setting a procedure applied to some inputs may evaluate to a different value on each execution. By wrapping such a procedure in mem we associate a randomly sampled value with each combination of arguments. We have seen how this is useful in defining random-world-style semantics, by persistently associating individual random draws with particular mem'd values. However, it is also natural to consider generalizing the notion of memoization itself to the stochastic case. Since DPmem is a higher-order procedure that transforms a procedure into one that sometimes reuses its return values, we call it a stochastic memoizer.[3]
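A Python sketch of a DPmem-like stochastic memoizer, built from a stick-breaking sampler: each argument tuple gets its own set of sticks, and each stick index is bound to one persistent draw from the base procedure. Applied to a normal base distribution, the resulting draws are discrete, so repeats appear.

```python
import random
from functools import lru_cache

def DPmem(alpha, proc, rng):
    # One set of sticks (and one table of stored values) per argument tuple.
    caches = {}

    def stochastically_memoized(*args):
        if args not in caches:
            @lru_cache(maxsize=None)
            def stick(k):
                return rng.betavariate(1.0, alpha)

            def pick(j=1):
                return j if rng.random() < stick(j) else pick(j + 1)

            caches[args] = (pick, {})
        pick, values = caches[args]
        index = pick()
        if index not in values:
            values[index] = proc(*args)  # a fresh draw from the base distribution
        return values[index]             # otherwise: reuse the stored draw

    return stochastically_memoized

rng = random.Random(0)
memoized_normal = DPmem(1.0, lambda: rng.gauss(0.0, 1.0), rng)
draws = [memoized_normal() for _ in range(30)]
```

Unlike draws from the raw normal distribution, which are all distinct with probability one, these draws revisit a small set of values: the base measure has been concentrated onto discrete points.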

## Properties of DP Memoized Procedures

A procedure in Church defines a distribution. When we wrap such a procedure in DPmem the resulting procedure defines a new Dirichlet process distribution. The underlying distribution associated with proc in the code above is called the base measure of the Dirichlet process and is often written $$\mu$$ or sometimes $$G_0$$. In the following example we stochastically memoize a normal distribution.

The DP is said to concentrate the base measure. Draws from a normal distribution are real-valued. However, draws from a DP are discrete (with probability one). By probabilistically memoizing a normal distribution we take the probability mass that the Gaussian spreads across the real line and concentrate it into a countable number of specific points. Compare the result of the previous computation with the result of sampling from a normal distribution itself.

The way that the DP concentrates the underlying base measure is illustrated in the following figure.

In the stick-breaking construction stick heights become shorter on average as we walk further down the number line. This means that earlier draws from the DP are more likely to be redrawn than later draws. When we use the DP to construct DPmem the memoized function will therefore tend to favor reuse of earlier computed values. Intuitively, we will use DPmem when we need to model reuse of samples in a scenario where we do not know in advance how many samples we need.

## Infinite Mixture Models

We now return to the problem of categorization with an unknown number of categories. We can use the Dirichlet process to construct a distribution over an infinite set of (potential) bags. A model like this is called an infinite mixture model; in this case an infinite Dirichlet-multinomial mixture model, since the observations (the colors) come from a multinomial distribution with a Dirichlet prior. The essential addition in this model is that we have DPmem'd gensym to provide a collection of reusable category (bag) labels: to generate an observation we first sample a category label from the memoized gensym.

Since the Dirichlet process tends to reuse earlier choices (more than later ones), our data will tend to cluster together in earlier components. However, there is no a priori bound on the number of latent classes; rather, there is just a bias towards fewer classes. The strength of this bias is controlled by the DP concentration parameter $$\alpha$$. When $$\alpha$$ is high, we will tolerate a larger number of classes; when it is low, we will strongly favor fewer classes. In general, the number of classes grows in proportion to $$\alpha \log(N)$$, where $$N$$ is the number of observations.

We can use this basic template to create infinite mixture models with any type of observation distribution. For instance, here is an infinite Gaussian mixture model:
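A Python sketch of sampling from such an infinite Gaussian mixture: a stick-breaking draw picks a component index, and each component lazily acquires a persistent mean drawn from a broad prior (the prior parameters here are invented).

```python
import random
from functools import lru_cache

def infinite_gaussian_mixture(alpha, n, rng):
    @lru_cache(maxsize=None)
    def stick(k):
        # Lazily drawn, persistent stick weight for component k.
        return rng.betavariate(1.0, alpha)

    @lru_cache(maxsize=None)
    def component_mean(k):
        # Each component gets a persistent mean from a broad prior.
        return rng.gauss(0.0, 10.0)

    def pick(j=1):
        return j if rng.random() < stick(j) else pick(j + 1)

    data = []
    for _ in range(n):
        k = pick()  # choose a (possibly never-seen-before) component
        data.append((k, rng.gauss(component_mean(k), 1.0)))
    return data

observations = infinite_gaussian_mixture(1.0, 50, random.Random(0))
components = sorted({k for k, _ in observations})
```

Although infinitely many components are available in principle, only a handful are ever instantiated for a finite dataset.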

There are, of course, many possible observation models that can be used in the infinite mixture. One advantage of using abstract objects to represent our data is that we can associate different observation models with different aspects of the data. For instance, we could have a model which models both height and propensity to smoke for the same set of individuals, based on their sex.

# Another View of the DP: The Chinese Restaurant Process

If you have looked at the literature on Bayesian models of cognition in the last few years, you will have seen many uses of a prior distribution known as the Chinese restaurant process (CRP). The Chinese restaurant process is an alternative, but equivalent, way to construct the Dirichlet process. The CRP is usually described as a sequential sampling scheme using the metaphor of a restaurant.

We imagine a restaurant with an infinite number of tables. The first customer enters the restaurant and sits at the first unoccupied table. The ($$N+1$$)th customer enters the restaurant and sits at either an already occupied table or a new, unoccupied table, according to the following distribution.

$\tau^{(N+1)} | \tau^{(1)},..., \tau^{(N)},\alpha \sim \sum_{i=1}^{K} \frac{ y_i }{ N + \alpha } \delta_{\tau_{i}} + \frac{ \alpha }{ N + \alpha } \delta_{\tau_{K+1}}$

$$N$$ is the total number of customers in the restaurant. $$K$$ is the total number of occupied tables, indexed by $$1 \leq i \leq K$$. $$\tau^{(j)}$$ refers to the table chosen by the $$j$$th customer. $$\tau_i$$ refers to the $$i$$th occupied table in the restaurant. $$y_i$$ is the number of customers seated at table $$\tau_i$$; $$\delta_{\tau}$$ is the $$\delta$$-distribution which puts all of its mass on table $$\tau$$. $$\alpha \geq 0$$ is the concentration parameter of the model.

In other words, customers sit at an already-occupied table with probability proportional to the number of individuals at that table, or at a new table with probability controlled by the parameter $$\alpha$$.
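This seating rule is easy to simulate. Here is a Python sketch that tracks only the table counts; note that with $$\alpha = 0$$ every customer after the first joins the first table.

```python
import random

def crp_seating(n_customers, alpha, rng):
    counts = []  # counts[i] = number of customers at table i
    for n in range(n_customers):
        if not counts:
            counts.append(1)  # the first customer opens the first table
            continue
        # Existing table i with probability counts[i] / (n + alpha);
        # a new table with probability alpha / (n + alpha).
        r = rng.random() * (n + alpha)
        acc = 0.0
        for i, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[i] += 1
                break
        else:
            counts.append(1)  # no existing table chosen: open a new one
    return counts

counts = crp_seating(100, 1.0, random.Random(0))
```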

Each table has a dish associated with it. Each dish $$v$$ is a label on the table which is shared by all the customers at that table. When the first customer sits at a new table, $$\tau_i$$, a dish is sampled from another distribution, $$\mu$$, and placed on that table. This distribution, $$\mu$$, is called the base distribution of the Chinese restaurant process, and is a parameter of the model. From then on, all customers who are seated at table $$\tau_i$$ share this dish, $$v_{\tau_i}$$.

The following animation demonstrates the Chinese restaurant process.

The CRP can be used to define a stochastic memoizer just as the Dirichlet process can. We let the dish at each table be drawn from the underlying procedure. When we seat a customer we emit the dish labeling the table where the customer sat. To use a CRP as a memoization distribution we associate our underlying procedure with a set of restaurants—one for each combination of a procedure with its arguments. We let customers represent particular instances in which a procedure is evaluated, and we let the dishes labeling each table represent the values that result from those procedure applications. The base distribution which generates dishes corresponds to the underlying procedure which we have memoized.

When we seat a customer at an existing table, it corresponds to retrieving a value from the memory. Every customer seated at an existing table always returns the dish placed at that table when it was created. When we seat a customer at a new table it corresponds to computing a fresh value from our memoized random function and storing it as the dish at the new table. Another way of understanding the CRP is to think of it as defining a distribution over ways of partitioning $$N$$ items (customers) into $$K$$ partitions (tables), for all possible $$N$$ and $$K$$.

The probability of a particular partition of $$N$$ customers over $$K$$ tables is the product of the probabilities of the $$N$$ choices made in seating those customers. It can easily be confirmed that the order in which elements are added to the partition components does not affect the probability of the final partition (i.e. the terms of the product can be rearranged in any order). Thus the distribution defined by a CRP is exchangeable.
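We can check exchangeability numerically: multiplying out the sequential seating probabilities for two arrival orders that induce the same partition gives the same probability. A Python sketch, with an arbitrary choice of $$\alpha$$:

```python
def partition_probability(seating, alpha):
    # Multiply out the CRP seating probabilities for customers arriving
    # in the given order; table labels are arbitrary symbols.
    counts = {}
    p = 1.0
    for n, table in enumerate(seating):
        if table in counts:
            p *= counts[table] / (n + alpha)  # join an occupied table
            counts[table] += 1
        else:
            p *= alpha / (n + alpha)          # open a new table
            counts[table] = 1
    return p

order1 = ["a", "a", "b", "a", "b"]  # a partition with table sizes {3, 2}
order2 = ["a", "b", "a", "b", "a"]  # the same partition, different arrival order
p1 = partition_probability(order1, 0.7)
p2 = partition_probability(order2, 0.7)
```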

The probability of a particular CRP partition can also be written down in closed form as follows.

$P(\vec{y})=\frac{\alpha^{K}\Gamma[\alpha]\prod_{j=1}^{K}\Gamma[y_{j}]}{\Gamma[\alpha+\sum_{j=1}^{K}y_{j}]}$

Where $$\vec{y}$$ is the vector of counts of customers at each table and $$\Gamma(\cdot)$$ is the gamma function, a continuous generalization of the factorial function. This shows that for a CRP the vector of counts is a sufficient statistic.
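Here is a Python check that the closed form agrees with the product of sequential seating probabilities (the gamma terms are computed via log-gamma for numerical stability):

```python
import math

def crp_sequential(counts, alpha):
    # Seat customers table by table: open each table, then fill it.
    # By exchangeability this order gives the same probability as any other.
    p = 1.0
    n = 0
    for y in counts:
        p *= alpha / (n + alpha)      # open a new table
        n += 1
        for c in range(1, y):
            p *= c / (n + alpha)      # the (c+1)th customer joins the table
            n += 1
    return p

def crp_closed_form(counts, alpha):
    K, N = len(counts), sum(counts)
    log_p = (K * math.log(alpha) + math.lgamma(alpha)
             + sum(math.lgamma(y) for y in counts)
             - math.lgamma(alpha + N))
    return math.exp(log_p)

p_seq = crp_sequential([4, 2, 1], 1.5)
p_closed = crp_closed_form([4, 2, 1], 1.5)
```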

As a distribution, the CRP has a number of useful properties. In particular, it implements a simplicity bias. It assigns a higher probability to partitions which:

1. have fewer customers
2. have fewer tables
3. for a fixed number of customers $$N$$, assign them to the smallest number of tables.

Thus the CRP favors simple restaurants and implements a rich-get-richer scheme. Tables with more customers have higher probability of being chosen by later customers. These properties mean that, all else being equal, when we use the CRP as a stochastic memoizer we favor reuse of previously computed values.

## Example: Goldwater Model 1

Goldwater, S., Griffiths, T. L., and Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112:21–54.

# Example: The Infinite Relational Model

(Adapted from: Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T. & Ueda, N. (2006). Learning systems of concepts with an infinite relational model. Proceedings of the 21st National Conference on Artificial Intelligence.)

Much semantic knowledge is inherently relational. For example, verb meanings can often be formalized as relations between their arguments. The verb "give" is a three-way relation between a giver, something given, and a receiver. These relations are also inherently typed. For example, the giver in the relation above is typically an agent. In a preceding section, we discussed infinite mixture models, where observations were generated from a potentially unbounded set of latent classes. In this section we introduce an extension of this model to relational data: the infinite relational model (IRM).

Given some relational data, the IRM learns to cluster objects into classes such that whether or not the relation holds depends on the pair of object classes. For instance, we can imagine trying to infer social groups from the relation of who talks to whom:

We see that the model invents two classes (the "boys" and the "girls") such that the boys talk to each other and the girls talk to each other, but girls don't talk to boys. Note that there is much missing data (unobserved potential relations) in this example.

# Example: CrossCat

(Adapted from: Shafto, P., Kemp, C., Mansinghka, V., Gordon, M., and Tenenbaum, J. B. (2006). Learning cross-cutting systems of categories. Proceedings of the Twenty-Eighth Annual Conference of the Cognitive Science Society.)

Often we have data where each object is associated with a number of features. In many cases, we can predict how these features generalize by assigning the objects to classes and predicting feature values on a class-by-class basis. In fact, we can encode this kind of structure using the IRM above if we interpret the relations as the presence or absence of a feature. However, in some cases this approach does not work because objects can be categorized into multiple categories, and different features are predicted by different category memberships.

For example, imagine we have some list of foods like: eggs, lobster, oatmeal, etc., and some list of features associated with these foods, such as: dairy, inexpensive, dinner, eaten with soup, etc. These features depend on different systems of categories that foods fall into, for example, one set of features may depend on the time of day that the foods are eaten, while another set may depend on whether the food is animal or vegetable based. CrossCat is a model designed to address these problems. The idea behind CrossCat is that features themselves cluster into kinds. Within each kind, particular objects will cluster into categories which predict the feature values for those objects in the kind. Thus CrossCat is a kind of hierarchical extension to the IRM which takes into account multiple domains of featural information.

# Other Non-Parametric Distributions

The Dirichlet process is the best-known example of a non-parametric distribution. The term non-parametric refers to statistical models whose size or complexity can grow with the data, rather than being specified in advance (sometimes these are called infinite models, since they assume an infinite, not just unbounded, number of categories). There are a number of other such distributions that are worth knowing.

## Pitman-Yor Distributions

Many models in the literature use a small generalization of the CRP known as the Pitman-Yor process (PYP). The Pitman-Yor process is identical to the CRP except for having an extra parameter, $$a$$, which introduces a dependency between the probability of sitting at a new table and the number of tables already occupied in the restaurant.

The process is defined as follows. The first customer enters the restaurant and sits at the first table. The ($$N+1$$)th customer enters the restaurant and sits at either an already occupied table or a new one, according to the following distribution.

$$\tau^{(N+1)} | \tau^{(1)},...,\tau^{(N)}, a, b \sim \sum_{i=1}^{K} \frac{ y_i - a }{ N + b } \delta_{\tau_{i}} + \frac{ Ka + b }{ N + b } \delta_{\tau_{K+1}}$$

Here all variables are the same as in the CRP, except for $$a$$ and $$b$$. $$b \geq 0$$ corresponds to the CRP $$\alpha$$ parameter. $$0 \leq a \leq 1$$ is a new discount parameter which moves a fraction of a unit of probability mass from each occupied table to the new table. When it is $$1$$, every customer will sit at their own table. When it is $$0$$ the distribution becomes the single-parameter CRP. The $$a$$ parameter can be thought of as controlling the productivity of a restaurant: how much sitting at a new table depends on how many tables already exist. On average, $$a$$ will be the limiting proportion of tables in the restaurant which have only a single customer. The $$b$$ parameter controls the rate of growth of new tables in relation to the total number of customers $$N$$ as before.

Like the CRP, the sequential sampling scheme outlined above generates a distribution over partitions for unbounded numbers of objects. Given some vector of table counts $$\vec{y}$$, a closed-form expression for this probability can be given as follows. First, define the following generalization of the factorial function, which multiplies $$m$$ factors in increments of size $$s$$ starting at $$x$$.

$$[x]_{m,s} = \begin{cases} 1 & \mathrm{for}\ m=0 \\ x(x+s)\cdots(x+(m-1)s) & \mathrm{for}\ m > 0 \end{cases}$$

Note that $$[1]_{m,1} = m!$$.

The probability of the partition given by the count vector, $$\vec{y}$$, is defined by:

$$P( \vec{y} | a, b) = \frac{[b+a]_{K-1,a}}{[b+1]_{N-1,1}} \prod_{i=1}^{K}[1-a]_{y_i-1,1}$$

It is easy to confirm that in the special case of $$a = 0$$ and $$b > 0$$, this reduces to the closed form for the CRP by noting that $$[1]_{m,1} = m! = \Gamma[m+1]$$. In what follows, we will assume that we have a higher-order function PYmem which takes three arguments a, b, and proc and returns the PYP-memoized version of proc.
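Here is a Python check of this reduction, using the generalized factorial $$[x]_{m,s}$$ defined above: at $$a = 0$$ the Pitman-Yor closed form matches the CRP closed form (the count vector and parameter values are arbitrary).

```python
import math

def rising(x, m, s):
    # The generalized factorial [x]_{m,s} = x (x+s) ... (x+(m-1)s); [x]_{0,s} = 1.
    p = 1.0
    for i in range(m):
        p *= x + i * s
    return p

def pyp_partition_probability(counts, a, b):
    K, N = len(counts), sum(counts)
    p = rising(b + a, K - 1, a) / rising(b + 1, N - 1, 1)
    for y in counts:
        p *= rising(1 - a, y - 1, 1)
    return p

def crp_partition_probability(counts, alpha):
    K, N = len(counts), sum(counts)
    log_p = (K * math.log(alpha) + math.lgamma(alpha)
             + sum(math.lgamma(y) for y in counts)
             - math.lgamma(alpha + N))
    return math.exp(log_p)

counts = [5, 3, 1]
p_pyp = pyp_partition_probability(counts, 0.0, 1.2)  # a = 0: should equal the CRP
p_crp = crp_partition_probability(counts, 1.2)
```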

1. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
2. This way of constructing a Dirichlet Process is known as the stick-breaking construction and was first defined in:
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639–650.
There are many other ways of defining a Dirichlet Process, one of which—the Chinese Restaurant Process—we will see below.
3. Goodman, N. D., Mansinghka, V. K., Roy, D. M., Bonawitz, K., and Tenenbaum, J. B. (2008). Church: a language for generative models. Proceedings of Uncertainty in Artificial Intelligence.