# Models with Unbounded Complexity

To return to the top level: Probabilistic Models of Cognition Tutorial.

By Timothy J. O'Donnell, Noah D. Goodman, Andreas Stuhlmueller, and the Church Working Group.

So far we have seen several examples of hierarchical and non-hierarchical static probability models. These are models where the number of random choices needed to generate observations, and their relationships to one another, are fixed in advance. Bayes nets, which we saw in the last section, were developed as an idiom for expressing these kinds of models. However, the true power of a language like Church is that it lets us define models with unbounded complexity. Here we discuss two kinds of such models. In the first, we allow an unbounded amount of repeated work during a computation, via recursion. In the second, we implicitly define an infinite model, but explore its structure only lazily, using memoization.

## Recursion

One way to get unbounded complexity is to let a single expression represent many computations, for instance by iterating. In the $$\lambda$$-calculus, iteration is performed using recursion. Recursion simply means writing a function that calls itself inside its own body. How can we get the length of a list? The following ChurchServ window shows one way to do it.

{{#churchserv:
(define (length list)
  (if (null? list)
      0
      (+ 1 (length (rest list)))))

(length '(1 2 3 4))
}}

There are a number of things to note about this example. First, we have used the null? procedure, which tests whether a list is empty. We use this procedure to test for the base case of the recursion: the condition that stops the recursion. This recursion loops through the list, one element at a time, adding one to the result of calling itself on the rest of the list. When it gets to the end of the list it returns 0.

We have already seen several uses of the repeat procedure. This procedure can be used to build up a list of (independent) samples from some thunk. How can this procedure be implemented in Church? repeat is what is called a higher-order procedure: a procedure that takes another procedure as an argument, or returns one as a value.

{{#churchserv:
(define (repeat K thunk)
  (if (= K 0)
      '()
      (pair (thunk) (repeat (- K 1) thunk))))

(repeat 5 flip)
}}

One of the most powerful aspects of the LISP family of programming languages is its set of higher-order list-manipulating functions. For a deeper understanding of these, please see Deterministic List Manipulating Functions.

### Example: 'Bag of Words' Models

In information retrieval documents are often represented as bags of words. This means that the content of the document is represented as an unordered collection of the words it contains. We can write a simple generative model for bags of words in Church.

{{#churchserv:
(define nouns '(chef omelet soup))
(define verbs '(eat work))
(define dictionary (append nouns verbs))

(define document-length (poisson 5))

(define document
  (repeat document-length
          (lambda () (uniform-draw dictionary))))

document
}}

Here we have used several new ERPs. The first is poisson, which implements the Poisson distribution. The Poisson distribution is a distribution over the non-negative integers that takes a parameter representing the mean of the distribution. Here is a plot of the Poisson distribution for several different parameter values.

Another ERP we have used here is uniform-draw, which takes a list and returns one of its elements uniformly at random. The generative model we have defined first samples a length for the document from the Poisson distribution and then samples a document (represented as a list) by repeatedly calling uniform-draw on a dictionary of words. This generative model gives our first example of a model with unbounded complexity: poisson defines a distribution on all non-negative integers, so, in principle, we can generate a document of any size.
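Since Church code may be unfamiliar, here is the same bag-of-words process sketched in plain Python (an illustrative translation, not Church; `sample_poisson` and `sample_document` are our own helper names, and the Poisson sampler uses Knuth's classic algorithm):

```python
import math
import random

def sample_poisson(mean, rng):
    # Knuth's algorithm: multiply uniforms until the product drops below e^-mean.
    L = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_document(dictionary, mean_length, rng):
    # Draw a document length from Poisson(mean_length), then sample that many
    # words uniformly at random: an unordered "bag" of words.
    n = sample_poisson(mean_length, rng)
    return [rng.choice(dictionary) for _ in range(n)]

rng = random.Random(0)
dictionary = ['chef', 'omelet', 'soup', 'eat', 'work']
doc = sample_document(dictionary, 5, rng)
print(doc)
```

Because the length is itself a Poisson draw, this sketch can in principle produce a document of any size, just like the Church model.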

In the example above we used a uniform distribution over a set of objects, implemented by the ERP uniform-draw. This is a special case of the more general class of distributions known as multinomial distributions. The ERP multinomial implements a distribution over $$K$$ discrete outcomes. In other words, it is just a $$K$$-sided biased die.

{{#churchserv:
(define my-die
  (lambda () (multinomial '(1 2 3 4) '(1/2 1/6 1/6 1/6))))

(my-die)
(my-die)
(my-die)
}}

multinomial takes two arguments: the first is a list of the objects to sample from, and the second is a list, of the same length, of probabilities for those objects. uniform-draw can be implemented using multinomial.

{{#churchserv:
(define (uniform-draw list)
  (let* ((list-length (length list))
         (weights (repeat list-length (lambda () (/ 1 list-length)))))
    (multinomial list weights)))

(uniform-draw '(1 2 3 4 5))
}}

We saw above how we could use the beta distribution to draw weights for flip. A multinomial distribution is just a generalization of flip from 2 possible outcomes to $$K$$ possible outcomes. There is a corresponding generalization of the beta distribution, called the Dirichlet distribution. A Dirichlet distribution is a distribution on length-$$K$$ vectors of real numbers which sum to 1. These vectors fall into a $$K$$-dimensional simplex: the generalization of a triangle to multiple dimensions. The Dirichlet distribution is parameterized by a vector of $$K$$ pseudo-counts, one for each of the possible outcomes. These pseudo-counts work just like the pseudo-counts for the beta distribution. The figure below shows a 4-dimensional Dirichlet distribution for several different pseudo-counts.
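In Church, dirichlet is a primitive ERP. For intuition about what it computes, the standard construction can be sketched in Python (our own helper, standard library only): draw an independent Gamma variate for each pseudo-count and normalize, which yields a vector on the simplex.

```python
import random

def sample_dirichlet(pseudo_counts, rng):
    # Standard construction: draw Gamma(alpha_k, 1) for each pseudo-count
    # and normalize; the result is a probability vector on the K-simplex.
    draws = [rng.gammavariate(a, 1.0) for a in pseudo_counts]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(1)
weights = sample_dirichlet([1, 1, 1, 1], rng)
print(weights)  # four positive numbers summing to 1
```

With pseudo-counts of all ones, as here, every point on the simplex is equally likely.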

We can use the Dirichlet distribution to draw a random die.

{{#churchserv:
(define my-die
  (let ((weights (dirichlet '(1 1 1 1))))
    (lambda () (multinomial '(1 2 3 4) weights))))

(my-die)
(my-die)
(my-die)
}}

We can now build a hierarchical bag-of-words model that puts a prior on the multinomial parameters. We draw a different set of parameters for each document.

{{#churchserv:
(define nouns '(chef omelet soup))
(define verbs '(eat work))
(define dictionary (append nouns verbs))

(define document-length (poisson 5))

(define document-proportions
  (dirichlet (repeat (length dictionary) (lambda () 1))))

(define document
  (repeat document-length
          (lambda () (multinomial dictionary document-proportions))))

document
}}

### Example: Geometric Distributions

In our "bag of words" example, the computation was unbounded because we first drew the size of the document from a Poisson distribution and then passed this to repeat. A more interesting case of unboundedness comes when we make use of randomness directly in the recursion itself. For example, an important probability distribution is the geometric distribution. The geometric distribution is a distribution over the non-negative integers that represents the number of times a coin comes up tails before it first comes up heads. This distribution can be written in Church with the following simple recursion.

{{#churchserv:
(define (geometric p)
  (if (flip p)
      0
      (+ 1 (geometric p))))

(geometric .8)
}}

Notice a remarkable property of this definition: the base case of the recursion is probabilistic. There is no upper bound on how long the computation can go on, although the probability of reaching some number declines quickly as we walk out on the number line.
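A quick way to convince yourself of this is to simulate. The following Python sketch (an illustrative translation, not Church) draws many geometric samples and checks that about a fraction $$p$$ of them are 0, with the probability of larger values falling off geometrically:

```python
import random

def geometric(p, rng):
    # Count the failures before the first success, mirroring the Church
    # recursion: return 0 with probability p, else 1 + another draw.
    n = 0
    while not (rng.random() < p):
        n += 1
    return n

rng = random.Random(2)
samples = [geometric(0.8, rng) for _ in range(10000)]
# P(N = 0) should be about p = 0.8, P(N = 1) about (1 - p) * p = 0.16, ...
print(samples.count(0) / len(samples))
```

There is no bound on the value a single draw can take, but the loop terminates with probability 1.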

### Example: N-gram Models

Many systems in computational linguistics make use of n-gram models, also called Markov models. An n-gram model is a drastically simplified model of sentence structure where the distribution over words in a particular position depends on the last $$(N-1)$$ words. N-gram models are a simple example of an unbounded model where the structure of the computation itself depends on probabilistic choices. Here is the Church code for a simple bigram model, where the next word depends on the last.

{{#churchserv:
(define vocabulary '(chef omelet soup eat work bake stop))

(define (sample-words last-word)
  (if (equal? last-word 'stop)
      '()
      (pair last-word
            (let ((next-word
                   (case last-word
                     (('start)  (multinomial vocabulary '(0.0032 0.4863 0.0789 0.0675 0.1974 0.1387 0.0277)))
                     (('chef)   (multinomial vocabulary '(0.0699 0.1296 0.0278 0.4131 0.1239 0.2159 0.0194)))
                     (('omelet) (multinomial vocabulary '(0.2301 0.0571 0.1884 0.1393 0.0977 0.1040 0.1831)))
                     (('soup)   (multinomial vocabulary '(0.1539 0.0653 0.0410 0.1622 0.2166 0.2664 0.0941)))
                     (('eat)    (multinomial vocabulary '(0.0343 0.0258 0.6170 0.0610 0.0203 0.2401 0.0011)))
                     (('work)   (multinomial vocabulary '(0.0602 0.2479 0.0034 0.0095 0.6363 0.02908 0.0133)))
                     (('bake)   (multinomial vocabulary '(0.0602 0.2479 0.0034 0.0095 0.6363 0.02908 0.0133)))
                     (else 'error))))
              (sample-words next-word)))))

(sample-words 'start)
}}

Here we have used a case statement, another Church branching construct, which has the following syntax.

(case key-expression
  ((object-1-1 object-1-2 ...) expression-1)
  ((object-2-1 object-2-2 ...) expression-2)
  ...
  ((object-N-1 object-N-2 ...) expression-N)
  (else else-expression))


Church first evaluates the key-expression and then tries to match its value against the objects in each match list. If the value matches an object, the corresponding expression is evaluated. If it matches nothing, the else clause is evaluated. sample-words is a recursion which builds up a list of words by sampling the next word conditional on the last. Each word is sampled from a multinomial distribution whose parameters are fixed. We start the recursion by sampling conditional on the special symbol start, and when we sample the symbol stop we end the recursion. Like the geometric distribution defined above, this is another example of a process that can produce unbounded structures. However, unlike the case of the geometric distribution, there are now many paths through the space of probabilistic choices.
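The same sampling loop can be sketched in plain Python (an illustrative translation; the small transition table here is hand-picked for brevity and is not the one used in the Church code above):

```python
import random

# Each entry maps a word to (next-word, probability) pairs summing to 1.
transitions = {
    'start': [('chef', 0.7), ('soup', 0.3)],
    'chef':  [('eat', 0.5), ('work', 0.3), ('stop', 0.2)],
    'soup':  [('eat', 0.6), ('stop', 0.4)],
    'eat':   [('soup', 0.5), ('stop', 0.5)],
    'work':  [('stop', 1.0)],
}

def sample_next(word, rng):
    # Inverse-CDF sampling over the listed (word, probability) pairs.
    r = rng.random()
    for w, p in transitions[word]:
        r -= p
        if r <= 0:
            return w
    return transitions[word][-1][0]

def sample_sentence(rng):
    words, word = [], sample_next('start', rng)
    while word != 'stop':          # sampling 'stop' ends the recursion
        words.append(word)
        word = sample_next(word, rng)
    return words

print(sample_sentence(random.Random(3)))
```

As in the Church version, there is no bound on sentence length: the loop ends only when 'stop' happens to be sampled.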

LISP is famous for its higher-order list-manipulating functions (which you can read about here: Deterministic List Manipulating Functions). One particular function, called fold, is especially powerful: it can be used to do any list-based recursion. In the probabilistic setting there is an important related procedure called unfold, which recursively builds up lists. unfold takes three arguments: the first is some object; the second is a transition function, which returns the next object given the last one; and the last is a predicate that stops the recursion.

(define (unfold current transition stop?)
  (if (stop? current)
      '()
      (pair current (unfold (transition current) transition stop?))))


With unfold defined we can now refactor our bigram model.

{{#churchserv:
(define (unfold current transition stop?)
  (if (stop? current)
      '()
      (pair current (unfold (transition current) transition stop?))))

(define vocabulary '(chef omelet soup eat work bake stop))

(define (transition word)
  (case word
    (('start)  (multinomial vocabulary '(0.0032 0.4863 0.0789 0.0675 0.1974 0.1387 0.0277)))
    (('chef)   (multinomial vocabulary '(0.0699 0.1296 0.0278 0.4131 0.1239 0.2159 0.0194)))
    (('omelet) (multinomial vocabulary '(0.2301 0.0571 0.1884 0.1393 0.0977 0.1040 0.1831)))
    (('soup)   (multinomial vocabulary '(0.1539 0.0653 0.0410 0.1622 0.2166 0.2664 0.0941)))
    (('eat)    (multinomial vocabulary '(0.0343 0.0258 0.6170 0.0610 0.0203 0.2401 0.0011)))
    (('work)   (multinomial vocabulary '(0.0602 0.2479 0.0034 0.0095 0.6363 0.02908 0.0133)))
    (('bake)   (multinomial vocabulary '(0.0602 0.2479 0.0034 0.0095 0.6363 0.02908 0.0133)))
    (else 'error)))

(define (stop? word)
  (equal? word 'stop))

(unfold 'start transition stop?)
}}

We will see many uses of unfold and related higher-order procedures later in the tutorial.
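The unfold pattern itself is independent of randomness. As a sketch (our own Python translation, with a deterministic transition for clarity):

```python
def unfold(current, transition, stop):
    # Build up a list by repeatedly applying transition until stop holds,
    # mirroring the recursive Church definition with a loop.
    out = []
    while not stop(current):
        out.append(current)
        current = transition(current)
    return out

# Deterministic usage example: count down from 5.
print(unfold(5, lambda n: n - 1, lambda n: n == 0))  # [5, 4, 3, 2, 1]
```

Passing a stochastic transition (like the bigram sampler above) turns this same skeleton into a sampler for unbounded sequences.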

## Unbounded Complexity through Implicitly Infinite Distributions: mem

We have seen several ways in which we can achieve unbounded complexity via recursion. Another way in which unbounded complexity can be achieved is by defining a model on an infinite structure, only a finite amount of which will be "used" in any one sample. Church provides the mem operator to easily construct such models.

mem is a higher-order procedure that takes another procedure and produces a memoized version of it: the memoized procedure has a persistent value within each "run" of the generative model. For instance, consider the equality of two flips, and the equality of two memoized flips:

{{#churchserv:
(equal? (flip) (flip))

(begin
  (define mem-flip (mem flip))
  (equal? (mem-flip) (mem-flip)))
}}

Note that mem affects the meaning of probabilistic procedures, but not deterministic ones (which always return the same value anyway). Memoization is usually used in computer science as an optimization technique (see Memoization as Optimization); in Church it is used for expressiveness.
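As a rough sketch of what mem does (our own Python analogue, not Church's implementation, and ignoring the subtleties of what counts as a "run"): cache the result per argument tuple, so a random procedure returns a persistent value for each distinct argument.

```python
import random

def mem(proc):
    # Hypothetical analogue of Church's mem: remember the first result
    # computed for each argument tuple and return it on every later call.
    cache = {}
    def memoized(*args):
        if args not in cache:
            cache[args] = proc(*args)
        return cache[args]
    return memoized

rng = random.Random(0)
flip = lambda: rng.random() < 0.5

mem_flip = mem(flip)
print(flip() == flip())            # true or false, at random
print(mem_flip() == mem_flip())    # always true: the value persists
```

For a deterministic procedure this caching changes nothing observable, which is why mem only matters for probabilistic procedures.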

To understand why memoization is useful, consider building a model with a set of objects that each have a randomly chosen property.

{{#churchserv:
(define (eye-color person) (uniform-draw '(blue green brown)))

(list
 (eye-color 'bob)
 (eye-color 'alice)
 (eye-color 'bob))
}}

This is clearly wrong: Bob's eye color changes each time we ask about it! What we want is a model in which eye color is random, but persistent. We can do this using mem.

{{#churchserv:
(begin
  (define eye-color
    (mem
     (lambda (person) (uniform-draw '(blue green brown)))))

  (list
   (eye-color 'bob)
   (eye-color 'alice)
   (eye-color 'bob)))
}}

This type of modeling is called random world style (McAllester, Milch, Goodman, 2008). Note that we don't have to specify ahead of time the people whose eye color we will ask about: the distribution on eye colors is implicitly defined over the infinite set of possible people, but only constructed "lazily" when needed.

### Example: Mixture Modeling

A popular class of probabilistic models that we can use to demonstrate mem is mixture models. Mixture models are a type of hierarchical model for clustering data. The generative process for a mixture model consists of two parts. To generate an observation, we first sample a component from the so-called mixing distribution. This can be thought of as the "cluster" the observation comes from. We then sample the actual observation from some distribution associated with that particular component. These models are called "mixture" models because the ultimate distribution over data points is a "mixture" of the distributions associated with the components, weighted according to the mixing distribution.

A simple example of a mixture model can be given using a multinomial as the mixing distribution and Gaussian distributions as the component distributions. A Gaussian distribution, also called a normal distribution, is the familiar bell-curve distribution over real numbers and is parameterized by two parameters: a mean and a variance (or standard deviation). The mean controls where the distribution is centered on the real line, and the variance controls the distribution's spread: how wide it is.

{{#churchserv:
(define num-components 3)
(define num-observations 50)

(define (mixing-distribution)
  (multinomial (iota num-components)
               (repeat num-components (lambda () (/ 1 num-components)))))

(define component-parameters
  (repeat num-components
          (lambda () (list (gaussian 0 1) (gaussian 0 1)))))

(define (generate-observation)
  (let* ((component (mixing-distribution))
         (parameters (list-ref component-parameters component)))
    (apply gaussian parameters)))

(truehist (repeat num-observations generate-observation) "Mixture")
}}

Here we use the iota procedure, which takes an integer $$N$$ and returns a list of length $$N$$ that counts up by one from 0.

{{#churchserv: (iota 10) }}

We use this procedure to generate labels for our components. We define a thunk called mixing-distribution which samples one of these numbered components from a uniform multinomial. We then sample a list, component-parameters, which consists of a mean/variance pair for each component, bundled together into a list. We define a procedure generate-observation which first samples a component from the mixing distribution, retrieves the mean and variance associated with that component, and then samples a value from a Gaussian with those parameters.

Finally, we use generate-observation to sample several observations. Notice that here we had to associate a mean/variance pair with each component. We did this by a certain "trick": we used numbers to name the components, so we could also use those numbers to index into the list of sampled parameters. This is a case where we could have used mem instead.

{{#churchserv:
(begin
  (define num-components 3)
  (define num-observations 50)

  (define (mixing-distribution)
    (multinomial (iota num-components)
                 (repeat num-components (lambda () (/ 1 num-components)))))

  (define component->parameters
    (mem
     (lambda (component)
       (list (gaussian 0 1) (gaussian 0 1)))))

  (define (generate-observation)
    (let* ((component (mixing-distribution))
           (parameters (component->parameters component)))
      (apply gaussian parameters)))

  (truehist (repeat num-observations generate-observation) "Mixture"))
}}

Here we have defined a memoized procedure called component->parameters which takes a component and returns its parameters. The body of the procedure simply samples a mean and a variance parameter, wraps them up in a list, and returns them. However, the procedure is memoized on the component argument. This means that the first time it is called with a given argument it samples a mean/variance pair; on subsequent calls with the same argument, it simply returns the same pair. This way of doing things has the advantage that we do not have to rely on the "trick" of indexing our components with numbers. For example, if our components had non-integer names, the mem idiom would still work.

{{#churchserv:
(begin
  (define components '(a b c))
  (define num-components (length components))
  (define num-observations 50)

  (define (mixing-distribution)
    (multinomial components
                 (repeat num-components (lambda () (/ 1 num-components)))))

  (define component->parameters
    (mem
     (lambda (component)
       (list (gaussian 0 1) (gaussian 0 1)))))

  (define (generate-observation)
    (let* ((component (mixing-distribution))
           (parameters (component->parameters component)))
      (apply gaussian parameters)))

  (truehist (repeat num-observations generate-observation) "Mixture"))
}}
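The two-step generative process (sample a component from the mixing distribution, then sample from that component's Gaussian) can also be sketched in plain Python (illustrative only; the component weights and parameters here are fixed by hand rather than drawn from priors):

```python
import random

# Each component: (mixing weight, mean, standard deviation).
components = [
    (0.5, -2.0, 1.0),
    (0.5,  2.0, 1.0),
]

def sample_observation(rng):
    # Step 1: pick a component according to the mixing weights.
    r = rng.random()
    for weight, mean, sd in components:
        r -= weight
        if r <= 0:
            # Step 2: sample from that component's Gaussian.
            return rng.gauss(mean, sd)
    return rng.gauss(components[-1][1], components[-1][2])

rng = random.Random(5)
observations = [sample_observation(rng) for _ in range(1000)]
print(sum(observations) / len(observations))  # near 0: the two modes balance
```

A histogram of these observations would show two bumps, one per component: the "mixture" of the two Gaussians.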

### Example: Hierarchical N-gram Models

Suppose that we wanted to put a prior distribution on the multinomial transitions in the n-gram model that we defined earlier. Using mem this is very straightforward.

{{#churchserv:
(define (unfold current transition stop?)
  (if (stop? current)
      '()
      (pair current (unfold (transition current) transition stop?))))

(begin
  (define vocabulary '(chef omelet soup eat work bake stop))

  (define word->distribution
    (mem
     (lambda (word)
       (dirichlet (repeat (length vocabulary) (lambda () 1))))))

  (define (transition word)
    (multinomial vocabulary (word->distribution word)))

  (define (stop? word)
    (equal? word 'stop))

  (unfold 'start transition stop?))
}}

In fact, the use of mem here considerably simplifies the overall program.

### Example: Hidden Markov Models

Another popular model in computational linguistics is the hidden Markov model (HMM). The HMM consists of a number of hidden states with transitions between them. For each state there is an observation model which outputs observations conditional on the state.

{{#churchserv:
(begin
  (define states '(s1 s2 s3 s4 s5 s6 s7 s8 stop))

  (define vocabulary '(chef omelet soup eat work bake))

  (define state->observation-model
    (mem
     (lambda (state)
       (dirichlet (repeat (length vocabulary) (lambda () 1))))))

  (define (observation state)
    (multinomial vocabulary (state->observation-model state)))

  (define state->transition-model
    (mem
     (lambda (state)
       (dirichlet (repeat (length states) (lambda () 1))))))

  (define (transition state)
    (multinomial states (state->transition-model state)))

  (define (stop? state)
    (equal? state 'stop))

  (define (unfold-HMM observation transition stop? start-state)
    (if (stop? start-state)
        '()
        (pair (observation start-state)
              (unfold-HMM observation transition stop? (transition start-state)))))

  (unfold-HMM observation transition stop? 'start))
}}

### Example: Probabilistic Context-free Grammars

Probabilistic context-free grammars (PCFGs) have been the dominant formalism for doing parsing in computational linguistics for almost twenty years. There are many ways to write a PCFG in Church. One way, inspired by Prolog programming, is to let non-terminals be represented by procedures.

{{#churchserv:
(define (terminal t) (lambda () t))

(define D
  (lambda ()
    (map sample
         (multinomial
          (list (list (terminal 'the))
                (list (terminal 'a)))
          (list (/ 1 2) (/ 1 2))))))

(define N
  (lambda ()
    (map sample
         (multinomial
          (list (list (terminal 'chef))
                (list (terminal 'soup))
                (list (terminal 'omelet)))
          (list (/ 1 3) (/ 1 3) (/ 1 3))))))

(define V
  (lambda ()
    (map sample
         (multinomial
          (list (list (terminal 'cooks))
                (list (terminal 'works)))
          (list (/ 1 2) (/ 1 2))))))

(define A
  (lambda ()
    (map sample
         (multinomial
          (list (list (terminal 'diligently)))
          (list (/ 1 1))))))

(define AP
  (lambda ()
    (map sample
         (multinomial
          (list (list A))
          (list (/ 1 1))))))

(define NP
  (lambda ()
    (map sample
         (multinomial
          (list (list D N))
          (list (/ 1 1))))))

(define VP
  (lambda ()
    (map sample
         (multinomial
          (list (list V AP)
                (list V NP))
          (list (/ 1 2) (/ 1 2))))))

(define S
  (lambda ()
    (map sample
         (multinomial
          (list (list NP VP))
          (list (/ 1 1))))))

(S)
}}

Here we have used several new Church procedures. The first of these is a very important procedure in functional programming known as map. map takes a procedure and a list and returns the list that results from applying the procedure to each element in the list. It can be written as follows.

{{#churchserv:
(define (map proc list)
  (if (null? list)
      '()
      (pair (proc (first list)) (map proc (rest list)))))

(map (lambda (x) (+ x 1)) '(1 2 3 4))
}}

The second procedure we have used is sample. sample is a procedure which applies a thunk, resulting in a sample. sample can be defined by the following Church code.

{{#churchserv:
(define (sample distribution)
  (apply distribution '()))

(sample flip)
}}

Now, let's look at one of the procedures defining our PCFG in detail.

(define VP
  (lambda ()
    (map sample
         (multinomial
          (list (list V AP)
                (list V NP))
          (list (/ 1 2) (/ 1 2))))))


When VP is called it maps sample across a list which is sampled from a multinomial distribution: in this case either (V AP) or (V NP). These two lists correspond to the right-hand sides (RHSs) of the rules $$VP \longrightarrow V\ AP$$ and $$VP \longrightarrow V\ NP$$ in the standard representation of PCFGs. They are lists consisting of the names of other procedures, so when sample is applied to them, they themselves recursively sample their RHSs until no more recursion can take place. Note that we have wrapped our terminals up in thunks so that when they are sampled they return the terminal symbol.

One thing to note about this example is that each of these procedures is nearly identical. This suggests that we could write the grammar more succinctly. It turns out that there is a generalization of unfold, called tree-unfold, which will do the trick. Using tree-unfold we can rewrite our PCFG like this:

{{#churchserv:
(define (terminal t) (list 'terminal t))

(define (unwrap-terminal t) (second t))

(define (terminal? symbol)
  (if (list? symbol)
      (equal? (first symbol) 'terminal)
      false))

(define (tree-unfold transition start-symbol)
  (if (terminal? start-symbol)
      (unwrap-terminal start-symbol)
      (pair start-symbol
            (map (lambda (symbol) (tree-unfold transition symbol))
                 (transition start-symbol)))))

(define (transition nonterminal)
  (case nonterminal
    (('D)  (multinomial (list (list (terminal 'the))
                              (list (terminal 'a)))
                        (list (/ 1 2) (/ 1 2))))
    (('N)  (multinomial (list (list (terminal 'chef))
                              (list (terminal 'soup))
                              (list (terminal 'omelet)))
                        (list (/ 1 3) (/ 1 3) (/ 1 3))))
    (('V)  (multinomial (list (list (terminal 'cooks))
                              (list (terminal 'works)))
                        (list (/ 1 2) (/ 1 2))))
    (('A)  (multinomial (list (list (terminal 'diligently)))
                        (list (/ 1 1))))
    (('AP) (multinomial (list (list 'A))
                        (list (/ 1 1))))
    (('NP) (multinomial (list (list 'D 'N))
                        (list (/ 1 1))))
    (('VP) (multinomial (list (list 'V 'AP)
                              (list 'V 'NP))
                        (list (/ 1 2) (/ 1 2))))
    (('S)  (multinomial (list (list 'NP 'VP))
                        (list (/ 1 1))))
    (else 'error)))

(tree-unfold transition 'S)
}}

Exercise: write a version of the preceding PCFG that draws the RHS distributions from a Dirichlet distribution.

## Priors for Infinite Discrete Distributions: The Dirichlet Process

In all of the examples we have seen so far of multinomial distributions we have known in advance the number of kinds of observations.

{{#churchserv:
(define parameters (dirichlet (repeat 6 (lambda () 1/6))))
(multinomial '(a b c d e f) parameters)
}}

Here we have defined a hierarchical model on a set of 6 possible observation types. However, it is often the case that we do not know in advance the number of types of observations. We would like to be able to define a prior on draws from an infinite multinomial. Just as the Dirichlet distribution defines a prior on parameters for a multinomial with $$K$$ possible outcomes, the Dirichlet process defines a prior on parameters for a multinomial with an infinite number of possible outcomes. Here is how it works.

First we imagine that we draw an infinite sequence of samples from a beta distribution with parameters $$1$$ and $$\alpha$$. Remember that a beta distribution defines a distribution on the interval (0,1). We write this infinite set of draws as $$\left\{\beta'_k\right\}_{k=1}^{\infty}$$.

$\beta'_k \sim \mathrm{Beta}\left(1,\alpha\right)$

We want to define a distribution on an infinite set of discrete outcomes. Without loss of generality we can think of this as defining a distribution on the natural numbers. We do this by saying that the probability of the natural number $$k$$ is given by the following expression.

$\beta_k = \prod_{i=1}^{k-1}\left(1-\beta'_i\right)\cdot\beta'_k$

What does this mean? To sample natural number $$k$$ we walk down the natural numbers in order, flipping a coin with weight $$\beta'_i$$ for each one. If the coin comes up false we continue; if the coin comes up true we stop and return that natural number. Convince yourself that the probability of getting natural number $$k$$ is given by $$\beta_k$$ above.
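One way to convince yourself is numerically: fix some stick weights by hand, compute each $$\beta_k$$ from the product formula, and compare against simulated walks down the sticks (a Python sketch; the particular weights are arbitrary, with the last set to 1 so the walk always terminates):

```python
import random

sticks = [0.5, 0.3, 0.8, 1.0]   # beta'_1 .. beta'_4; the last absorbs the rest

def analytic(k):
    # beta_k = prod_{i < k} (1 - beta'_i) * beta'_k
    p = sticks[k]
    for i in range(k):
        p *= 1 - sticks[i]
    return p

def walk(rng):
    # Walk down the sticks, flipping a coin with weight beta'_k at each.
    for k, p in enumerate(sticks):
        if rng.random() < p:
            return k
    return len(sticks) - 1

rng = random.Random(6)
counts = [0] * len(sticks)
for _ in range(20000):
    counts[walk(rng)] += 1

for k in range(len(sticks)):
    print(k, analytic(k), counts[k] / 20000)
```

The analytic probabilities and simulated frequencies agree closely, and the analytic values sum to 1 because the last stick catches all remaining mass.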

Now we have a way of assigning a probability to an infinite discrete set: the natural numbers. However, we made the unrealistic assumption that we could first draw this infinite stream of beta samples. In practice, we do this lazily, using mem: we only sample a new value $$\beta'_k$$ when we need it. Let's call the $$\beta'_k$$'s "sticks" (think of them as sticks of height proportional to $$\beta'_k$$). We can define a procedure, pick-a-stick, that walks down the list of sticks and flips a coin at each one: if the coin comes up true it returns the index of that stick; if it comes up false it recurses further.

(define (pick-a-stick sticks J)
  (if (flip (sticks J))
      J
      (pick-a-stick sticks (+ J 1))))


pick-a-stick is a higher-order procedure that takes another procedure called sticks which returns the stick weight for each stick. We could define sticks as follows.

(define sticks
  (mem
   (lambda (index) (beta 1 1))))


Notice that sticks uses mem to associate a particular draw from beta with each natural number. When we call it again with the same index we will get back the same stick weight. We can put these last ideas together in a procedure called make-sticks which takes the $$\alpha$$ parameter as an input and returns a procedure which samples stick indices.

{{#churchserv:
(begin
  (define (pick-a-stick sticks J)
    (if (flip (sticks J))
        J
        (pick-a-stick sticks (+ J 1))))

  (define (make-sticks alpha)
    (let ((sticks (mem (lambda (x) (beta 1.0 alpha)))))
      (lambda () (pick-a-stick sticks 1))))

  (define my-sticks (make-sticks 1))

  (hist (repeat 1000 my-sticks) "Dirichlet Process"))
}}

my-sticks samples from the natural numbers by walking down the list starting at 1 and flipping a coin weighted by a fixed $$\beta'_k$$ for each $$k$$. When the coin comes up true it returns $$k$$; otherwise it keeps going. It does this by drawing the individual $$\beta'_k$$ lazily, only generating new ones when it has walked out further than on any previous draw.
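The same lazy construction can be sketched in Python, with a dictionary standing in for mem (our own analogue, not Church):

```python
import random

def make_sticks(alpha, rng):
    # Stick weights are drawn from Beta(1, alpha) lazily and cached, so each
    # index keeps a persistent weight across draws -- the role mem plays above.
    weights = {}
    def stick(k):
        if k not in weights:
            weights[k] = rng.betavariate(1.0, alpha)
        return weights[k]
    def sample():
        # Walk down the sticks, flipping a coin weighted by stick(k) at each.
        k = 1
        while not (rng.random() < stick(k)):
            k += 1
        return k
    return sample

rng = random.Random(7)
my_sticks = make_sticks(1.0, rng)
draws = [my_sticks() for _ in range(1000)]
print(sorted(set(draws))[:5])  # a few low indices dominate; the tail is unbounded in principle
```

New stick weights are generated only when a walk goes further than any previous one, so only a finite portion of the infinite distribution is ever constructed.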

We will see later how we can use Dirichlet processes to generalize memoization to stochastic memoization and define infinite grammars and other exotic objects.