Inference about inference: Nested query (ESSLLI)
The query operator is an ordinary Church function, in the sense that it can occur anywhere that any other function can occur. In particular, we can construct a query with another query inside of it: this represents hypothetical inference about a hypothetical inference. (There are some implementation-specific restrictions on this. In MIT-Church the mh-query operator cannot be nested inside itself, though the rejection-query operator can.)
Nested queries are particularly useful in modeling social cognition: reasoning about another agent, who is herself reasoning.
How can we capture our intuitive theory of other people? Central to our understanding is the principle of rationality: an agent will choose actions that she expects to lead to outcomes that satisfy her goals. (This is a slight restatement of the principle as discussed in Baker, Saxe, and Tenenbaum, 2009, building on earlier work by Dennett, 1987, and Gergely and Csibra, among others.) We can represent this in Church by an inner query, in which an agent infers an action that will lead to her goal being satisfied:
(define (choose-action goal? transition state)
  (query
   (define action (action-prior))
   action
   (goal? (transition state action))))
The function transition describes the outcome of taking a particular action in a particular state; the predicate goal? determines whether or not a state accomplishes the goal; and the input state represents the current state of the world. The function action-prior, used within choose-action, represents an a priori tendency towards certain actions.
For instance, imagine that Sally walks up to a vending machine wishing to have a cookie. Imagine also that we know the mapping between buttons (potential actions) and foods (outcomes). We can then predict Sally's action:
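A minimal sketch of this scenario might look as follows. The button labels a and b, the button-to-food mapping, the uniform action-prior, and the have-cookie? predicate are all assumptions made for illustration. rejection-query stands in for the generic query because it can be nested in MIT-Church. The quoted datums in the case clauses follow the Church convention and may need adjusting in other Scheme dialects.

```scheme
;; Assumed prior: Sally is equally inclined to press either button.
(define (action-prior) (uniform-draw '(a b)))

;; Assumed deterministic machine: button a gives a bagel, b a cookie.
(define (vending-machine state action)
  (case action
    (('a) 'bagel)
    (('b) 'cookie)
    (else 'nothing)))

;; Goal predicate: the outcome is a cookie.
(define (have-cookie? object) (equal? object 'cookie))

;; Sally infers an action that leads to her goal being satisfied.
(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

;; The state argument is a placeholder; this machine ignores it.
(choose-action have-cookie? vending-machine 'state)
```

Because only button b yields a cookie in this assumed machine, every accepted sample is b.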
We see, unsurprisingly, that if Sally wants a cookie, she will always press button b. (In defining the vending machine we have used the case statement instead of a long set of ifs.) In a world that is not quite so deterministic Sally's actions will be more stochastic:
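One way to sketch such a noisy machine is to let each button deliver its usual food only most of the time. The 0.9/0.1 weights below are assumed values for illustration, as are the helper names.

```scheme
(define (action-prior) (uniform-draw '(a b)))

;; Assumed noisy machine: each button usually, but not always,
;; delivers its nominal food.
(define (vending-machine state action)
  (case action
    (('a) (multinomial '(bagel cookie) '(0.9 0.1)))
    (('b) (multinomial '(bagel cookie) '(0.1 0.9)))
    (else 'nothing)))

(define (have-cookie? object) (equal? object 'cookie))

(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

;; Draw 100 of Sally's choices when she wants a cookie.
(repeat 100 (lambda () (choose-action have-cookie? vending-machine 'state)))
```

With a uniform action prior and these weights, button b is chosen with probability 0.9 / (0.1 + 0.9) = 0.9, so roughly nine samples in ten should be b.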
Technically, this method of making choices is not optimal; rather, it is soft-max optimal (also known as following the "Boltzmann policy").
Now imagine that we don't know Sally's goal (which food she wants), but we observe her pressing button b. We can use a query to infer her goal (this is sometimes called "inverse planning", since the outer query "inverts" the query inside choose-action).
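The goal inference can be sketched by wrapping choose-action in an outer query. As before, the noisy machine weights and the uniform prior over goal foods are assumptions for illustration.

```scheme
(define (action-prior) (uniform-draw '(a b)))

;; Assumed noisy machine, as in the stochastic example.
(define (vending-machine state action)
  (case action
    (('a) (multinomial '(bagel cookie) '(0.9 0.1)))
    (('b) (multinomial '(bagel cookie) '(0.1 0.9)))
    (else 'nothing)))

(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

;; Outer query: infer the goal food given that Sally pressed b.
(define (sample-goal)
  (rejection-query
   (define goal-food (uniform-draw '(bagel cookie)))
   (define goal? (lambda (outcome) (equal? outcome goal-food)))
   goal-food
   (equal? (choose-action goal? vending-machine 'state) 'b)))

(repeat 100 sample-goal)
```

Under these assumed weights Sally presses b with probability 0.9 if she wants a cookie and 0.1 if she wants a bagel, so the posterior probability of cookie is 0.9.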
Now let's imagine a more ambiguous case: button b is "broken" and will (uniformly) randomly result in a food from the machine. If we see Sally press button b, what goal is she most likely to have?
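A sketch of the broken-button case changes only the machine: button b now returns either food with equal probability, while button a keeps the assumed 0.9/0.1 weights. All numeric values are illustrative assumptions.

```scheme
(define (action-prior) (uniform-draw '(a b)))

;; Button a is still reliable (assumed weights); button b is "broken"
;; and returns a food uniformly at random.
(define (vending-machine state action)
  (case action
    (('a) (multinomial '(bagel cookie) '(0.9 0.1)))
    (('b) (multinomial '(bagel cookie) '(0.5 0.5)))
    (else 'nothing)))

(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

;; Infer the goal food given that Sally pressed the broken button b.
(define (sample-goal)
  (rejection-query
   (define goal-food (uniform-draw '(bagel cookie)))
   (define goal? (lambda (outcome) (equal? outcome goal-food)))
   goal-food
   (equal? (choose-action goal? vending-machine 'state) 'b)))

(repeat 100 sample-goal)
```

With these numbers, Sally presses b with probability 0.5 / 0.6 = 5/6 when she wants a cookie but only 0.5 / 1.4 = 5/14 when she wants a bagel, so the posterior probability of cookie works out to 0.7.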
Despite the fact that button b is equally likely to result in either bagel or cookie, we have inferred that Sally probably wants a cookie. Why would this be? (Hint: if she had wanted a bagel, what would she have done?) This effect follows, though indirectly, from the law of conservation of belief in this inference about inference setting.
If we have some prior knowledge about Sally's preferences (which goals she is likely to have) we can incorporate this immediately into the prior over goals (which above was uniform).
A more interesting situation is when we believe that Sally has some preferences, but we don't know what they are. We capture this by adding a higher level prior (a Dirichlet) over preferences. Using this we can learn about Sally's preferences from her actions: after seeing Sally press button b several times, what will we expect her to want the next time?
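One possible sketch of preference learning draws Sally's preferences from a symmetric Dirichlet and samples a fresh goal from them on each visit to the machine. The Dirichlet pseudo-counts, the machine weights, the three observed presses, and the helper names are all assumptions for illustration.

```scheme
(define (action-prior) (uniform-draw '(a b)))

;; Assumed noisy machine, as in the stochastic example.
(define (vending-machine state action)
  (case action
    (('a) (multinomial '(bagel cookie) '(0.9 0.1)))
    (('b) (multinomial '(bagel cookie) '(0.1 0.9)))
    (else 'nothing)))

(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

(define (sample-next-goal)
  (rejection-query
   ;; Higher-level prior over Sally's (unknown) food preferences.
   (define food-preferences (dirichlet '(1 1)))
   (define (goal-food-prior)
     (multinomial '(bagel cookie) food-preferences))
   (define (make-goal food)
     (lambda (outcome) (equal? outcome food)))
   ;; Each visit: Sally draws a goal from her preferences, then acts.
   (define (sally-choice)
     (choose-action (make-goal (goal-food-prior)) vending-machine 'state))
   ;; Query: what will she want next time?
   (goal-food-prior)
   ;; Condition: she pressed b on three separate occasions.
   (and (equal? (sally-choice) 'b)
        (equal? (sally-choice) 'b)
        (equal? (sally-choice) 'b))))

(repeat 100 sample-next-goal)
```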
Try varying the amount and kind of evidence. For instance, if Sally one time says "I want a cookie" (so you have directly observed her goal that time) how much evidence does that give you about her preferences, relative to observing her actions?
In the above models of goal and preference inference, we have assumed that the structure of the world (both the operation of the vending machine and the irrelevant initial state) was common knowledge: it consisted of non-random constructs used by both the agent (Sally) selecting actions and the observer interpreting these actions. What if we (the observer) don't know exactly how the vending machine works, but think that, however it works, Sally knows? We can capture this by placing uncertainty on the vending machine, inside the overall query but "outside" of Sally's inference:
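A sketch of this setup draws the outcome probabilities for each button inside the outer query and defines the machine there, so that Sally's inner inference uses the sampled machine. The Dirichlet priors and the names a-effects and b-effects are assumptions for illustration.

```scheme
(define (action-prior) (uniform-draw '(a b)))

(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

(define (sample-b-effect)
  (rejection-query
   ;; Observer's uncertainty about what each button does.
   (define a-effects (dirichlet '(1 1)))
   (define b-effects (dirichlet '(1 1)))
   ;; Sally knows this machine; the observer does not.
   (define (vending-machine state action)
     (case action
       (('a) (multinomial '(bagel cookie) a-effects))
       (('b) (multinomial '(bagel cookie) b-effects))
       (else 'nothing)))
   (define goal-food (uniform-draw '(bagel cookie)))
   (define goal? (lambda (outcome) (equal? outcome goal-food)))
   ;; Query: probability that button b yields a cookie.
   (second b-effects)
   ;; Condition: Sally wants a cookie and chooses button b.
   (and (equal? goal-food 'cookie)
        (equal? (choose-action goal? vending-machine 'state) 'b))))

(repeat 100 sample-b-effect)
```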
Here we have conditioned on Sally wanting the cookie and Sally choosing to press button b. Thus, we have no direct evidence of the effects of pressing the buttons on the machine. What happens if you condition instead on the action and outcome, but not the intentional choice of this outcome (that is, change the condition to (equal? (vending-machine 'state 'b) 'cookie))?
Now imagine a vending machine that has only one button, but it can be pressed many times. We don't know what the machine will do in response to a given button sequence. We do know that longer button sequences are a priori less likely.
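One way to sketch this machine represents an action as a list of presses generated recursively, with a memoized Dirichlet draw giving the unknown outcome probabilities for each sequence. The stopping probability 0.7, the Dirichlet pseudo-counts, and the name buttons->outcome-probs are assumptions for illustration.

```scheme
;; Recursive action prior: longer press sequences are less likely.
(define (action-prior)
  (if (flip 0.7) '(a) (pair 'a (action-prior))))

(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

(define (sample-twice-effect)
  (rejection-query
   ;; Unknown outcome probabilities, fixed per button sequence.
   (define buttons->outcome-probs
     (mem (lambda (buttons) (dirichlet '(1 1)))))
   (define (vending-machine state action)
     (multinomial '(bagel cookie) (buttons->outcome-probs action)))
   (define goal-food (uniform-draw '(bagel cookie)))
   (define goal? (lambda (outcome) (equal? outcome goal-food)))
   ;; Query: probability that two presses yield a cookie.
   (second (buttons->outcome-probs '(a a)))
   ;; Condition: Sally wants a cookie and presses the button twice.
   (and (equal? goal-food 'cookie)
        (equal? (choose-action goal? vending-machine 'state) '(a a)))))

(repeat 100 sample-twice-effect)
```

Changing the queried sequence to '(a), or the observed action to '(a), gives the other cases to compare.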
Compare the inferences that result if Sally presses the button twice to those if she only presses the button once. Why can we draw much stronger inferences about the machine when Sally chooses to press the button twice? When Sally does press the button twice, she could have done the "easier" (or rather, a priori more likely) action of pressing the button just once. Since she doesn't, a single press must have been unlikely to result in a cookie. This is an example of the principle of efficiency—all other things being equal, an agent will take the actions that require least effort (and hence, when an agent expends more effort all other things must not be equal). Indeed, this example shows that the principle of efficiency emerges from inference about inference via the Bayesian Occam's razor. Sally has an infinite space of possible actions. Because these actions are constructed by a recursive generative process, simpler actions are a priori more likely.
In these examples we have seen two important assumptions combining to allow us to infer something about the world from the indirect evidence of an agent's actions. The first assumption is the principle of rational action; the second is an assumption of knowledgeability: we assumed that Sally knows how the machine works, though we don't. Thus inference about inference can be a powerful way to learn what others already know, by observing their actions. (This example was inspired by Goodman, Baker, Tenenbaum, 2009.)
Joint inference about beliefs and desires
In social cognition, we often make joint inferences about two kinds of mental states: agents' beliefs about the world and their desires, goals or preferences. We can see an example of such a joint inference in the vending machine scenario. Suppose we condition on two observations: that Sally presses the button twice, and that this results in a cookie. Then, assuming that she knows how the machine works, we jointly infer that she wanted a cookie, that pressing the button twice is likely to give a cookie, and that pressing the button once is unlikely to give a cookie.
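The joint inference can be sketched by conditioning on both the action and its outcome, and querying the goal together with the machine's behavior. As in the one-button sketch, the action prior, the Dirichlet pseudo-counts, and the helper names are assumptions for illustration.

```scheme
(define (action-prior)
  (if (flip 0.7) '(a) (pair 'a (action-prior))))

(define (choose-action goal? transition state)
  (rejection-query
   (define action (action-prior))
   action
   (goal? (transition state action))))

(define (sample-joint)
  (rejection-query
   (define buttons->outcome-probs
     (mem (lambda (buttons) (dirichlet '(1 1)))))
   (define (vending-machine state action)
     (multinomial '(bagel cookie) (buttons->outcome-probs action)))
   (define goal-food (uniform-draw '(bagel cookie)))
   (define goal? (lambda (outcome) (equal? outcome goal-food)))
   ;; Query: Sally's goal, and the cookie probability for one press
   ;; and for two presses.
   (list goal-food
         (second (buttons->outcome-probs '(a)))
         (second (buttons->outcome-probs '(a a))))
   ;; Condition: she pressed twice, and a cookie came out.
   (and (equal? (choose-action goal? vending-machine 'state) '(a a))
        (equal? (vending-machine 'state '(a a)) 'cookie))))

(repeat 100 sample-joint)
```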
Notice the U-shaped distribution for the effect of pressing the button just once. Without any direct evidence about what happens when the button is pressed just once, we can infer that it probably won't give a cookie—because her goal is likely to have been a cookie but she didn't press the button just once—but there is a small chance that her goal was actually not to get a cookie, in which case pressing the button once could result in a cookie. This very complex (and hard to describe!) inference comes naturally from joint inference of goals and knowledge.
Knowing what you know
What if Sally may or may not know how the machine works, and knows this about herself?
An agent can have knowledge of some parts of the world and not others. We need to incorporate a model of how informational access (such as seeing an object) affects knowledge.
A Communication Game
Imagine playing the following two-player game. On each round the "teacher" pulls a die from a bag of weighted dice, and has to communicate to the "learner" which die it is (both players are familiar with the dice and their weights). However, the teacher may only communicate by giving the learner examples: showing them faces of the die.
We can formalize the inference of the teacher in choosing the examples to give by assuming that the goal of the teacher is to successfully teach the hypothesis -- that is, to choose examples such that the learner will infer the intended hypothesis (throughout this section we simplify the code by specializing to the situation at hand, rather than using the more general choose-action function introduced above):
(define (teacher die)
  (query
   (define side (side-prior))
   side
   (equal? die (learner side))))
The goal of the learner is to infer the correct hypothesis, given that the teacher chose to give these examples:
(define (learner side)
  (query
   (define die (die-prior))
   die
   (equal? side (teacher die))))
This pair of mutually recursive functions represents a teacher choosing examples or a learner inferring a hypothesis, each thinking about the other. However, notice that this recursion will never halt: it will be an unending chain of "I think that you think that I think that...". To avoid this infinite recursion, we say that eventually the learner will just assume that the teacher rolled the die and showed the side that came up (rather than reasoning about the teacher choosing a side):
(define (teacher die depth)
  (query
   (define side (side-prior))
   side
   (equal? die (learner side depth))))

(define (learner side depth)
  (query
   (define die (die-prior))
   die
   (if (= depth 0)
       (equal? side (roll die))
       (equal? side (teacher die (- depth 1))))))
To make this concrete, assume that there are two dice, A and B, which each have three sides (red, green, blue) that have weights like so:
Which hypothesis will the learner infer if the teacher shows the green side?
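A runnable sketch of the game follows. The specific weights are assumed values, chosen so that die A can never show red and die B is somewhat more likely than A to show green. The uniform priors over sides and dice, and the helper name die->probs, are likewise assumptions.

```scheme
(define (teacher die depth)
  (rejection-query
   (define side (side-prior))
   side
   (equal? die (learner side depth))))

(define (learner side depth)
  (rejection-query
   (define die (die-prior))
   die
   (if (= depth 0)
       (equal? side (roll die))
       (equal? side (teacher die (- depth 1))))))

;; Assumed weights for (red green blue) on each die.
(define (die->probs die)
  (case die
    (('A) '(0.0 0.2 0.8))
    (('B) '(0.1 0.3 0.6))
    (else 'uhoh)))

(define (side-prior) (uniform-draw '(red green blue)))
(define (die-prior) (if (flip) 'A 'B))
(define (roll die) (multinomial '(red green blue) (die->probs die)))

;; Try depth 0, then depth 1.
(define depth 0)
(repeat 500 (lambda () (learner 'green depth)))
```

With these assumed weights, a depth-0 learner infers B with probability 0.3 / (0.2 + 0.3) = 0.6, and a depth-1 learner infers B with probability of about 0.42.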
If we run this with recursion depth 0 (that is, a learner that does probabilistic inference without thinking about the teacher thinking), we find the learner infers hypothesis B most of the time (about 60% of the time). This is the same as using the "strong sampling" assumption: the learner infers B because B is more likely to have landed on the green side. However, if we increase the recursion depth we find this reverses: the learner infers B only about 40% of the time. Now die A becomes the better inference, because "if the teacher had meant to communicate B, they would have shown the red side because that can never come from A."
This model has been proposed by Shafto and Goodman (2008) as a model of natural pedagogy. They describe several experimental tests of this model in the setting of simple "teaching games," showing that people make inferences as above when they think the examples come from a helpful teacher, but not otherwise.
Communicating with Words
Unlike the situation above, in which concrete examples were given from teacher to student, words in natural language denote more abstract concepts. However, we can use almost the same setup to reason about speakers and listeners communicating with words, if we assume that sentences have literal meanings. We assume for simplicity that the meanings of sentences are truth-functional: that each sentence corresponds to a function from states of the world to true/false.
As above, the speaker chooses what to say in order to lead the listener to infer the correct state:
(define (speaker state)
  (query
   (define words (sentence-prior))
   words
   (equal? state (listener words))))
The listener does an inference of the state of the world given that the speaker chose to say what they did:
(define (listener words)
  (query
   (define state (state-prior))
   state
   (equal? words (speaker state))))
However, this suffers from two flaws: the recursion never halts, and the literal meaning has not been used. We slightly modify the listener function such that the listener either assumes that the literal meaning of the sentence is true, or figures out what the speaker must have meant given that they chose to say what they said:
(define (listener words)
  (query
   (define state (state-prior))
   state
   (if (flip literal-prob)
       (words state)
       (equal? words (speaker state)))))
Here the probability literal-prob controls the expected depth of recursion. Another way to bound the depth of recursion is with an explicit depth argument (which is decremented on each recursion).
Example: Scalar Implicature
Let us imagine a situation in which there are three plants which may or may not have sprouted. We imagine that there are three sentences that the speaker could say, "All of the plants have sprouted", "Some of the plants have sprouted", or "None of the plants have sprouted". For simplicity we represent the worlds by the number of sprouted plants (0,1,2, or 3) and take a uniform prior over worlds. Using the above representation for communicating with words:
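A sketch of this example follows, using the explicit depth argument rather than literal-prob so that the recursion is bounded deterministically. The sentence names and the uniform sentence prior are assumptions; sentences are represented directly as truth functions on states and compared with equal?, which assumes the implementation compares procedures by identity.

```scheme
;; States are the number of sprouted plants.
(define (state-prior) (uniform-draw '(0 1 2 3)))

;; Literal meanings: truth functions on states.
(define (all-sprouted state) (= 3 state))
(define (some-sprouted state) (< 0 state))
(define (none-sprouted state) (= 0 state))

;; Assumed uniform prior over the three sentences.
(define (sentence-prior)
  (uniform-draw (list all-sprouted some-sprouted none-sprouted)))

(define (speaker state depth)
  (rejection-query
   (define words (sentence-prior))
   words
   (equal? state (listener words depth))))

(define (listener words depth)
  (rejection-query
   (define state (state-prior))
   state
   (if (= depth 0)
       (words state)                      ; literal interpretation
       (equal? words (speaker state (- depth 1))))))

;; What does a depth-1 listener infer on hearing "some"?
(define depth 1)
(repeat 300 (lambda () (listener some-sprouted depth)))
```

At depth 1 under these assumptions, a speaker in state 3 says "some" only a quarter of the time (preferring "all"), so the listener assigns state 3 a posterior probability of 0.25 / 2.25 = 1/9, while states 1 and 2 each get 4/9.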
We see that if the listener hears "some" the probability of three out of three is low, even though the basic meaning of "some" is equally consistent with 3/3, 1/3, and 2/3. This is called the "some but not all" implicature.
Semantics: truth-functions or distributions?
In the above we have used a standard, truth-functional, formulation for the meaning of a sentence: each sentence specifies a (deterministic) predicate on world states. For instance, "All balls are red." translates into something like
(lambda (world) (null? (filter (lambda (x) (not (red? x))) (objects world))))
Thus the semantics of a sentence specifies the worlds in which the sentence is satisfied. This can be immediately (without changing any code) relaxed to probabilistic truth functions, which assign a probability to each world. This might be useful if we want to allow exceptions; for example, a noisy "all" might be:
(lambda (world) (if (flip noise) true (null? (filter (lambda (x) (not (red? x))) (objects world)))))
An alternative semantics would be for sentences to instead denote distributions on world states, that is, stochastic thunks (functions with no arguments) that return possible worlds. All of the models for communication discussed in this section can be reformulated to work with distributions on worlds as the denotation of a sentence (exercise: carry out this reformulation for the some-but-not-all example). A key question for ongoing research is whether they should be: are there psycholinguistic or semantic reasons to prefer one formulation over the other?
Recursively optimal planning.
Gergely and Csibra's principles of efficiency and equifinality follow from the Bayesian Occam's razor.