# Notes for future versions of tutorials or classes

## Re-org plan

• Mixture models: ordinary mixtures, 'naive' unbounded mixtures, hierarchical mixtures, LDA (admixtures), ...
• Recursive models: RR, language models.
• Higher-order distributions:
• transforming distributions,
• infinite models: names/words examples, CRP, CRPmem, (stick breaking), etc

## Notes on levels etc

'strong' vs 'weak' rational analysis: is there a fact of the matter in the world? is the generative model a description of the actual world or what the agent thinks or both?

generative model as representational theory of mind? or as just a description of the external world?

PLoT is "between" Marr's levels 1 and 2.

## Possible class content

Justifications for probability algebra: Dutch books, Cox's axioms.

Two views of prediction: expectation by simulation and enumeration. Two views of conditional prediction: rejection and enumeration.
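A minimal Python sketch (not Church) of the two views of conditional prediction, assuming a toy model of three fair coin flips where we ask about the first flip given that at least two came up heads. The model, the function names, and the sample count are illustrative assumptions, not part of the tutorial.

```python
import itertools
import random

def model(flips):
    """Return (query value, condition) for three coin flips a, b, c."""
    a, b, c = flips
    return a, (a + b + c >= 2)

def enumerate_query():
    """Exact conditional probability: sum over all 8 equally likely outcomes."""
    num = den = 0.0
    for flips in itertools.product([0, 1], repeat=3):
        value, ok = model(flips)
        if ok:
            den += 1 / 8
            num += value / 8
    return num / den

def rejection_query(n_samples=20000, seed=0):
    """Approximate the same quantity: simulate, keep runs where the condition holds."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        flips = [rng.randint(0, 1) for _ in range(3)]
        value, ok = model(flips)
        if ok:
            kept.append(value)
    return sum(kept) / len(kept)
```

Enumeration gives the exact answer (3 of the 4 consistent outcomes have the first flip heads), while rejection approaches it as the number of accepted samples grows.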

Inference algorithms: importance sampling, MCMC, SMC, dynamic programming. As both CS and psych process models.

## Index

List of all tutorial pages (for use in backups, using Special:Export):

Generative Models

Conditioning

Patterns of Inference

Learning as Conditional Inference

Occam's Razor, and the Law of Conservation of Belief

Hierarchical Models

Recursive Models

Non-Parametric Models

The Meaning of Probabilistic Programs

Tutorial Notes

Class project ideas

Ideas for graphics:

Barplot:
http://projects.csail.mit.edu/church/serv/d430a59bce5c5b93f865/

Scatterplot with overlay:
http://projects.csail.mit.edu/church/serv/41e1f7dcda0ea9ce20a3/


## ESSLLI To Do List

Josh's to do list:

1. Finish examples of concept learning:
   a. Flesh out rectangle learning: draw 20 random samples from the prior and 20 random samples from the posterior.
   b. Put in simple number game.
   c. More cognitive examples.
   ---CONSIDER MOVING CONCEPT LEARNING TO A SEPARATE SECTION OF LEARNING AS CONDITIONAL INFERENCE, PRIOR TO OCCAM'S RAZOR
   ---When we introduce the size principle, mention how the size principle comes from strong sampling. Talk about weak sampling as an alternative and the behavioral result.

2. Go over hierarchical models, add in more marble colors, insert nice pictures of bags of marbles, insert shape bias example.

3. Make graphics wrapper functions - see note from Noah and Andreas about where to put these.

4. Put in examples of productivity of thought, reasoning under uncertainty, systematicity/modularity/composability.

---compositionality and probability are means that achieve these goals.

Noah's to do list:

1. Go over the first part of the tutorial (up through Occam); revise and clean up.

2. Hierarchical section: Remove lambda stuff (move some to first section). Describe the 'tree-like dependency' structure better. Add some examples (which are most needed?).

3. Complete recursive models section. RR before grammar (written in direct style). Grammar builds to unfold style.

4. Complete social cognition (nested query) examples: simple goal and belief inference (alien vending machine). Partial knowledge in SI. Stuff about equifinality and efficiency.

5. Make sure 'non-parametric' and 'meaning of prob prog' sections are not embarrassing. (Get help from Dan and Tim?)

## Introduction

Insert a few pictures of blocks, where the questions are:

is it stable?
if it is stable, how hard would you have to bump the table to get it to fall?
which blocks can you remove without bringing down the rest of the tower?
if you see it a moment later and it has fallen, which block was most likely removed?
which block is likely to fall furthest to the right side if it falls?
if it falls, will it knock over this object?


Another example:

Google spelling correction. You have a generative model, so you can do all sorts of things it was never trained for. Why?


Put in slides of Newton, Mendel to illustrate causal theory...

## Conditioning

Explain a bit more how query works.

In a semester-long class, or a longer tutorial, we should explain how Metropolis-Hastings works. We want to give an intuition for the need for burn-in and mixing. Show inference for an Ising model as an example: the challenge of mixing between all-on and all-off.
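A minimal Python sketch (not Church) of the mixing problem, assuming a 6 x 6 Ising grid with periodic boundaries, coupling strength 1, and single-spin-flip Metropolis updates. The grid size, inverse temperatures, and function name are illustrative assumptions.

```python
import math
import random

def ising_mh(beta, init, size=6, sweeps=300, seed=0):
    """Single-spin-flip Metropolis on a size x size periodic Ising grid.

    Returns the magnetization (mean spin) recorded after every sweep.
    """
    rng = random.Random(seed)
    spins = [[init] * size for _ in range(size)]
    mags = []
    for _ in range(sweeps):
        for _ in range(size * size):
            i, j = rng.randrange(size), rng.randrange(size)
            neighbors = (spins[(i + 1) % size][j] + spins[(i - 1) % size][j]
                         + spins[i][(j + 1) % size] + spins[i][(j - 1) % size])
            # Energy change from flipping spin (i, j), with E = -sum of s_i * s_j.
            delta_e = 2 * spins[i][j] * neighbors
            # Metropolis rule: always accept downhill moves, uphill with prob exp(-beta * dE).
            if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                spins[i][j] = -spins[i][j]
        mags.append(sum(map(sum, spins)) / size ** 2)
    return mags

def mean(xs):
    return sum(xs) / len(xs)

if __name__ == "__main__":
    # Strong coupling: the chain stays in whichever mode it started in.
    print("beta=1.0, start all-on :", mean(ising_mh(1.0, +1)))
    print("beta=1.0, start all-off:", mean(ising_mh(1.0, -1)))
    # Weak coupling: the chain forgets its starting point quickly.
    print("beta=0.1, start all-on :", mean(ising_mh(0.1, +1)))
```

At strong coupling the two runs target the same distribution but report opposite magnetizations, which is the failure to mix between all-on and all-off; at weak coupling the starting state washes out after a short burn-in.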

### Reasoning with complex queries

provide a more extended discussion of language of thought issues here, on the tug of war example.

generative model as a lexicon for the language used to create query and condition expressions.

this is a good fragment of a language for reasoning. it has:

- objects, finite number (generated from gensym)

- names for entities which can be different from the actual entities

- attributes (functions with objects as one argument): (strength 'bob)

--- real valued

--- boolean valued

- relations (functions with two objects as arguments): (stronger-than 'bob 'emily), (+ ... ), (>= ...)

- persistent properties of objects (with mem)

- temporally indexed properties or facts (who wins a particular match)

- math and logic

--- sets and subsets

--- analog magnitudes

- ability to define and reason about arbitrary propositions, including features as boolean filters on continuous values
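A minimal Python sketch (not Church) of how the pieces in this list fit together in a tug-of-war model: a persistent attribute (strength, memoized per person), a per-match property (laziness), a relation (who is stronger), and a query conditioned by rejection. The Gaussian prior, the 25% laziness rate, the half-strength rule, and all function names are assumptions for illustration.

```python
import random

def make_world(rng):
    """Build one sampled world: returns (strength, winner) functions."""
    strengths = {}

    def strength(person):
        # Persistent property: drawn once per person, like mem in Church.
        if person not in strengths:
            strengths[person] = rng.gauss(10, 3)
        return strengths[person]

    def pulling(person):
        # Per-match property: a lazy person pulls at half strength (assumed rule).
        lazy = rng.random() < 0.25
        return strength(person) / 2 if lazy else strength(person)

    def winner(team1, team2):
        total1 = sum(pulling(p) for p in team1)
        total2 = sum(pulling(p) for p in team2)
        return team1 if total1 > total2 else team2

    return strength, winner

def posterior_bob_stronger(n_accept=2000, seed=0):
    """Estimate P(bob stronger than emily | bob beat emily) by rejection."""
    rng = random.Random(seed)
    hits = accepted = 0
    while accepted < n_accept:
        strength, winner = make_world(rng)
        if winner(("bob",), ("emily",)) == ("bob",):   # condition
            accepted += 1
            hits += strength("bob") > strength("emily")  # query
    return hits / accepted
```

The query and condition are written using only the vocabulary the generative model defines, which is the sense in which the model acts as a lexicon.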

it will work for experiments because it draws on core knowledge systems:

- object files

- analog number

- social cognition

prediction: will only be able to reason well if core systems can be engaged. teams of bigger than 3 or 4 people won't work unless they can be chunked into subteams or treated as a mass. strengths have analog magnitude discrimination signatures (need them to be 150% or 200% as big).

other broad domains from theory learning

- family trees

- kinship

- magnetism

## Learning as conditional inference

Put in a subsection "Of priors and posteriors" describing the "usual" Bayesian setup. Students need to be familiar with this language. (Link to the meaning of prob progs section for more in depth relationship?)

What if the draws from the coin are not IID?

HMM idea: you have two coins that someone switches between every so often, and they have various weights. you see the sequence of flips and have to learn how the coins work, the switching rate, and infer the segmentation.

- easiest version: you know the coins and the switching rate, just have to infer the segmentation

- a bit harder: you know the coins but have to learn the switching rate and infer the segmentation

- full glory: you have to learn coins and everything else
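A minimal Python sketch (not Church) of the easiest version, assuming two coins with known heads probabilities and a known switching rate, and using the Viterbi algorithm to recover the most probable segmentation. The specific weights (0.9 and 0.1), the switching rate (0.1), the uniform choice of starting coin, and the function name are illustrative assumptions.

```python
import math

def viterbi_two_coins(flips, heads_prob=(0.9, 0.1), switch_prob=0.1):
    """Most probable coin sequence for a non-empty string of 'H'/'T' flips."""
    n_states = len(heads_prob)

    def emit(state, flip):
        p = heads_prob[state]
        return math.log(p if flip == "H" else 1.0 - p)

    def trans(prev, cur):
        return math.log(1.0 - switch_prob if prev == cur else switch_prob)

    # Log-probability of the best path ending in each state (uniform start).
    scores = [math.log(1.0 / n_states) + emit(s, flips[0]) for s in range(n_states)]
    backpointers = []
    for flip in flips[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: scores[r] + trans(r, s))
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + trans(best_prev, s) + emit(s, flip))
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For example, `viterbi_two_coins("HHHHHHTTTTTT")` assigns the first six flips to coin 0 and the last six to coin 1. The harder versions would put priors on `switch_prob` and `heads_prob` and infer them jointly with the segmentation.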

nice to compare with human judgments

Some version of this could also be a class project. Copied there...

## Patterns of inference: Screening off and explaining away

Key point is when you condition, things which were independent become dependent, or things which were dependent become independent.

The first is most familiar: screening off.

leave late -> arrive airport on time -> missed plane

context-specific independence: vision example, the colored mach-card.

(also, useful for prediction and learning. physics is like this... tracking...)

the second is more interesting perhaps: explaining away.

for many causal structures, this dependence is an anticorrelation:

- this happens if you have additive or disjunctive causes

- e.g. simple medical diagnosis (breast cancer, benign cyst): see how they become coupled.

- this anticorrelation is expressed in belief dynamics as we get additional data.

--- told no cyst.


anti-correlation can be pushed by other data, in an interesting reasoning dynamics. give other information: sister had a benign cyst... other symptoms: feels hard, or biopsy
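A minimal Python sketch (not Church) of explaining away in the diagnosis example, assuming two independent causes (cancer, cyst) and a lump that appears if either is present. The prior probabilities and the deterministic-OR link are illustrative assumptions, not clinical figures.

```python
from itertools import product

P_CANCER = 0.01  # assumed prior
P_CYST = 0.2     # assumed prior

def posterior_cancer(cyst_evidence=None):
    """P(cancer | lump observed, optionally cyst known), by exact enumeration."""
    num = den = 0.0
    for cancer, cyst in product([True, False], repeat=2):
        prior = ((P_CANCER if cancer else 1 - P_CANCER)
                 * (P_CYST if cyst else 1 - P_CYST))
        lump = cancer or cyst  # disjunctive causes
        if not lump:
            continue  # condition: a lump was observed
        if cyst_evidence is not None and cyst != cyst_evidence:
            continue  # condition: extra information about the cyst
        den += prior
        if cancer:
            num += prior
    return num / den

if __name__ == "__main__":
    print("lump only:       ", posterior_cancer())
    print("lump, told cyst: ", posterior_cancer(True))
    print("lump, no cyst:   ", posterior_cancer(False))
```

Learning that there is a cyst pushes the cancer posterior back down to its prior, while learning that there is no cyst pushes it to 1: the two causes are independent a priori but anticorrelated once the lump is observed.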

not just anti-correlation. to see this, consider a mathy example:

(define take-sample
  (lambda ()
    (rejection-query
      (define a (uniform-draw (iota 10)))
      (define b (uniform-draw (iota 10)))

      ;; query: the joint value of a and b
      (list a b)

      ;; condition: fixing the sum makes a and b anti-correlated
      ;; (the target 9 is an assumed value; the notes leave it unspecified)
      (equal? 9 (+ a b)))))

(hist (repeat 500 take-sample) "a, b")

other conditions to swap in:

correlation: (equal? a b)

complex:
(equal? 3 (abs (- a b)))
(equal? 3 (modulo (- a b) 10))
(equal? (modulo a 2) (modulo b 2))
(equal? (modulo a 5) (modulo b 5))
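A minimal Python sketch (not Church) that makes the induced dependence explicit by enumerating all pairs (a, b) consistent with each condition and computing their Pearson correlation. The target sum of 9 and the dictionary of condition names are illustrative assumptions.

```python
from itertools import product
from math import sqrt

def correlation(condition):
    """Pearson correlation of a and b, uniform over digit pairs satisfying condition."""
    pairs = [(a, b) for a, b in product(range(10), repeat=2) if condition(a, b)]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    return cov / sqrt(var_a * var_b)

CONDITIONS = {
    "sum is 9": lambda a, b: a + b == 9,
    "equal": lambda a, b: a == b,
    "differ by 3": lambda a, b: abs(a - b) == 3,
    "same parity": lambda a, b: a % 2 == b % 2,
}

if __name__ == "__main__":
    for name, condition in CONDITIONS.items():
        print(name, round(correlation(condition), 3))
```

Fixing the sum gives perfect anti-correlation and requiring equality gives perfect correlation, while the parity condition couples a and b only weakly, so conditioning can induce many kinds of dependence.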


vision example: condition on it being darker there by the ratio you can see at the shadow boundary: illumination difference is equal to the luminance distance there.

nonmonotonic logic: classically, confidence can only increase. two states: unknown and known, and as you learn more, you can only go from unknown to known. but in common sense, you have a continuous spectrum of confidence and you can move up and down nonmonotonically.

## Bayesian Occam's Razor

Explaining the force of inference: the Law of Conservation of Belief.

Finding structure in the world: Bayesian Occam's razor.

size principle:

a, b, c, ... subset examples
learning curve: 1, 3, 10

number game.
multiples:  2, 3 shot learning
intervals: slow learning
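A minimal Python sketch (not Church) of the size principle in a toy number game, assuming three nested hypotheses over 1 to 100, a uniform prior, and strong sampling (each example drawn uniformly from the true concept). The hypothesis set and names are illustrative assumptions.

```python
from fractions import Fraction

HYPOTHESES = {
    "multiples of 10": set(range(10, 101, 10)),  # 10 members
    "even numbers": set(range(2, 101, 2)),       # 50 members
    "all numbers": set(range(1, 101)),           # 100 members
}

def posterior(examples):
    """Posterior over hypotheses given examples, under strong sampling."""
    scores = {}
    for name, extension in HYPOTHESES.items():
        consistent = all(x in extension for x in examples)
        # Size principle: each example has likelihood 1 / |extension|.
        likelihood = (Fraction(1, len(extension)) ** len(examples)
                      if consistent else Fraction(0))
        scores[name] = Fraction(1, len(HYPOTHESES)) * likelihood
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}
```

With examples `[10]`, `[10, 30]`, and `[10, 30, 60]`, the posterior on "multiples of 10" is 10/13, 20/21, and 1000/1009: the smallest consistent hypothesis wins within two or three examples.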

(show figure of a hybrid model)


this is a special case of the likelihood principle.

  what we saw with coin flipping, 0.5 vs. 0.95.


both size principle and likelihood principle derive from a more general notion: law of conservation of belief.


cognitive examples: implicit negative evidence

- size principle applied to concept and word learning

--- show word learning figure from TiCS

--- explain cancellation with violation of random sampling.

- syntax acquisition


Now, for most cases of finding structure in the world, you need to have a more complex hypothesis space with different amounts of structure. you need to decide about hypothesis complexity, and balance complexity with fit to the data.

the ideas behind the size principle or the likelihood principle (basically the law of conservation of belief) can be extended to this case using models with multiple draws.

-> bayes occam

model selection: choose between a fair coin and weighted coin
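A minimal Python sketch (not Church) of this model selection, assuming the weighted coin has a uniform prior on its weight, so its marginal likelihood for a sequence with k heads in n flips is k!(n-k)!/(n+1)!. The equal prior over the two models and the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial

def marginal_fair(flips):
    """Probability of the exact sequence under a fair coin."""
    return Fraction(1, 2) ** len(flips)

def marginal_weighted(flips):
    """Probability of the exact sequence, averaging over a uniform weight prior.

    Integral of w^k (1 - w)^(n - k) over [0, 1] equals k! (n - k)! / (n + 1)!.
    """
    n, k = len(flips), flips.count("H")
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

def posterior_fair(flips, prior_fair=Fraction(1, 2)):
    """Posterior probability that the coin is fair."""
    fair = prior_fair * marginal_fair(flips)
    weighted = (1 - prior_fair) * marginal_weighted(flips)
    return fair / (fair + weighted)
```

For `"HHTHT"` the fair coin is favored (posterior 15/23) even though the weighted model can fit the data at least as well at its best weight, because the weighted model spreads its belief over many sequences. For ten heads in a row the posterior on fair drops to 11/1035.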

towards bayes occam's razor / minimum description length
simple 1-20 number game with 0, 1 or 2 free parameters.

scene inference

curve fitting.


## Hierarchical Models

Add discussion of abstractness as 'distance from (perceptual) evidence' (hierarchical notion) and 'function with more variables' (lambda notion). It is often the case that lambda abstraction will imply hierarchical abstraction, since the lambda term will be applied to generate observed data. (Tree-like dependence can come in from either of these structures.)

Discuss two different factors that can drive the blessing of abstraction: smaller hypothesis spaces and pooling of statistical evidence.

Generate convincing learning curve plots for BoA in the marbles setting. (Use expected squared error from truth as measure of how much has been learned.)

## Nonparametric models

In drawing a sample from the DP mixture, give the example of the high school lunch room. The way MH query actually works is by hypothetically taking a person out and seeing where they go.
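A minimal Python sketch (not Church) of the lunch room picture, assuming a Chinese restaurant process prior over seatings with concentration `alpha`, plus a move that takes one person out and reseats them. The reseating step here uses the prior only (no data likelihood), which is a simplifying assumption; the function names are hypothetical.

```python
import random

def crp_seating(n_people, alpha=1.0, seed=0):
    """Seat people one at a time; returns (table index per person, table counts)."""
    rng = random.Random(seed)
    counts = []   # number of people at each table
    seating = []  # table index chosen by each person
    for _ in range(n_people):
        # Join an existing table in proportion to its size, or a new one w.p. ~ alpha.
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        seating.append(table)
    return seating, counts

def reseat(seating, counts, person, alpha, rng):
    """Take one person out of the room and let them choose a table again."""
    counts[seating[person]] -= 1  # an emptied table keeps weight 0
    weights = counts + [alpha]
    table = rng.choices(range(len(weights)), weights=weights)[0]
    if table == len(counts):
        counts.append(0)
    counts[table] += 1
    seating[person] = table
```

In a full DP mixture the reseating weights would also be multiplied by how well each table's parameters explain that person's data.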

## Recursive models

recursion follows from untyped lambda, though representational bias may be different.

talk about recursion, productivity, and parsimony. to generate an infinite set of (structurally) different representations it is necessary to use recursion. to generate a very large (combinatorial) set of representations it is not. however describing the large set of representations is more parsimonious under recursion -- the recursive representation will thus be preferred by bayes occam's razor.

Tecumseh Fitch, Hauser stuff?

### Compositional concept learning

Give extensions of rational rules to inferring simple motor programs from "handwritten characters" on a 3 x 3 retina.

## Nested query

vending machine example for goal and belief inference, and bob's box...