Tutorial Notes
From Church Wiki
Notes for future versions of tutorials or classes
Link back to ESSLLI Tutorial
Re-org plan
- Mixture models: ordinary mixtures, 'naive' unbounded mixtures, hierarchical mixtures, LDA (admixtures), ...
- Recursive models: RR, language models.
- Higher-order distributions:
- transforming distributions,
- infinite models: names/words examples, CRP, CRPmem, (stick breaking), etc
Notes on levels etc
'strong' vs 'weak' rational analysis: is there a fact of the matter in the world? is the generative model a description of the actual world or what the agent thinks or both?
generative model as representational theory of mind? or as just a description of the external world?
PLoT is "between" Marr's levels 1 and 2.
Possible class content
Justifications for probability algebra: Dutch books, Cox's axioms.
Two views of prediction: expectation by simulation and enumeration. Two views of conditional prediction: rejection and enumeration.
Inference algorithms: importance, mcmc, smc, dynamic programming. As both CS and psych process models.
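A possible illustration of the two views of prediction and of conditional prediction, sketched in Python rather than Church so it can be run standalone. The two-coin model, the function names, and the sample sizes are all assumptions made for the example, not taken from the tutorial.

```python
import random
from fractions import Fraction
from itertools import product

def enumerate_model():
    """Yield (probability, outcome) for two independent fair coin flips."""
    for a, b in product([0, 1], repeat=2):
        yield Fraction(1, 4), (a, b)

def sample_model(rng):
    """Draw one outcome of the same model by simulation."""
    return (rng.randint(0, 1), rng.randint(0, 1))

def expectation_by_enumeration(f):
    return sum(p * f(x) for p, x in enumerate_model())

def expectation_by_simulation(f, n, rng):
    return sum(f(sample_model(rng)) for _ in range(n)) / n

def conditional_by_enumeration(f, cond):
    # Renormalize over the outcomes that satisfy the condition.
    num = sum(p * f(x) for p, x in enumerate_model() if cond(x))
    den = sum(p for p, x in enumerate_model() if cond(x))
    return num / den

def conditional_by_rejection(f, cond, n, rng):
    # Keep sampling until n outcomes satisfy the condition.
    accepted = []
    while len(accepted) < n:
        x = sample_model(rng)
        if cond(x):
            accepted.append(f(x))
    return sum(accepted) / n

def first_flip(x):
    return x[0]

def at_least_one_head(x):
    return x[0] + x[1] >= 1

rng = random.Random(0)
exact_prior = expectation_by_enumeration(first_flip)
exact_posterior = conditional_by_enumeration(first_flip, at_least_one_head)
approx_prior = expectation_by_simulation(first_flip, 20000, rng)
approx_posterior = conditional_by_rejection(first_flip, at_least_one_head, 20000, rng)
print(exact_prior, exact_posterior, approx_prior, approx_posterior)
```

Enumeration gives exact answers but needs the whole outcome space; simulation and rejection only need a sampler, at the price of noise and wasted samples.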
Index
List of all tutorial pages (for use in backups, using Special:Export):
Generative Models
Conditioning
Patterns of Inference
Learning as Conditional Inference
Occam's Razor, and the Law of Conservation of Belief
Hierarchical Models
Recursive Models
Non-Parametric Models
Inference about inference: Nested query
The Meaning of Probabilistic Programs
Tutorial Notes
Class project ideas
Useful Links
Link to Class project ideas
To upload images: https://projects.csail.mit.edu/church/wiki/Special:Upload
Ideas for graphics:
Barplot: http://projects.csail.mit.edu/church/serv/d430a59bce5c5b93f865/
Scatterplot with overlay: http://projects.csail.mit.edu/church/serv/41e1f7dcda0ea9ce20a3/
ESSLLI To Do List
Josh's to do list:
1. Finish examples of concept learning:
a. flesh out rectangle learning
---draw 20 random samples from the prior, 20 random samples from the posterior
b. put in simple number game.
c. more cognitive examples
---CONSIDER MOVING CONCEPT LEARNING TO A SEPARATE SECTION OF LEARNING AS CONDITIONAL INFERENCE, PRIOR TO OCCAM'S RAZOR
---When we introduce size principle, mention how the size principle comes from strong sampling. Talk about weak sampling as an alternative and the behavioral result.
2. Go over hierarchical models, add in more marble colors, insert nice pictures of bags of marbles, insert shape bias example.
3. Make graphics wrapper functions - see note from Noah and Andreas about where to put these.
4. Put examples of productivity of thought, reasoning under uncertainty, systematicity/modularity/composability.
---compositionality and probability are means that achieve these goals.
Noah's to do list:
1. Go over first part of tutorial (up through occam) revise and clean up.
2. Hierarchical section: Remove lambda stuff (move some to first section). Describe the 'tree-like dependency' structure better. Add some examples (which are most needed?).
3. Complete recursive models section. RR before grammar (written in direct style). Grammar builds to unfold style.
4. Complete social cognition (nested query) examples: simple goal and belief inference (alien vending machine). Partial knowledge in SI. Stuff about equifinality and efficiency.
5. Make sure 'non-parametric' and 'meaning of prob prog' sections are not embarrassing. (get help from dan and tim?)
Introduction
Insert a few pictures of blocks, where the questions are:
- is it stable?
- if it is stable, how hard would you have to bump the table to get it to fall?
- which blocks can you remove without bringing down the rest of the tower?
- if you see it a moment later and it has fallen, which block was most likely removed?
- which block is likely to fall furthest to the right side if it falls?
- if it falls, will it knock over this object?
Another example:
google spelling correction. You have a generative model, can do all sorts of things it was never trained for. Why?
Put in slides of Newton, Mendel to illustrate causal theory...
Conditioning
Explain a bit more how query works.
In a semester-long class, or a longer tutorial, we should explain how Metropolis-Hastings works. Want to give an intuition for the need for burn-in and mixing. Show as an example inference for an Ising model... the challenge of mixing between all-on and all-off.
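One way the Ising mixing example might look, as a rough Python sketch. The 1D ring, the ten spins, the two inverse temperatures, and the helper names are assumptions chosen to keep the example small; a class version would more likely use a 2D grid.

```python
import math
import random

def mh_ising_ring(n_spins, beta, n_steps, rng, coupling=1.0):
    """Single-spin-flip Metropolis-Hastings on a 1D Ising ring.

    Starts from the all-on state and returns the magnetization after each step.
    """
    spins = [1] * n_spins
    magnetization = sum(spins)
    trace = []
    for _ in range(n_steps):
        i = rng.randrange(n_spins)
        left = spins[i - 1]                  # index -1 wraps around the ring
        right = spins[(i + 1) % n_spins]
        # Energy is -coupling * (sum of neighbouring products), so flipping
        # spin i changes the energy by 2 * coupling * s_i * (left + right).
        delta_e = 2.0 * coupling * spins[i] * (left + right)
        # The proposal is symmetric, so the acceptance ratio is exp(-beta * delta_e).
        if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
            spins[i] = -spins[i]
            magnetization += 2 * spins[i]
        trace.append(magnetization)
    return trace

def count_mode_switches(trace):
    """Count moves between mostly-on (m > 0) and mostly-off (m < 0)."""
    switches = 0
    last_sign = 0
    for m in trace:
        sign = (m > 0) - (m < 0)
        if sign != 0:
            if last_sign != 0 and sign != last_sign:
                switches += 1
            last_sign = sign
    return switches

rng = random.Random(0)
hot_trace = mh_ising_ring(10, beta=0.1, n_steps=20000, rng=rng)
cold_trace = mh_ising_ring(10, beta=3.0, n_steps=20000, rng=rng)
hot_switches = count_mode_switches(hot_trace)
cold_switches = count_mode_switches(cold_trace)
print("mode switches at beta=0.1:", hot_switches)
print("mode switches at beta=3.0:", cold_switches)
```

At weak coupling the chain forgets its all-on start within a few sweeps (burn-in is short) and crosses between modes constantly; at strong coupling it should stay near the mode it started in, even though all-off is equally probable under the target distribution.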
Bayesian expert systems for personalized medicine:
http://www.seralogix.com/patterns.htm
http://www.seralogix.com/downloads/patterns.pdf
Reasoning with complex queries
provide a more extended discussion of language of thought issues here, on the tug of war example.
generative model as a lexicon for the language used to create query and condition expressions.
this is a good fragment of a language for reasoning. it has:
- objects, finite number (generate from gensym)
- names for entities which can be different from the actual entities
- attributes (functions with objects as one argument): (strength 'bob)
--- real valued
--- boolean valued
- relations (functions with two objects as arguments): (stronger-than 'bob 'emily), (+ ... ), (>= ...)
- persistent properties of objects (with mem)
- temporally indexed properties and facts (who wins a particular match)
- math and logic
--- sets and subsets
--- analog magnitudes
- ability to define and reason about arbitrary propositions, including features as boolean filters on continuous
it will work for experiments because it draws on core knowledge systems:
- object files
- analog number
- social cognition
prediction: people will only be able to reason well if core systems can be engaged. teams bigger than 3 or 4 people won't work unless they can be chunked into subteams or treated as a mass. strengths have analog magnitude discrimination signatures (need them to be 150% or 200% as big).
other broad domains from theory learning
- family trees
- kinship
- magnetism
Learning as conditional inference
Put in a subsection "Of priors and posteriors" describing the "usual" Bayesian setup. Students need to be familiar with this language. (Link to the meaning of prob progs section for more in depth relationship?)
What if the draws from the coin are not IID?
HMM idea: you have two coins that someone switches between every so often, and they have various weights. you see the sequence of flips and have to learn how the coins work, the switching rate, and infer the segmentation.
- easiest version: you know the coins and the switching rate, just have to infer the segmentation
- a bit harder: you know the coins but have to learn the switching rate and infer the segmentation
- full glory: you have to learn coins and everything else
nice to compare with human judgments
Some version of this could also be a class project. Copied there...
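For the easiest version of the two-coin HMM above (coins and switching rate known, segmentation unknown), the most probable segmentation can be found by dynamic programming. A minimal Python sketch using the Viterbi algorithm follows; the coin weights, switching rate, flip sequences, and function name are made-up values for illustration.

```python
import math

def viterbi_two_coins(flips, heads_prob, switch_prob):
    """Most probable coin index (0 or 1) for each flip in a string of 'H'/'T'.

    heads_prob holds the two known coin weights; switch_prob is the known
    probability of swapping coins between consecutive flips.
    """
    states = range(2)

    def emit(state, flip):
        p = heads_prob[state] if flip == "H" else 1.0 - heads_prob[state]
        return math.log(p)

    def trans(prev, cur):
        return math.log(1.0 - switch_prob if prev == cur else switch_prob)

    # best[s] = log-probability of the best path that ends in coin s.
    best = [math.log(0.5) + emit(s, flips[0]) for s in states]
    back = []
    for flip in flips[1:]:
        pointers, new_best = [], []
        for cur in states:
            prev = max(states, key=lambda s: best[s] + trans(s, cur))
            pointers.append(prev)
            new_best.append(best[prev] + trans(prev, cur) + emit(cur, flip))
        back.append(pointers)
        best = new_best

    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

clean = viterbi_two_coins("HHHHHHTTTTTT", [0.9, 0.1], 0.1)
noisy = viterbi_two_coins("HHHTHHH", [0.9, 0.1], 0.1)
print(clean)
print(noisy)
```

With these numbers a single stray tail is cheaper to explain as an unlucky flip than as two switches, which is the kind of judgment that would be nice to compare with people. The harder versions would put priors on the switching rate and the coin weights and infer them jointly.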
Patterns of inference: Screening off and explaining away
Key point is when you condition, things which were dependent can become independent, or things which were independent can become dependent.
The first is most familiar: screening off.
leave late -> arrive airport on time -> missed plane
context-specific independence: vision example, the colored mach-card.
(also, useful for prediction and learning. physics is like this... tracking...)
the second is more interesting perhaps: explaining away.
for many causal structures, this dependence is an anticorrelation:
this happens if you have additive or disjunctive causes
e.g. simple medical diagnosis (breast cancer, benign cyst) see how they become coupled.
this anticorrelation is expressed in belief dynamics as we get additional data.
told no cyst.
anti-correlation can be pushed by other data, in an interesting reasoning dynamics. give other information: sister had a benign cyst... other symptoms: feels hard, or biopsy
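The belief dynamics in the diagnosis example can be checked by enumeration. The following Python sketch assumes a lump that appears exactly when there is cancer or a cyst (a deterministic disjunction) and uses invented priors; both are assumptions for illustration, not numbers from the tutorial.

```python
from fractions import Fraction
from itertools import product

P_CANCER = Fraction(1, 100)  # hypothetical prior
P_CYST = Fraction(1, 5)      # hypothetical prior

def joint():
    """Yield (probability, world) for every assignment of the two causes."""
    for cancer, cyst in product([True, False], repeat=2):
        p = (P_CANCER if cancer else 1 - P_CANCER) * (P_CYST if cyst else 1 - P_CYST)
        # The symptom is a deterministic OR of the two independent causes.
        yield p, {"cancer": cancer, "cyst": cyst, "lump": cancer or cyst}

def prob(query, given=lambda w: True):
    """P(query | given), computed by summing over all worlds."""
    num = sum(p for p, w in joint() if given(w) and query(w))
    den = sum(p for p, w in joint() if given(w))
    return num / den

def has_cancer(w):
    return w["cancer"]

prior = prob(has_cancer)
given_lump = prob(has_cancer, lambda w: w["lump"])
given_lump_and_cyst = prob(has_cancer, lambda w: w["lump"] and w["cyst"])
given_lump_no_cyst = prob(has_cancer, lambda w: w["lump"] and not w["cyst"])
print(prior, given_lump, given_lump_and_cyst, given_lump_no_cyst)
```

Seeing the lump raises belief in cancer; learning there is a cyst explains the lump away and sends that belief back to the prior; being told there is no cyst pushes it to certainty. The two causes are independent a priori and coupled only once the lump is observed.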
not just anti-correlation. to see this consider a mathy example:
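(define take-sample
  (lambda ()
    (rejection-query
      (define a (uniform-draw (iota 10)))
      (define b (uniform-draw (iota 10)))
      ;; query: the pair (a b)
      (list a b)
      ;; condition: the sum is fixed (9 is an illustrative value; the original note omits it)
      (equal? (+ a b) 9))))
(hist (repeat 500 take-sample) "a, b")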
correlation: (equal? a b)
complex:
(equal? 3 (abs (- a b)))
(equal? 3 (modulo (- a b) 10))
(equal? (modulo a 2) (modulo b 2))
(equal? (modulo a 5) (modulo b 5))
vision example: condition on it being darker there by the ratio you can see at the shadow boundary: illumination difference is equal to the luminance distance there.
nonmonotonic logic: classically, confidence can only increase. two states: unknown and known, and as you learn more, you can only go from unknown to known. but in common sense, you have a continuous spectrum of confidence and you can move up and down nonmonotonically.
students attribution.
Bayesian Occam's Razor
Explaining the force of inference: the Law of Conservation of Belief.
Finding structure in the world: Bayes occam's razor.
size principle:
a, b, c, ... subset examples learning curve: 1, 3, 10
number game.
multiples: 2, 3 shot learning
intervals: slow learning
(show figure of a hybrid model)
this is a special case of the likelihood principle.
what we saw with coin flipping, 0.5 vs. 0.95.
both size principle and likelihood principle derive from a more general notion: law of conservation of belief.
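A small worked instance of the size principle, as a Python sketch. It assumes strong sampling (examples drawn uniformly from the true concept), a uniform prior, and three hand-picked hypotheses over 1-100; the hypothesis set and the example numbers are illustrative only.

```python
from fractions import Fraction

# Hypothetical hypothesis space over the numbers 1..100.
HYPOTHESES = {
    "multiples of 10": set(range(10, 101, 10)),
    "even numbers": set(range(2, 101, 2)),
    "all numbers": set(range(1, 101)),
}

def posterior(examples):
    """Posterior over hypotheses under strong sampling and a uniform prior."""
    scores = {}
    for name, extension in HYPOTHESES.items():
        if all(x in extension for x in examples):
            # Each example has probability 1/|h|, so smaller consistent
            # hypotheses gain exponentially with the number of examples.
            scores[name] = Fraction(1, len(extension)) ** len(examples)
        else:
            scores[name] = Fraction(0)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

after_one = posterior([10])
after_three = posterior([10, 30, 60])
after_odd = posterior([10, 30, 61])
for name in HYPOTHESES:
    print(name, after_one[name], after_three[name])
```

Each hypothesis has one unit of probability to spread over its extension, so a hypothesis that spreads it over fewer numbers assigns more to each observed example. That is the law of conservation of belief at work, and it is why three examples are nearly enough here.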
cognitive examples: implicit negative evidence
size principle applied to concept and word learning
show word learning figure from tics
explain cancellation with violation of random sampling.
syntax acquisition
Now, for most cases of finding structure in the world, you need to have a more complex hypothesis space with different amounts of structure. you need to decide about hypothesis complexity, and balance complexity with fit to the data.
the ideas behind the size principle or the likelihood principle, the law of conservation of belief basically, can be extended to this case using models with multiple draws. -> bayes occam
model selection: choose between a fair coin and weighted coin
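The fair-versus-weighted coin comparison can be worked exactly. This Python sketch assumes a uniform prior on the weight of the weighted coin and equal prior probability for the two models; both choices, and the function names, are assumptions for illustration.

```python
from fractions import Fraction
from math import factorial

def marginal_fair(heads, tails):
    """Probability of one particular sequence under a fair coin."""
    return Fraction(1, 2) ** (heads + tails)

def marginal_weighted(heads, tails):
    """Probability of the same sequence with the weight integrated out.

    With a uniform prior on the weight w, the integral of
    w^heads * (1 - w)^tails over [0, 1] is heads! tails! / (heads + tails + 1)!.
    """
    return Fraction(factorial(heads) * factorial(tails), factorial(heads + tails + 1))

def posterior_fair(heads, tails, prior_fair=Fraction(1, 2)):
    fair = prior_fair * marginal_fair(heads, tails)
    weighted = (1 - prior_fair) * marginal_weighted(heads, tails)
    return fair / (fair + weighted)

balanced = posterior_fair(5, 5)
lopsided = posterior_fair(10, 0)
print(balanced, lopsided)
```

The weighted-coin model can fit any sequence, so it has to spread its belief over all of them; on balanced data the less flexible fair coin wins even though the weighted coin could match the data with a weight near 0.5. No explicit complexity penalty is needed.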
towards bayes occam's razor / minimum description length: simple 1-20 number game with 0, 1 or 2 free parameters.
scene inference
curve fitting.
Hierarchical Models
Add discussion of abstractness as 'distance from (perceptual) evidence' (hierarchical notion) and 'function with more variables' (lambda notion). It is often the case that lambda abstraction will imply hierarchical abstraction, since the lambda term will be applied to generate observed data. (Tree-like dependence can come in from either of these structures.)
Discuss two different factors that can drive the blessing of abstraction: smaller hypothesis spaces and pooling of statistical evidence.
Generate convincing learning curve plots for BoA in the marbles setting. (Use expected squared error from truth as measure of how much has been learned.)
Nonparametric models
In drawing a sample from the dp mixture, give an example of the high school lunch room. The way MH query actually works is by hypothetically taking a person out and seeing where they go.
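A rough Python sketch of the lunch room picture: people sit down one at a time under a Chinese restaurant process, and a reseating move takes one person out and lets them choose again. This uses the CRP prior alone (in a DP mixture each table's weight would also be multiplied by how well the person's data fits that table), and it is only in the spirit of the note, not a description of how mh-query is implemented. The concentration parameter and all names are assumptions.

```python
import random

def crp_choose_table(table_sizes, alpha, rng):
    """Pick a table in proportion to its size, or a new table with weight alpha.

    Returns len(table_sizes) when a new table is chosen.
    """
    weights = table_sizes + [alpha]
    return rng.choices(range(len(weights)), weights=weights)[0]

def sit_down(table_sizes, alpha, rng):
    """Seat one person and return the index of their table."""
    t = crp_choose_table(table_sizes, alpha, rng)
    if t == len(table_sizes):
        table_sizes.append(0)
    table_sizes[t] += 1
    return t

def sample_seating(n_people, alpha, rng):
    """Seat n_people sequentially; return assignments and table sizes."""
    table_sizes = []
    assignments = [sit_down(table_sizes, alpha, rng) for _ in range(n_people)]
    return assignments, table_sizes

def reseat(person, assignments, table_sizes, alpha, rng):
    """Take one person out of the room and let them sit down again.

    Emptied tables stay in the list with size 0, so they get weight 0.
    """
    table_sizes[assignments[person]] -= 1
    assignments[person] = sit_down(table_sizes, alpha, rng)

rng = random.Random(1)
assignments, table_sizes = sample_seating(50, alpha=1.0, rng=rng)
for _ in range(200):
    reseat(rng.randrange(50), assignments, table_sizes, alpha=1.0, rng=rng)
print(sorted((s for s in table_sizes if s > 0), reverse=True))
```

Because the CRP is exchangeable, any person can be treated as the last to arrive, which is what makes the take-one-out-and-reseat move valid.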
Recursive models
recursion follows from untyped lambda, though representational bias may be different.
talk about recursion, productivity, and parsimony. to generate an infinite set of (structurally) different representations it is necessary to use recursion. to generate a very large (combinatorial) set of representations it is not. however describing the large set of representations is more parsimonious under recursion -- the recursive representation will thus be preferred by bayes occam's razor.
Tecumseh Fitch, Hauser stuff?
Compositional concept learning
Give extensions of rational rules to inferring simple motor programs from "handwritten characters" on a 3 x 3 retina.
Nested query
vending machine example for goal and belief inference, and bob's box...
