Instrumental Conditioning VI:
There is more than one kind of learning
PSY/NEU338: Animal learning and decision making:
Psychological, computational and neural perspectives
outline
• what goes into instrumental associations?
• goal-directed versus habitual behavior
• neural dissociations between habitual and goal-directed behavior
• how does all this fit in with reinforcement
learning?
what is associated with what?
- Thorndike: S → R, stamped in by the reinforcer
- Skinner: what is the S?
- Tolman: S → R, mediated by a cognitive map
“The stimuli are not connected by just
simple one-to-one switches to the outgoing
responses. Rather, the incoming impulses are
usually worked over and elaborated in the
central control room into a tentative,
cognitive-like map of the environment. And it
is this tentative map, indicating routes and
paths and environmental relationships, which
finally determines what responses, if any, the
animal will finally release.”
another example: shortcuts
[figure: training and test mazes and results; Tolman et al (1946)]
summary so far...
- Even the humble rat can learn & internally represent spatial
structure, and use it to plan flexibly
- Tolman relates this to all of society
- Note that spatial tasks are really complicated & hard to control
- Next: search for modern versions of these effects
- Key question: is the S-R model ever relevant? And what is there beyond it? (especially important given what we know about RL)
the modern debate: S-R vs R-O
- S-R theory:
- parsimonious - same theory for Pavlovian conditioning (CS
associated with CR) and instrumental conditioning (stimulus
associated with response)
- but: the critical contingency in instrumental conditioning is
that of the response and the outcome…
- alternative: R-O theory (also called A-O)
- among proponents: Rescorla, Dickinson
- same spirit as Tolman (knows the ‘map’ of contingencies and desires, can put 2+2 together)
how would you test this? outcome devaluation
1 – Training
2 – Devaluation: pairing the outcome with illness, or a motivational shift (Hungry → Sated); controls: Non-devalued / Unshifted
3 – Test (in extinction)

Q1: why test without rewards?
Q2: what do you think will happen?
Q3: what would Tolman/Thorndike guess?
will animals work for food they don’t want?
devaluation: results from lesions I
- animals with lesions to DLS never develop habits despite extensive training
- also treatments depleting dopamine in DLS
- also lesions to infralimbic division of PFC (same corticostriatal loop)
[figure: overtrained rats, dorsolateral striatum lesion vs control (sham lesion)] Yin et al (2004)

devaluation: results from lesions II
- after habits have been formed, devaluation sensitivity can be reinstated by temporary inactivation of IL PFC
[figure: overtrained rats, IL PFC inactivation (muscimol) vs control] Coutureau & Killcross (2003)

devaluation: results from lesions III
- lesions of the pDMS cause animals to leverpress habitually even with only moderate training
Yin, Ostlund, Knowlton & Balleine (2005)

devaluation: results from lesions IV
- prelimbic (PL) PFC lesions cause animals to leverpress habitually even with only moderate training
- (also dorsomedial PFC and mediodorsal thalamus (same loop))
[figure: moderate training, control vs devalued] Killcross & Coutureau (2003)
devaluation: one more result
[figure: lever presses and magazine behavior (actions per minute) after moderate vs extensive training, with one or two outcomes] Killcross & Coutureau (2003)
- behavior is not always consistent:
- leverpressing is habitual and continues for unwanted food…
- ...at the same time, nosepoking is reduced (explanations?)

why are nosepokes always sensitive to devaluation?
- a 3rd system? Pavlovian behavior is directly sensitive to outcome value
- But: doesn’t make sense... the Pavlovian system has information that it is withholding from the instrumental system?
- Also... true for a purely instrumental chain
- And anyway, it seems that all the information is around all the time, so why is behavior not always goal-directed?
outline
- what goes into instrumental associations?
- goal-directed versus habitual behavior
- neural dissociations between habitual and goal-directed behavior
- how does all this fit in with reinforcement learning?

back to RL framework for decisions
need to know the long-term consequences of actions, Q(S,a), in order to choose the best one
how can these be learned?
[figure: state diagram — 3 states: “no food”, “food in mag”, “eating”; 2 actions: “press lever”, “poke nose”; immediate reward is 1 in state “eating” and 0 otherwise]
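The task on this slide can be written down as a toy MDP. This is only a minimal sketch: the state and action names and the reward rule come from the slide, while the exact transitions (e.g. what an extra lever press does once food is already in the magazine) are illustrative assumptions.

```python
# Toy MDP for the leverpress task (transitions beyond the slide are assumptions).
STATES = ["no food", "food in mag", "eating"]
ACTIONS = ["press lever", "poke nose"]

# next_state[(state, action)] -> resulting state (deterministic for simplicity)
next_state = {
    ("no food", "press lever"): "food in mag",      # pressing delivers food to the magazine
    ("no food", "poke nose"):   "no food",          # assumption: nothing there yet
    ("food in mag", "poke nose"):   "eating",       # nosepoking retrieves the food
    ("food in mag", "press lever"): "food in mag",  # assumption: extra presses do nothing
    ("eating", "press lever"): "no food",           # assumption: episode restarts
    ("eating", "poke nose"):   "no food",
}

def reward(state):
    """Immediate reward: 1 in the 'eating' state, 0 otherwise (as on the slide)."""
    return 1.0 if state == "eating" else 0.0
```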
strategy II: “model-free” RL
[figure: two-step decision tree with states S0, S1, S2 and actions L, R]
- Shortcut: store long-term values
- then simply retrieve them to choose action
- Can learn these from experience
- without building or searching a model
- incrementally through prediction errors
- dopamine dependent SARSA/Q-learning or Actor/Critic (sketched in code below)
Stored:
Q(S0,L) = 4   Q(S0,R) = 2
Q(S1,L) = 4   Q(S1,R) = 0
Q(S2,L) = 1   Q(S2,R) = 2
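A minimal sketch of the incremental, prediction-error-driven learning described here, building on the toy MDP above. The learning rate, discount, and exploration values are arbitrary placeholders, and this uses the Q-learning target (SARSA would use the next chosen action instead of the max).

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # placeholder learning rate, discount, exploration

# Q[(state, action)] -> cached long-term value, learned incrementally
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def choose(state):
    """Epsilon-greedy choice over the cached values: quick, reflexive (S-R)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_b Q(s',b)."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    delta = target - Q[(s, a)]           # prediction error (dopamine-like signal)
    Q[(s, a)] += ALPHA * delta

# Learn from repeated experience alone: sampled transitions, no model built or searched.
s = "no food"
for _ in range(2000):
    a = choose(s)
    s2 = next_state[(s, a)]
    q_update(s, a, reward(s2), s2)
    s = s2
```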
strategy II: “model-free” RL
[figure: same decision tree and stored Q values as above]
- choosing actions is easy, so behavior is quick, reflexive (S-R)
- but needs a lot of experience to learn
- and is inflexible: needs relearning to adapt to any change (habitual; see the devaluation sketch below)
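The inflexibility of the cache can be made concrete by contrasting it with a goal-directed forward search over the same toy model after outcome devaluation. Again a sketch under assumptions: the `plan` function, search depth, and the zero-reward "devalued" function are illustrative, and it presumes the Q values were trained before devaluation as in the loop above.

```python
def plan(state, reward_fn, depth=3):
    """Model-based (goal-directed) evaluation: search the transition model
    forward and score outcomes with the *current* reward function."""
    if depth == 0:
        return {a: 0.0 for a in ACTIONS}
    vals = {}
    for a in ACTIONS:
        s2 = next_state[(state, a)]
        vals[a] = reward_fn(s2) + max(plan(s2, reward_fn, depth - 1).values())
    return vals

# Outcome devaluation: the food outcome is suddenly worthless.
devalued = lambda s: 0.0

# The forward search adapts immediately (all action values drop to 0)...
print(plan("no food", devalued))
# ...but the cached Q values, learned before devaluation, still favour the
# lever until they are slowly relearned from new experience (habit).
print(max(ACTIONS, key=lambda a: Q[("no food", a)]))
```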
two big questions
- Why should the brain use two different strategies/controllers in parallel?
- If it uses two: how can it arbitrate between them when they disagree? (a new decision making problem…)
answers
- each system is best in different situations (use each one when it is most suitable/most accurate)
- goal-directed (forward search) - good with limited training, close to the reward (don’t have to search ahead too far)
- habitual (cache) - good after much experience, distance from reward not so important
- arbitration: trust the system that is more confident in its recommendation (a code sketch follows below)
- different sources of uncertainty in the two systems
- compare to: always choose the highest estimated action value
[figure: estimated action value, model-free vs model-based]
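The arbitration idea on this slide, trust the more confident controller, can be illustrated with a crude sketch. Everything here (the function name, the per-state uncertainty inputs, the tie-breaking rule) is an illustrative assumption rather than the actual model; it simply formalises "let the lower-uncertainty system choose".

```python
def arbitrate(state, mb_values, mf_values, mb_uncertainty, mf_uncertainty):
    """Toy arbitration: trust whichever controller is more confident (lower
    uncertainty) about this state's action values. How each uncertainty is
    computed is left open; roughly, the model-based system is more reliable
    early in training and close to reward, the model-free cache after
    extensive experience."""
    if mb_uncertainty[state] <= mf_uncertainty[state]:
        chosen = mb_values      # goal-directed (forward search) system wins
    else:
        chosen = mf_values      # habitual (cached) system wins
    # each values dict maps state -> {action: value}; pick the best action
    return max(chosen[state], key=chosen[state].get)
```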