Randomized Controlled Trial
Neuron. 2010 May 27;66(4):585-95. doi: 10.1016/j.neuron.2010.04.016.

States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning


Jan Gläscher et al. Neuron. 2010.

Abstract

Reinforcement learning (RL) uses sequential experience with situations ("states") and outcomes to assess actions. Whereas model-free RL uses this experience directly, in the form of a reward prediction error (RPE), model-based RL uses it indirectly, building a model of the state transition and outcome structure of the environment, and evaluating actions by searching this model. A state prediction error (SPE) plays a central role, reporting discrepancies between the current model and the observed state transitions. Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task, we found the neural signature of an SPE in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized RPE in the ventral striatum. This finding supports the existence of two unique forms of learning signal in humans, which may form the basis of distinct computational strategies for guiding behavior.
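
In compact form, the two learning signals can be written as follows (a sketch in standard RL notation; the discount factor γ, learning rate η, and exact update forms are illustrative assumptions, not taken verbatim from the paper):

    RPE:  δ_R = r + γ·Q(s′,a′) − Q(s,a),     Q(s,a) ← Q(s,a) + η·δ_R
    SPE:  δ_S = 1 − T(s,a,s′),               T(s,a,s′) ← T(s,a,s′) + η·δ_S

The RPE updates cached action values directly from experienced rewards, whereas the SPE updates the transition model T that model-based evaluation subsequently searches.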


Figures

Figure 1
Task Design and Experimental Procedure. (A) The experimental task was a sequential two-choice Markov decision task in which all decision states are represented by fractal images. The task design follows that of a binary decision tree. Each trial begins in the same state. Subjects can choose between a left (L) or right (R) button press. With probabilities of 0.7 and 0.3 they reach one of two subsequent states, in which they can again choose between a left or right action. Finally, they reach one of three outcome states associated with different monetary rewards (0c, 10c, and 25c). (B) The experiment proceeded in two fMRI scanning sessions of 80 trials each. In the first session, subjects' choices were fixed and presented to them below the fractal image. However, subjects could still learn the transition probabilities. Between scanning sessions, subjects were presented with the reward schedule that maps the outcome states to the monetary payoffs. This mapping was rehearsed in a short choice task. Finally, in the second scanning session, subjects were free to choose left or right actions in each state. In addition, they also received the payoffs in the outcome states.
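
As an illustration of this task structure, the following minimal Python sketch simulates one trial of the two-stage task. The state labels, the assignment of actions to branches, and the mapping of branches to specific outcomes are hypothetical; only the 0.7/0.3 transition probabilities and the 0c/10c/25c payoffs come from the legend above.

    import random

    # Hypothetical labels: S0 = start state, S1/S2 = second-stage states,
    # O_* = outcome states. Which branch leads to which outcome is an
    # illustrative assumption, not the paper's actual assignment.
    TRANSITIONS = {
        ("S0", "L"): [("S1", 0.7), ("S2", 0.3)],
        ("S0", "R"): [("S2", 0.7), ("S1", 0.3)],
        ("S1", "L"): [("O_0c", 0.7), ("O_10c", 0.3)],
        ("S1", "R"): [("O_10c", 0.7), ("O_25c", 0.3)],
        ("S2", "L"): [("O_10c", 0.7), ("O_0c", 0.3)],
        ("S2", "R"): [("O_25c", 0.7), ("O_10c", 0.3)],
    }
    REWARDS = {"O_0c": 0, "O_10c": 10, "O_25c": 25}  # payoffs in cents

    def step(state, action):
        """Sample the next state and its payoff for a (state, action) pair."""
        next_states, probs = zip(*TRANSITIONS[(state, action)])
        next_state = random.choices(next_states, weights=probs)[0]
        return next_state, REWARDS.get(next_state, 0)

    def run_trial(policy=lambda s: random.choice("LR")):
        """One trial: two successive choices from the start state to an outcome."""
        state, payoff = "S0", 0
        for _ in range(2):
            state, reward = step(state, policy(state))
            payoff += reward
        return payoff
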
Figure 2
Theoretical model for data analysis. We used both a model-free SARSA learner and a model-based FORWARD learner to fit the behavioral data. SARSA computes a reward prediction error using cached values from previous trials to update state-action values. The FORWARD learner, on the other hand, learns a model of the state space T(s,a,s′) by means of a state prediction error, which is then used to update the state transition matrix. Action values are derived by maximizing over the expected value at each state. In session 2, a HYBRID learner computes a combined action value as an exponentially weighted sum of the action values from the SARSA and FORWARD learners. The combined action value is then submitted to softmax action selection (see Experimental Procedures for details).
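
The legend above can be made concrete with a short Python sketch of the three learners. This is a minimal sketch: the parameter values, state indexing, renormalization of T, and the exact form of the exponential weighting are assumptions; the fitted model is specified in the paper's Experimental Procedures.

    import numpy as np

    N_STATES, N_ACTIONS = 6, 2          # hypothetical indexing: 3 decision + 3 outcome states
    eta, gamma, beta = 0.2, 1.0, 1.0    # learning rate, discount, softmax temperature (assumed)

    Q_sarsa = np.zeros((N_STATES, N_ACTIONS))                      # model-free cached action values
    T = np.full((N_STATES, N_ACTIONS, N_STATES), 1.0 / N_STATES)   # model-based transition estimates
    R = np.zeros(N_STATES)                                         # rewards attached to outcome states

    def sarsa_update(s, a, r, s_next, a_next):
        """Model-free update: the reward prediction error moves the cached Q-value."""
        rpe = r + gamma * Q_sarsa[s_next, a_next] - Q_sarsa[s, a]
        Q_sarsa[s, a] += eta * rpe
        return rpe

    def forward_update(s, a, s_next):
        """Model-based update: the state prediction error moves the transition estimate."""
        spe = 1.0 - T[s, a, s_next]
        T[s, a, s_next] += eta * spe
        T[s, a] /= T[s, a].sum()        # keep T(s, a, ·) a probability distribution
        return spe

    def forward_values(s, depth=2):
        """FORWARD action values: search the learned model, maximizing expected value at each state."""
        if depth == 0:
            return np.zeros(N_ACTIONS)
        q = np.zeros(N_ACTIONS)
        for a in range(N_ACTIONS):
            for s_next in range(N_STATES):
                q[a] += T[s, a, s_next] * (R[s_next] + gamma * forward_values(s_next, depth - 1).max())
        return q

    def hybrid_choice_probs(s, w):
        """HYBRID: weight FORWARD against SARSA values (in the paper's model, w decays
        exponentially over trials), then pass the combined values through a softmax."""
        q = w * forward_values(s) + (1.0 - w) * Q_sarsa[s]
        expq = np.exp(beta * (q - q.max()))
        return expq / expq.sum()

In this reading of the legend, forward_update runs on every observed transition in both sessions, sarsa_update only once rewards are delivered in session 2, and the weight w gradually shifts control between the FORWARD and SARSA values.
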
Figure 3
Neural representations of state (SPE) and reward (RPE) prediction errors. The SPE is pooled across both scanning sessions, whereas the RPE is only available in the rewarded session 2. BOLD activation plots on the right are the average percent signal change (across subjects; error bars: s.e.m.) for those trials in which the prediction error (PE) is low, medium, or high (33rd, 66th, and 100th percentile PE range). Data are extracted using a leave-one-out cross-validation procedure from the nearest local maximum to the coordinates listed in Table 2 (circled areas; see Experimental Procedures for details). Red = SPE, green = RPE. (A) and (B) Significant effects for the SPE bilaterally in the intraparietal sulcus (ips) and lateral prefrontal cortex (lpfc). (C) Significant effects for the RPE in the ventral striatum (vstr). Color codes in the SPMs correspond to p < 0.001 and p < 0.0001 uncorrected.
Figure 4
Neural representations of the state prediction error in pIPS and latPFC, shown separately for both sessions. Data are extracted in the same way as in Figure 3 and plotted for low, medium, and high SPE (see Experimental Procedures for details). Color codes in the SPMs correspond to p < 0.001 and p < 0.0001 uncorrected.
Figure 5
Relationship between BOLD correlates of the state prediction error in right pIPS and bilateral latPFC in session 1 and the percentage of correct choices (3 bins) at the beginning of session 2. A “correct choice” was defined as choosing the action with the highest optimal Q-value in a particular state (see Supplementary Figure 4 for details on the optimal Q-values). Error bars are s.e.m. across subjects.

Comment in

  • Nature. 2010 Jul 29;466(7306):535
