Speed/accuracy trade-off between the habitual and the goal-directed processes

doi:10.1371/journal.pcbi.1002055

PLoS Comput Biol. 2011 May;7(5):e1002055.

doi: 10.1371/journal.pcbi.1002055. Epub 2011 May 26.

Mehdi Keramati¹, Amir Dezfouli, Payam Piray

PMID: 21637741
PMCID: PMC3102758
DOI: 10.1371/journal.pcbi.1002055

Mehdi Keramati et al. PLoS Comput Biol. 2011 May.

Affiliation

¹ School of Management and Economics, Sharif University of Technology, Tehran, Iran. mohammadmahdi.keramati@ens.fr

Abstract

Instrumental responses are hypothesized to be of two kinds: habitual and goal-directed, mediated by the sensorimotor and the associative cortico-basal ganglia circuits, respectively. The existence of the two heterogeneous associative learning mechanisms can be hypothesized to arise from the comparative advantages that they have at different stages of learning. In this paper, we assume that the goal-directed system is behaviourally flexible, but slow in choice selection. The habitual system, in contrast, is fast in responding, but inflexible in adapting its behavioural strategy to new conditions. Based on these assumptions and using the computational theory of reinforcement learning, we propose a normative model for arbitration between the two processes that makes an approximately optimal balance between search-time and accuracy in decision making. Behaviourally, the model can explain experimental evidence on behavioural sensitivity to outcome at the early stages of learning, but insensitivity at the later stages. It also explains that when two choices with equal incentive values are available concurrently, the behaviour remains outcome-sensitive, even after extensive training. Moreover, the model can explain choice reaction time variations during the course of learning, as well as the experimental observation that as the number of choices increases, the reaction time also increases. Neurobiologically, by assuming that phasic and tonic activities of midbrain dopamine neurons carry the reward prediction error and the average reward signals used by the model, respectively, the model predicts that whereas phasic dopamine indirectly affects behaviour through reinforcing stimulus-response associations, tonic dopamine can directly affect behaviour through manipulating the competition between the habitual and the goal-directed systems and thus, affect reaction time.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. An example illustrating the proposed arbitration mechanism between the two processes.**
(A) The agent is in a given state and three choices are available. The habitual system, as shown, holds an estimate of the value of each action in the form of a probability distribution function, based on its previous experience. These uncertain estimates are compared with one another to calculate, for each action, the expected gain of knowing its exact value. In this example, one action has the highest mean value according to the uncertain knowledge in the habitual system. However, the exact value of this action may well be less than the mean value of a second action, in which case the best strategy would be to choose the second action rather than the first. Thus, it is worth knowing the exact value of the first action (its expected gain is high). (B) The exact value of an action is assumed to be attainable if the goal-directed system performs a tree search in the decision tree. However, the benefit of the search must be higher than its cost. The benefit of deliberation for each action equals its expected-gain signal, whereas the cost of deliberation equals the total reward that could potentially be acquired during the deliberation time (estimated from the average of the rewards acquired over some past actions). Since, for the first action, the benefit of deliberation exceeds its cost, the goal-directed system is engaged in value estimation. (C) Finally, action selection is carried out based on the estimated values of the actions, which come from either the habitual system (for the two actions that were not deliberated) or the goal-directed system (for the deliberated action). For actions that are not deliberated, the mean of their distribution function is used for action selection.
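The arbitration rule described in the Figure 1 caption can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes that each habitual value estimate is Gaussian, computes the expected gain of knowing an action's exact value in closed form, and takes the cost of deliberation to be the average reward multiplied by the deliberation time. The function names (`expected_gain`, `arbitrate`) and all numerical values are hypothetical.

```python
import math


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_excess(mean, std, threshold):
    """E[max(X - threshold, 0)] for X ~ Normal(mean, std)."""
    if std == 0.0:
        return max(mean - threshold, 0.0)
    z = (mean - threshold) / std
    return (mean - threshold) * _cdf(z) + std * _pdf(z)


def expected_gain(means, stds):
    """Expected gain of learning each action's exact value (assumed Gaussian).

    For the action with the highest mean, knowledge pays off only if its
    exact value falls below the runner-up's mean. For any other action,
    knowledge pays off only if its exact value exceeds the best mean.
    """
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    best, second = order[0], order[1]
    gains = []
    for i in range(len(means)):
        if i == best:
            # E[max(mu_second - X, 0)], written via the mirrored variable -X
            gains.append(expected_excess(-means[i], stds[i], -means[second]))
        else:
            gains.append(expected_excess(means[i], stds[i], means[best]))
    return gains


def arbitrate(means, stds, average_reward, deliberation_time):
    """Return, per action, whether deliberation is worth its cost."""
    cost = average_reward * deliberation_time  # assumed form of the cost
    return [gain > cost for gain in expected_gain(means, stds)]


# Three actions: the first has the highest mean but is very uncertain.
means = [1.0, 0.8, 0.2]
stds = [0.5, 0.1, 0.1]
gains = expected_gain(means, stds)
deliberate = arbitrate(means, stds, average_reward=0.5, deliberation_time=0.1)
print(gains)
print(deliberate)
```

With these illustrative numbers only the first action is deliberated, mirroring the situation in the figure: the uncertain best-looking action is worth a tree search, while the other two are evaluated from their habitual means.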

**Figure 2. Formal representation of the devaluation experiment with one lever and one outcome, and behavioural results.**
(A) In the training phase, the animal is put in a Skinner box where pressing the lever followed by a nose-poke entry into the food magazine (enter-magazine) leads to the food reward. Other action sequences, such as entering the magazine before pressing the lever, result in no reward. As the task is assumed to be cyclic, the agent returns to the initial state after each sequence of responses. (B) In the second phase, the devaluation phase, the food outcome that was acquired during the training period is devalued by being paired with illness. (C) The animal's behaviour is then tested in the same Skinner box used for training, except that no outcome is delivered to the animal, in order to avoid changes in behaviour due to new reinforcement. (D) Behavioural results (adapted from ref. [22]) show that the rate of lever pressing decreases significantly after devaluation when pre-devaluation training was moderate. In contrast, it does not change significantly when the training period was extensive. Error bars represent the SEM (standard error of the mean).

**Figure 3. Simulation results of the model in the schedule depicted in Figure 2.**
The model is simulated under two scenarios: moderate training (left column) and extensive training (right column). In the moderate training scenario, the agent experiences the environment for 40 trials before the devaluation treatment, whereas in the extensive training scenario, 240 pre-devaluation training trials are provided. In sum, the figure shows that after extensive training, but not moderate training, the benefit-of-deliberation signal is below the cost of deliberation at the time of devaluation. Thus, behaviour in the second scenario, but not the first, does not change right after devaluation. The low value of the benefit signal at the time of devaluation in the second scenario arises because there is little overlap between the distribution functions of the values of the two available choices; the opposite is true for the first scenario. Numbers along the horizontal axes of the time-course plots represent trial numbers. Each “trial” ends when the simulated agent receives a reward; e.g. in the schedule of Figure 2, the trial number is counted up each time the agent completes the rewarded response sequence. Two of the plots show the distribution functions of the habitual system over its estimated Q-values, one trial before devaluation. The bar charts show the average probability of performing the instrumental response at 10 trials before (filled bars) and 10 trials after (empty bars) devaluation. All data reported are means over 3000 runs. The SEM for all bar charts is close to zero and is thus not illustrated.

**Figure 4. Tree representation of the devaluation experiment with two levers available concurrently.**
(A) In the training phase, pressing either lever one or lever two, if followed by entering the magazine, results in acquiring one unit of the reward associated with that lever. The reinforcing value of each of the two rewards is equal to one. Other action sequences lead to no reward. As in the task of Figure 2, this task is assumed to be cyclic. (B) In the devaluation phase, the outcome of one of the responses is devalued, whereas the rewarding value of the outcome of the other response remains unchanged. After the devaluation phase, the animal's behaviour is tested in extinction (for space considerations, this phase is not illustrated). As in the task of Figure 2, neither outcome is delivered to the animal in the test phase.

**Figure 5. Simulation results for the task of Figure 4.**
The results show that since the reinforcing values of the two outcomes are equal, there is a large overlap between the distribution functions over the Q-values of the two lever-press actions, even after extensive training (240 trials). Accordingly, the benefit-of-deliberation signals for these two actions remain higher than the cost-of-deliberation signal, and thus the goal-directed system is always engaged in value estimation for these two choices. The behaviourally observable result is that responding remains sensitive to revaluation of outcomes, even though devaluation occurs after a prolonged training period.
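A short worked case may help show why equal incentive values keep the benefit of deliberation high. As a simplifying assumption (not stated in the caption), suppose both value estimates are Gaussian with the same mean $\mu$, and the estimate under consideration has standard deviation $\sigma$. The expected gain of learning its exact value $X$ is then the expected amount by which $X$ falls below the competing mean:

$$
\mathbb{E}\big[\max(\mu - X,\, 0)\big]
= \int_{-\infty}^{\mu} (\mu - x)\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx
= \frac{\sigma}{\sqrt{2\pi}} \approx 0.399\,\sigma .
$$

Under this assumption the gain depends only on the remaining uncertainty $\sigma$, not on how far apart the means are, because they coincide. Deliberation therefore stays worthwhile for as long as $\sigma/\sqrt{2\pi}$ exceeds the cost of deliberation, whereas with well-separated means the same $\sigma$ would yield a far smaller gain.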

**Figure 6. Tree representation of the reversal learning task, and the behavioural results.**
(A) At the start of each trial, one of two stimuli is presented at random on a screen. The subject can then choose whether to touch the screen or not. The task is performed in three phases: training, reversal, and extinction. During the training phase, the subject receives a reward if the first stimulus is presented and one of the two actions is performed, or if the second stimulus is presented and the other action is selected. During the reversal phase, the reward function is reversed, so that each stimulus now requires the action that was previously rewarded for the other stimulus. Finally, during the extinction phase, only one of the two actions leads to a reward, regardless of the presented stimulus. (B) During both the training and reversal phases, subjects' reaction time is high at the early stages, when they do not yet have enough experience with the new conditions. After some trials, however, the reaction time declines significantly. Error bars represent the SEM.

**Figure 7. Simulation results of the model in the reversal learning task depicted in Figure 6.**
Since the benefit-of-deliberation signals are high at the early stages of learning, the goal-directed system is active and the deliberation time is therefore relatively high. After further training, the habitual system takes control of behaviour and, as a result, the model's reaction time decreases. After reversal, it takes some trials for the habitual system to register that the cached Q-values are no longer precise (equivalent to an increase in the variance of the Q-value estimates). Thus, some trials after reversal, the benefit signal increases again, which re-activates the goal-directed system. As a result, the model's reaction time increases again. A similar explanation holds for the rest of the trials. In sum, consistent with the experimental data, the reaction time is higher during the searching period than during the applying period.

**Figure 8. Tree representation of the task for testing Hick's law.**
In this example, on each trial, one of four stimuli is presented with equal probability. After the stimulus is observed, only one of the four available choices leads to a reward. The task structure is verbally explained to the subjects before they start performing the task. The interval between the appearance of the stimulus and the initiation of a response is measured as the “reaction time”. The experiment is performed with different numbers of stimulus-response pairs; e.g. some subjects perform the task when only one stimulus-response pair is available, whereas for other subjects the number of stimulus-response pairs may be different.

**Figure 9. Simulation results for the task of Figure 8.**
Consistent with the behavioural data, the results show that as the number of stimulus-response pairs increases, the reaction time also increases. Moreover, if extensive training is provided to the subjects, the reaction time decreases and becomes independent of the number of choices.
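The qualitative pattern in Figure 9 can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the goal-directed system evaluates deliberated actions one at a time, so that reaction time is a fixed base time plus a per-action search time for every action whose benefit of deliberation exceeds the cost. The function name `reaction_time` and all numbers are hypothetical and are not taken from the paper.

```python
def reaction_time(gains, cost, base_time=0.3, time_per_search=0.1):
    """Toy reaction time: base time plus one search per deliberated action.

    gains: benefit of deliberation for each available action (assumed given)
    cost:  cost of deliberation, compared against each gain
    """
    n_deliberated = sum(1 for gain in gains if gain > cost)
    return base_time + time_per_search * n_deliberated


cost = 0.05

# Early in training: every action is uncertain, so each gain exceeds the cost.
early = {n: reaction_time([0.2] * n, cost) for n in (1, 2, 4)}

# After extensive training: gains have dropped below the cost for all actions.
late = {n: reaction_time([0.001] * n, cost) for n in (1, 2, 4)}

print(early)
print(late)
```

Under these assumptions the early reaction time grows with the number of choices, while the late reaction time collapses to the base time regardless of how many choices are available.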

**Figure 10. An experiment for testing the validity of the model.**
The proposed model predicts that manipulating the knowledge acquired by the goal-directed system should not affect the goal-directedness of behaviour. To test this prediction, a place/response task can be used. (A) In the first phase, the animal is moderately trained to acquire a food reward in a T-maze. Since this training is moderate, the goal-directed system is expected to control behaviour during this phase. (B) In the second phase, the uncertainty of the goal-directed system is increased by placing the animal inside the right arm for a few trials, while the food reward is delivered at random or removed entirely. (C) Since the second phase has no effect on the habitual system, our model predicts that the arbitration between the systems remains intact, and thus responding should still be goal-directed in the third phase. That is, the animal should still choose to turn toward the window, even though its starting point is at the opposite end of the maze.

Cited by

Effects of subclinical depression on prefrontal-striatal model-based and model-free learning.
Heo S, Sung Y, Lee SW. PLoS Comput Biol. 2021 May 14;17(5):e1009003. doi: 10.1371/journal.pcbi.1009003. PMID: 33989284. Free PMC article.
Reduced model-based decision-making in gambling disorder.
Wyckmans F, Otto AR, Sebold M, Daw N, Bechara A, Saeremans M, Kornreich C, Chatard A, Jaafari N, Noël X. Sci Rep. 2019 Dec 23;9(1):19625. doi: 10.1038/s41598-019-56161-z. PMID: 31873133. Free PMC article. Clinical Trial.
Model-Based and Model-Free Replay Mechanisms for Reinforcement Learning in Neurorobotics.
Massi E, Barthélemy J, Mailly J, Dromnelle R, Canitrot J, Poniatowski E, Girard B, Khamassi M. Front Neurorobot. 2022 Jun 24;16:864380. doi: 10.3389/fnbot.2022.864380. PMID: 35812782. Free PMC article.
Modulating Visuomotor Sequence Learning by Repetitive Transcranial Magnetic Stimulation: What Do We Know So Far?
Szücs-Bencze L, Vékony T, Pesthy O, Szabó N, Kincses TZ, Turi Z, Nemeth D. J Intell. 2023 Oct 13;11(10):201. doi: 10.3390/jintelligence11100201. PMID: 37888433. Free PMC article. Review.
Action-value comparisons in the dorsolateral prefrontal cortex control choice between goal-directed actions.
Morris RW, Dezfouli A, Griffiths KR, Balleine BW. Nat Commun. 2014 Jul 23;5:4390. doi: 10.1038/ncomms5390. PMID: 25055179. Free PMC article.

References

1. Rangel A, Camerer C, Montague PR. A framework for studying the neurobiology of value-based decision making. Nat Rev Neurosci. 2008;9:545–556.
2. Dickinson A, Balleine BW. The role of learning in motivation. In: Gallistel CR, editor. Stevens' Handbook of Experimental Psychology: Learning, Motivation, and Emotion. Volume 3. 3rd edition. New York: Wiley; 2002. pp. 497–533.
3. Adams CD. Variations in the sensitivity of instrumental responding to reinforcer devaluation. Q J Exp Psychol. 1982;34:77–98.
4. Balleine BW, O'Doherty JP. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacol. 2010;35:48–69.
5. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8:1704–1711.
