PLoS Comput Biol. 2011 May;7(5):e1002055. doi: 10.1371/journal.pcbi.1002055. Epub 2011 May 26.

Speed/accuracy trade-off between the habitual and the goal-directed processes


Mehdi Keramati et al. PLoS Comput Biol. 2011 May.

Abstract

Instrumental responses are hypothesized to be of two kinds: habitual and goal-directed, mediated by the sensorimotor and the associative cortico-basal ganglia circuits, respectively. The existence of the two heterogeneous associative learning mechanisms can be hypothesized to arise from the comparative advantages that they have at different stages of learning. In this paper, we assume that the goal-directed system is behaviourally flexible, but slow in choice selection. The habitual system, in contrast, is fast in responding, but inflexible in adapting its behavioural strategy to new conditions. Based on these assumptions and using the computational theory of reinforcement learning, we propose a normative model for arbitration between the two processes that strikes an approximately optimal balance between search-time and accuracy in decision making. Behaviourally, the model can explain experimental evidence on behavioural sensitivity to outcome at the early stages of learning, but insensitivity at the later stages. It also explains why, when two choices with equal incentive values are available concurrently, the behaviour remains outcome-sensitive even after extensive training. Moreover, the model can explain choice reaction time variations during the course of learning, as well as the experimental observation that as the number of choices increases, the reaction time also increases. Neurobiologically, by assuming that phasic and tonic activities of midbrain dopamine neurons carry the reward prediction error and the average reward signals used by the model, respectively, the model predicts that whereas phasic dopamine indirectly affects behaviour through reinforcing stimulus-response associations, tonic dopamine can directly affect behaviour through manipulating the competition between the habitual and the goal-directed systems and thus affect reaction time.
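A minimal sketch of the arbitration idea described above, written in Python. This is not the authors' implementation: the function names, the way the goal-directed value is obtained, and all of the numbers are illustrative assumptions; only the decision rule (deliberate on an action exactly when the expected benefit of knowing its true value exceeds the reward forgone during deliberation) follows the abstract.

    # Sketch of the speed/accuracy arbitration between the two systems.
    # Illustrative only; names and numbers are assumptions, not the paper's code.

    def choose_action(actions, habit_mean, habit_benefit, tree_search_value,
                      avg_reward_rate, deliberation_time):
        """Pick an action, deliberating only where deliberation pays off.

        habit_mean[a]        -- cached (habitual) value estimate of action a
        habit_benefit[a]     -- expected gain of knowing a's exact value
        tree_search_value(a) -- slow but flexible goal-directed evaluation of a
        avg_reward_rate      -- reward per unit time forgone while deliberating
        deliberation_time    -- time one goal-directed evaluation takes
        """
        cost_of_deliberation = avg_reward_rate * deliberation_time
        values = {}
        for a in actions:
            if habit_benefit[a] > cost_of_deliberation:
                values[a] = tree_search_value(a)   # flexible but slow
            else:
                values[a] = habit_mean[a]          # fast but possibly out of date
        return max(values, key=values.get)

    # Toy usage: only action "a1" is uncertain enough to be worth deliberating on,
    # and deliberation reveals that its outcome has lost value.
    actions = ["a1", "a2", "a3"]
    habit_mean = {"a1": 1.0, "a2": 0.9, "a3": 0.2}
    habit_benefit = {"a1": 0.3, "a2": 0.05, "a3": 0.01}
    true_values = {"a1": 0.7, "a2": 0.9, "a3": 0.2}
    print(choose_action(actions, habit_mean, habit_benefit, true_values.get,
                        avg_reward_rate=0.5, deliberation_time=0.2))   # -> a2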


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. An example illustrating the proposed arbitration mechanism between the two processes.
(A) The agent is at state s, where three choices are available: a1, a2 and a3. The habitual system, as shown, holds an estimate of the value of each action in the form of a probability distribution, based on its previous experiences. These uncertain value estimates are compared with one another in order to compute the expected gain of knowing the exact value of each action, the value of perfect information (VPI). In this example, action a1 has the highest mean value according to the habitual system's uncertain knowledge. However, the exact value of a1 may turn out to be lower than the mean value of action a2, in which case the best strategy would be to choose a2 rather than a1. It is therefore worth knowing the exact value of a1, so its VPI is high. (B) The exact value of an action is assumed to be attainable if the goal-directed system performs a search through the decision tree. However, the benefit of the search must exceed its cost. The benefit of deliberating on an action equals its VPI signal, whereas the cost of deliberation equals R̄·τ, the total reward that could otherwise be acquired during the deliberation time τ (R̄ is the average of the rewards acquired over some past actions). Since the benefit of deliberation exceeds its cost only for action a1, the goal-directed system is engaged in estimating the value of that action. (C) Finally, action selection is carried out on the basis of the resulting value estimates, which come either from the habitual system (for actions a2 and a3) or from the goal-directed system (for action a1). For actions that are not deliberated on, the mean of their value distribution is used for action selection.
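The "expected gain of having the exact value of each action" in panel (A) can be made concrete under the assumption that the habitual system's value estimates are Gaussian. The sketch below uses the standard closed-form value-of-perfect-information (VPI) calculation from Bayesian Q-learning; the Gaussian assumption, the helper names, and the example numbers are illustrative rather than taken from the figure.

    # Sketch: VPI of each action when the habitual value estimates are Gaussian.
    # Assumes the closed-form VPI of Bayesian Q-learning; numbers are illustrative.
    from math import erf, exp, pi, sqrt

    def _pdf(x):                      # standard normal density
        return exp(-0.5 * x * x) / sqrt(2.0 * pi)

    def _cdf(x):                      # standard normal cumulative distribution
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def expected_excess(mu, sigma, c):
        """E[max(X - c, 0)] for X ~ N(mu, sigma^2)."""
        z = (mu - c) / sigma
        return (mu - c) * _cdf(z) + sigma * _pdf(z)

    def expected_shortfall(mu, sigma, c):
        """E[max(c - X, 0)] for X ~ N(mu, sigma^2)."""
        z = (c - mu) / sigma
        return (c - mu) * _cdf(z) + sigma * _pdf(z)

    def vpi(means, sds):
        """Expected gain of learning each action's exact value."""
        ranked = sorted(means, key=means.get, reverse=True)
        best, second = ranked[0], ranked[1]
        gains = {}
        for a in means:
            if a == best:
                # Useful if the apparent best action turns out worse than the runner-up.
                gains[a] = expected_shortfall(means[a], sds[a], means[second])
            else:
                # Useful if this action turns out better than the apparent best.
                gains[a] = expected_excess(means[a], sds[a], means[best])
        return gains

    # Toy numbers echoing panel (A): a1 looks best but is uncertain, so its VPI
    # is large and only a1 is handed to the goal-directed system.
    print(vpi(means={"a1": 1.0, "a2": 0.9, "a3": 0.2},
              sds={"a1": 0.5, "a2": 0.1, "a3": 0.1}))

With these numbers the expected gain is roughly 0.15 for a1 and close to zero for a2 and a3, matching the caption's description that only a1 is worth deliberating on.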
Figure 2
Figure 2. Formal representation of the devaluation experiment with one lever and one outcome, and behavioural results.
(A) In the training phase, the animal is placed in a Skinner box in which pressing the lever followed by a nose-poke entry into the food magazine (enter-magazine) leads to the food reward. Other action sequences, such as entering the magazine before pressing the lever, result in no reward. As the task is assumed to be cyclic, the agent returns to the initial state after completing each sequence of responses. (B) In the second phase, the devaluation phase, the food outcome that was earned during training is devalued by being paired with illness. (C) The animal's behaviour is then tested in the same Skinner box used for training, with the difference that no outcome is delivered to the animal any more, in order to avoid changes in behaviour due to new reinforcement. (D) Behavioural results (adapted from ref [22]) show that the rate of lever pressing decreases significantly after devaluation when pre-devaluation training was moderate. In contrast, it does not change significantly when training was extensive. Error bars represent the standard error of the mean (SEM).
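For readers who want to simulate the schedule in panel (A), the cyclic task can be written as a tiny Markov decision process. The state names, action names, and reward magnitude below are illustrative stand-ins for the symbols in the figure, which are not reproduced here.

    # Sketch of the one-lever task of Figure 2A as a small cyclic MDP.
    # State/action labels and the reward magnitude are illustrative assumptions.

    # (state, action) -> (next_state, reward)
    TRANSITIONS = {
        ("S0", "press_lever"):    ("S1", 0.0),  # lever pressed, food not yet collected
        ("S0", "enter_magazine"): ("S0", 0.0),  # magazine entry before pressing: no reward
        ("S1", "enter_magazine"): ("S0", 1.0),  # collect the food, cycle back to the start
        ("S1", "press_lever"):    ("S1", 0.0),  # extra presses earn nothing
    }

    def step(state, action, food_value=1.0):
        """Return (next_state, reward); set food_value to 0 to model the test in extinction."""
        next_state, base_reward = TRANSITIONS[(state, action)]
        return next_state, base_reward * food_value

    # One rewarded cycle: press the lever, then enter the magazine.
    s = "S0"
    s, r1 = step(s, "press_lever")
    s, r2 = step(s, "enter_magazine")
    print(s, r1 + r2)   # back at S0 with one unit of reward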
Figure 3
Figure 3. Simulation results of the model in the schedule depicted in Figure 2.
The model is simulated under two scenarios: moderate training (left column) and extensive training (right column). In the moderate-training scenario, the agent experiences the environment for 40 trials before the devaluation treatment, whereas in the extensive-training scenario, 240 pre-devaluation training trials are provided. In sum, the figure shows that after extensive, but not moderate, training, the VPI signal has fallen below the cost of deliberation, R̄·τ, by the time of devaluation (compare the VPI traces with the R̄·τ traces). Thus, behaviour in the second scenario, but not the first, does not change immediately after devaluation (compare the pre- and post-devaluation response probabilities). The low value of the VPI signal at the time of devaluation in the second scenario arises because there is little overlap between the distribution functions over the values of the two available choices; the opposite holds in the first scenario (see the plotted value distributions). Numbers along the horizontal axes represent trial numbers. Each "trial" ends when the simulated agent receives a reward; e.g., in the schedule of Figure 2, the trial counter is incremented each time the agent collects the food by entering the magazine after pressing the lever. The distribution plots show the habitual system's distribution functions over its estimated Q-values, one trial before devaluation. The bar charts show the average probability of pressing the lever over the 10 trials before (filled bars) and the 10 trials after (empty bars) devaluation. All data reported are means over 3000 runs. The SEM for all bar charts is close to zero and is therefore not shown.
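The mechanism behind this result, namely that value uncertainty (and with it the benefit of deliberation) shrinks as training accumulates, can be illustrated with a simple Gaussian update of a cached value estimate. The update rule, the use of the posterior standard deviation as a rough proxy for the VPI, and all of the numbers are illustrative assumptions rather than the paper's actual learning rule.

    # Sketch: why extensive training pushes the benefit of deliberation below its cost.
    # A cached value estimate is treated as Gaussian, with variance shrinking as
    # rewarded experiences accumulate; rule and numbers are illustrative.
    from math import sqrt

    def posterior_sd(prior_sd, obs_noise_sd, n_trials):
        """Standard deviation of a Gaussian value estimate after n_trials noisy observations."""
        prior_precision = 1.0 / prior_sd ** 2
        data_precision = n_trials / obs_noise_sd ** 2
        return sqrt(1.0 / (prior_precision + data_precision))

    cost_of_deliberation = 0.1       # stand-in for (average reward rate) x (search time)
    for n_trials in (40, 240):       # moderate vs. extensive pre-devaluation training
        sd = posterior_sd(prior_sd=1.0, obs_noise_sd=1.0, n_trials=n_trials)
        # Crude proxy: the gain from resolving the remaining uncertainty is at most
        # of the order of the standard deviation itself.
        mode = "goal-directed (outcome-sensitive)" if sd > cost_of_deliberation else "habitual (outcome-insensitive)"
        print(n_trials, "trials: residual sd = %.3f ->" % sd, mode)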
Figure 4
Figure 4. Tree representation of the devaluation experiment with two levers available concurrently.
(A) In the training phase, pressing either lever one or lever two, if followed by entering the magazine, results in one unit of the corresponding outcome. The reinforcing value of the two rewards is equal to one. Other action sequences lead to no reward. As in the task of Figure 2, this task is also assumed to be cyclic. (B) In the devaluation phase, the outcome of one of the two responses is devalued, whereas the rewarding value of the outcome of the other response remains unchanged. After the devaluation phase, the animal's behaviour is tested in extinction (for space considerations, this phase is not illustrated). As in the task of Figure 2, neither outcome is delivered to the animal in the test phase.
Figure 5
Figure 5. Simulation results for the task of Figure 4.
The results show that, because the reinforcing values of the two outcomes are equal, there is a large overlap between the distribution functions over the Q-values of the two lever-press actions at the initial state, even after extensive training (240 trials). Accordingly, the VPI signals (the benefit of goal-directed deliberation) for these two actions remain higher than the R̄·τ signal (the cost of deliberation), and the goal-directed system therefore stays engaged in value estimation for these two choices. The behaviourally observable result is that responding remains sensitive to revaluation of the outcomes, even though devaluation occurs after a prolonged training period.
Figure 6
Figure 6. Tree representation of the reversal learning task used in a previous study, and the behavioural results.
(A) At the beginning of each trial, one of two stimuli (call them S1 and S2) is presented at random on a screen. The subject can then choose whether to touch the screen (the go action) or not (the no-go action). The task is performed in three phases: training, reversal, and extinction. During the training phase, the subject receives a reward if stimulus S1 is presented and the go action is performed, or if stimulus S2 is presented and the no-go action is selected. During the reversal phase, the reward function is reversed: the no-go action must be chosen when stimulus S1 is presented, and vice versa. Finally, during the extinction phase, only one of the two actions leads to a reward, regardless of which stimulus is presented. (B) During both the training and reversal phases, subjects' reaction times are long at the early stages, when they do not yet have enough experience with the new contingencies. After some trials, however, reaction times decline significantly. Error bars represent the SEM.
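For concreteness, the training and reversal contingencies in panel (A) can be written as a small reward table. The labels S1/S2 and go/no-go are the illustrative stand-ins used in the caption above, and the extinction phase is omitted because the caption does not specify which of the two actions it rewards.

    # Sketch of the training and reversal contingencies of Figure 6A.
    # Stimulus/action labels are illustrative stand-ins for the figure's symbols.
    import random

    REWARD = {
        "training": {("S1", "go"): 1, ("S2", "no_go"): 1},
        "reversal": {("S1", "no_go"): 1, ("S2", "go"): 1},
    }

    def trial(phase, policy, rng=random):
        """Present a random stimulus, query the policy, return (stimulus, action, reward)."""
        stimulus = rng.choice(["S1", "S2"])
        action = policy(stimulus)
        return stimulus, action, REWARD[phase].get((stimulus, action), 0)

    # A policy that is correct during training earns nothing after reversal.
    training_policy = lambda s: "go" if s == "S1" else "no_go"
    print(trial("training", training_policy))   # reward 1 on every trial
    print(trial("reversal", training_policy))   # reward 0 on every trial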
Figure 7
Figure 7. Simulation results of the model in the reversal learning task depicted in Figure 6.
Because the VPI signals are high at the early stages of learning, the goal-directed system is active and the deliberation time is therefore relatively long. After further training, the habitual system takes control over behaviour and, as a result, the model's reaction time decreases. After reversal, it takes some trials for the habitual system to discover that its cached Q-values are no longer accurate (equivalent to an increase in the variance of the cached value distributions). Thus, some trials after reversal, the VPI signal rises again, which re-activates the goal-directed system and, as a result, increases the model's reaction time again. A similar explanation holds for the remaining trials. In sum, consistent with the experimental data, the reaction time is higher during the searching period than during the applying period.
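A simple way to read a reaction time out of the arbitration, consistent with the description above but not taken from the paper's equations, is to charge the deliberation time only for actions whose VPI exceeds the cost of deliberation; the motor-latency constant and all numbers are illustrative.

    # Sketch: reaction time depends on how many candidate actions are deliberated on.
    # Constants are illustrative assumptions.

    def reaction_time(vpi_signals, avg_reward_rate, search_time, motor_latency=0.25):
        """Deliberation time is paid once for every action whose VPI exceeds its cost."""
        cost = avg_reward_rate * search_time
        n_deliberated = sum(1 for v in vpi_signals if v > cost)
        return motor_latency + n_deliberated * search_time

    # Early in learning (high VPI) the model is slow; once values are cached it is fast.
    print(reaction_time([0.4, 0.3], avg_reward_rate=0.5, search_time=0.25))   # 0.75
    print(reaction_time([0.02, 0.01], avg_reward_rate=0.5, search_time=0.25)) # 0.25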
Figure 8
Figure 8. Tree representation of the task used for testing Hick's law.
In this example, on each trial one of four stimuli is presented with equal probability. After the stimulus is observed, only one of the four available choices leads to a reward. The task structure is verbally explained to the subjects before they start performing the task. The interval between the appearance of the stimulus and the initiation of a response is measured as the "reaction time". The experiment is performed with different numbers of stimulus-response pairs; e.g., some subjects perform the task with only one stimulus-response pair available, whereas for other subjects the number of stimulus-response pairs is larger.
Figure 9
Figure 9. Simulation results for the task of Figure 8.
Consistent with the behavioural data, the results show that as the number of stimulus-response pairs increases, the reaction time also increases. Moreover, if extensive training is provided to the subjects, the reaction time decreases and becomes independent of the number of choices.
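One way to see this result under the model's logic: with moderate training, every candidate response still carries enough value uncertainty to be worth deliberating on, so the time cost of deliberation scales with the number of choices; after extensive training none of them does, and reaction time flattens. The back-of-the-envelope sketch below illustrates that reasoning; all constants, and the use of the residual standard deviation as a proxy for the VPI, are illustrative assumptions rather than the paper's simulation.

    # Sketch: more stimulus-response pairs -> more responses worth deliberating on
    # early in training -> longer reaction times; after extensive training none is
    # worth deliberating on, so reaction time no longer depends on the set size.
    from math import sqrt

    def mean_rt(n_choices, trials_per_pair, obs_noise_sd=1.0,
                deliberation_cost=0.1, motor_latency=0.25, search_time=0.1):
        residual_sd = obs_noise_sd / sqrt(trials_per_pair)   # proxy for each choice's VPI
        n_deliberated = n_choices if residual_sd > deliberation_cost else 0
        return motor_latency + n_deliberated * search_time

    for n in (1, 2, 4, 8):
        moderate = mean_rt(n, trials_per_pair=20)     # moderate training
        extensive = mean_rt(n, trials_per_pair=2000)  # extensive training
        print(f"{n} choices: moderate-training RT {moderate:.2f} | extensive-training RT {extensive:.2f}")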
Figure 10
Figure 10. An experiment for testing the validity of the model.
The proposed model predicts that manipulating the knowledge acquired by the goal-directed system should not affect the goal-directedness of behaviour. To test this prediction, a place/response task can be used. (A) In the first phase, the animal is moderately trained to obtain a food reward in a T-maze. Since this training is moderate, the goal-directed system is expected to control behaviour during this phase. (B) In the second phase, the uncertainty of the goal-directed system is increased by placing the animal inside the right arm for a few trials, while the food reward is delivered at random or is removed entirely. (C) Since the second phase has no effect on the habitual system, our model predicts that the arbitration between the two systems remains intact and that responding should therefore still be goal-directed in the third phase. Accordingly, the animal should still choose to turn toward the window, even though its starting point is at the opposite end of the maze.

References

    1. Rangel A, Camerer C, Montague PR. A framework for studying the neurobiology of value-based decision making. Nat Rev Neurosci. 2008;9:545–556.
    2. Dickinson A, Balleine BW. The role of learning in motivation. In: Gallistel CR, editor. Steven's Handbook of Experimental Psychology, Volume 3: Learning, Motivation, and Emotion. 3rd edition. New York: Wiley; 2002. pp. 497–533.
    3. Adams CD. Variations in the sensitivity of instrumental responding to reinforcer devaluation. Q J Exp Psychol. 1982;34:77–98.
    4. Balleine BW, O'Doherty JP. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology. 2010;35:48–69.
    5. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8:1704–1711.
