Structure learning in human sequential decision-making

Daniel E Acuña et al. PLoS Comput Biol. 2010 Dec 2;6(12):e1001003. doi: 10.1371/journal.pcbi.1001003.

Abstract

Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that has perfect knowledge of the model of how rewards and events are generated in the environment. Rather than concluding that humans are suboptimal, we argue that the learning problem they face is more complex, in that it also involves learning the structure of reward generation in the environment. We formulate the problem of structure learning in sequential decision tasks using Bayesian reinforcement learning, and show that learning the generative model for rewards qualitatively changes the behavior of an optimal learning agent. To test whether people exhibit structure learning, we performed experiments involving a mixture of one-armed and two-armed bandit reward models, where structure learning produces many of the qualitative behaviors deemed suboptimal in previous studies. Our results demonstrate that humans can perform structure learning in a near-optimal manner.
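To make this setup concrete, here is a minimal sketch (in Python; an illustration, not the authors' implementation, with function names, uniform Beta(1,1) priors, and a 50/50 structure prior all assumed for the example) of Bayesian inference over reward structure for a two-option task. The options are either independent, as in a two-armed bandit, or coupled so that option 2 succeeds exactly when option 1 would fail, as in a one-armed bandit.

    # Bayesian structure inference for one task with two options.
    from math import lgamma, exp

    def log_beta(a, b):
        """Log of the Beta function B(a, b)."""
        return lgamma(a) + lgamma(b) - lgamma(a + b)

    def log_evidence(s, f, a0=1.0, b0=1.0):
        """Log marginal likelihood of s successes and f failures
        under a Bernoulli rate with a Beta(a0, b0) prior."""
        return log_beta(a0 + s, b0 + f) - log_beta(a0, b0)

    def coupling_posterior(s1, f1, s2, f2, prior_coupled=0.5):
        """Posterior probability that the two options are coupled
        (theta2 = 1 - theta1), given success/failure counts."""
        log_ind = log_evidence(s1, f1) + log_evidence(s2, f2)
        # Under coupling, a success on option 2 counts as a failure
        # for option 1's rate (and vice versa), so the counts pool.
        log_cpl = log_evidence(s1 + f2, f1 + s2)
        num = prior_coupled * exp(log_cpl)
        return num / (num + (1.0 - prior_coupled) * exp(log_ind))

    def expected_rewards(s1, f1, s2, f2, prior_coupled=0.5):
        """Model-averaged posterior-mean reward for each option."""
        p_cpl = coupling_posterior(s1, f1, s2, f2, prior_coupled)
        m1_ind = (s1 + 1) / (s1 + f1 + 2)   # Beta(1,1) posterior means
        m2_ind = (s2 + 1) / (s2 + f2 + 2)
        m1_cpl = (s1 + f2 + 1) / (s1 + f1 + s2 + f2 + 2)
        return (p_cpl * m1_cpl + (1 - p_cpl) * m1_ind,
                p_cpl * (1 - m1_cpl) + (1 - p_cpl) * m2_ind)

    # Option 1 mostly fails while option 2 succeeds, a pattern the
    # coupled structure explains well.
    print(coupling_posterior(1, 8, 3, 0))   # ~0.70
    print(expected_rewards(1, 8, 3, 0))

A full treatment would feed these model-averaged beliefs into the Bayesian reinforcement-learning machinery the paper describes (planning over future information gain), rather than acting on the posterior means alone.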


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Different structures in sequential decision-making.
A) General structure. The highlighted arcs denote: B) temporal dependency between success probabilities; C) an action-dependent reward state, leading to different optimality principles, from foraging to maximization; and D) reward coupling, affecting exploration vs. exploitation demands.
Figure 2. Graphical models of reward generation.
The agent faces a sequence of tasks, each comprising a random number of choices. A) Rewarding options are independent. B) Rewarding options are coupled within a task. C) Mixture of tasks: rewarding options may be independent or coupled, and the coupling node acts as an "XOR" switch between the coupled and independent structures.
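In symbols (chosen here for illustration, since the figure's own notation is not shown on this page), the switch in panel C amounts to Bayesian model comparison between the two structures. With s_i successes and f_i failures on option i, a Beta(a_0, b_0) prior on each success probability, and a prior weight pi on coupling:

    P(\text{coupled} \mid D) =
        \frac{\pi \, P(D \mid \text{coupled})}
             {\pi \, P(D \mid \text{coupled}) + (1 - \pi) \, P(D \mid \text{independent})}

    P(D \mid \text{independent}) = \prod_{i=1}^{2} \frac{B(a_0 + s_i,\, b_0 + f_i)}{B(a_0, b_0)},
    \qquad
    P(D \mid \text{coupled}) = \frac{B(a_0 + s_1 + f_2,\, b_0 + f_1 + s_2)}{B(a_0, b_0)}

where B is the Beta function. The pooled counts in the coupled evidence reflect that, under coupling, a failure on one option is evidence of success on the other.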
Figure 3. Learning simulation of the structure learning model.
Four tasks of 50 trials each are sequentially shown to the structure learning model, with fixed priors [formulas omitted]. Marginal beliefs about reward probabilities (brightness indicates relative probability mass), the probability of coupling, and the expected reward are shown as functions of time. A) Simulation in the independent environment. B) Simulation in the coupled environment.
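This protocol can be replayed with the sketch given after the abstract; the reward rates and uniform random-sampling policy below are assumptions made for illustration, not the paper's values.

    # Replay of the Figure 3 protocol: four tasks of 50 trials each,
    # sampling options uniformly at random and reporting the belief in
    # coupling at the end of each task. Reward rates are invented.
    # Reuses coupling_posterior() from the sketch above.
    import random

    def run_task(coupled_env, theta1=0.8, theta2=0.7, n_trials=50):
        s1 = f1 = s2 = f2 = 0
        for _ in range(n_trials):
            option = random.choice((1, 2))
            if option == 1:
                p = theta1
            else:
                p = (1.0 - theta1) if coupled_env else theta2
            r = int(random.random() < p)
            if option == 1:
                s1, f1 = s1 + r, f1 + (1 - r)
            else:
                s2, f2 = s2 + r, f2 + (1 - r)
        return coupling_posterior(s1, f1, s2, f2)

    for task in range(4):
        print("task", task + 1, "p(coupled) =", round(run_task(True), 3))

In a coupled environment the coupling belief should rise toward 1 within each task; in an independent one (run_task(False)) it should fall, as the marginal-belief panels of the figure illustrate.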
Figure 4. Effect of task uncertainty on the exploration–exploitation trade-off in the structure learning model.
The data available for the options [formulas omitted] and the discount factor of 0.98 are held fixed for the simulation, while the number of failures for option two is varied from 1 through 3. Under these conditions, the independent model would always choose option 1, whereas the coupled model would always choose option 2; the structure learning model, however, switches between the two. The graph shows the difference in value between options 2 and 1 as a function of task uncertainty.
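The switch can be reproduced qualitatively with the model-averaged posterior means from the earlier sketch, sweeping the belief in coupling directly. The counts below are invented (the caption's exact values appear only as formulas), and the paper's model uses discounted long-run values with discount factor 0.98, whereas this is the myopic case.

    # Invented counts where the independent model prefers option 1
    # (higher raw success rate) while the coupled model prefers option 2
    # (the pooled counts put theta1 below one half). The model-averaged
    # preference flips as the belief in coupling grows.

    def value_difference(s1, f1, s2, f2, p_cpl):
        """Myopic value of option 2 minus option 1 under model averaging."""
        m1_ind = (s1 + 1) / (s1 + f1 + 2)   # Beta(1,1) posterior means
        m2_ind = (s2 + 1) / (s2 + f2 + 2)
        m1_cpl = (s1 + f2 + 1) / (s1 + f1 + s2 + f2 + 2)
        v1 = p_cpl * m1_cpl + (1 - p_cpl) * m1_ind
        v2 = p_cpl * (1 - m1_cpl) + (1 - p_cpl) * m2_ind
        return v2 - v1

    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, round(value_difference(2, 0, 6, 2, p), 3))
    # Negative (prefer option 1) when coupling is unlikely; positive
    # once coupling is moderately believed.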
Figure 5. Full behavior on diagnostic trials as a function of evidence and confidence.
Diagnostic trials are those in which there is at least one disagreement between the models. For each such trial, we compute the evidence and confidence for each option. A cell in the graph indicates the empirical probability that the model (or the participants) picks the better option as a function of evidence and confidence. The right panels show the prediction rates of the different models on diagnostic trials; all pair-wise differences are significant [statistic omitted]. A) Trials in the independent environment. B) Trials in the coupled environment.
Figure 6. Better-arm selection ratio.
On the diagnostic trials: A) and C) the belief in coupling tracks changes in participant choices similarly to the learning model; B) and D) behavior as a function of structure belief is well correlated with the learning model, but not with the independent or coupled models.
Figure 7. Model comparison on different aspects of decision-making.
A) and B) Performance of the learning model and the coupled model on decisions not predicted by the independent model in the independent environment, separated into under-exploratory and over-exploratory trials. C) Prediction performance on trials where the independent and coupled models prefer one option while the learning model prefers the other; these are called task-learning trials.
