Non-maximizing Policies that Fulfill Multi-criterion Aspirations in Expectation

  • Conference paper
  • In: Algorithmic Decision Theory (ADT 2024)

Abstract

In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions.

Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state–action–successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.
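
To make "fulfilling an aspiration in expectation" concrete, below is a minimal sketch (Python, using numpy and scipy) of one core step suggested by the abstract: expressing an aspiration vector as a convex combination of expected-total vectors achievable by the available actions, and using the resulting weights as action probabilities. This is not the authors' implementation; the function name, the feasibility-LP formulation, and the example numbers are illustrative only.

    import numpy as np
    from scipy.optimize import linprog

    def mix_to_meet_aspiration(achievable_points, aspiration):
        """Find action probabilities p >= 0 with sum(p) == 1 such that the
        probability-weighted average of the achievable expected-total vectors
        equals the aspiration vector exactly."""
        V = np.asarray(achievable_points, dtype=float)   # (num_actions, num_metrics)
        a = np.asarray(aspiration, dtype=float)          # (num_metrics,)
        n = V.shape[0]
        # Feasibility LP with a zero objective: the equality constraints encode
        # V.T @ p = a (match the aspiration in expectation) and sum(p) = 1.
        A_eq = np.vstack([V.T, np.ones((1, n))])
        b_eq = np.concatenate([a, [1.0]])
        res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0.0, 1.0)] * n)
        if not res.success:
            raise ValueError("aspiration lies outside the convex hull of the achievable points")
        return res.x

    # Two evaluation metrics, three candidate actions; the aspiration (1, 2) is met
    # in expectation by mixing the actions with probabilities (0.25, 0.5, 0.25).
    print(mix_to_meet_aspiration([(0.0, 0.0), (2.0, 2.0), (0.0, 4.0)], (1.0, 2.0)))

In the full setting described above, such a decomposition would also have to assign each successor state an aspiration lying inside its own feasibility set, so that the guarantee is preserved along the whole trajectory.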

S. Fischer and J. Oliver are independent researchers.


Notes

  1. Nevertheless, our algorithm can also be straightforwardly adapted to learning.

  2. However, Algorithm 1 remains correct even if the aspiration sets shrink to singletons.

  3. According to [17], the probability that we need exactly \(k \ge d+1\) iterations is \(f(k-1) - f(k)\), where \(f(k) = 2^{1-k} \sum_{\ell=0}^{d-1} \binom{k-1}{\ell}\). Hence \(\mathbb{E}k = \sum_{k=d+1}^{\infty} k\,\bigl(f(k-1) - f(k)\bigr) = (d+1)\, f(d) + \sum_{k=d+1}^{\infty} f(k)\). It is then reasonably simple to prove by induction that \(\sum_{k=d+1}^{\infty} f(k) = d\), and since \(f(d) = 1\), this gives \(\mathbb{E}k = 2d+1\); a short numerical check of this identity is sketched after these notes.

  4. Note, however, that some safety-related action selection criteria, especially those based on information-theoretic concepts, require access to transition probabilities, which would then have to be learned in addition to the reference simplices.

  5. Farsighted action selection criteria would require an additional learning pass to also learn the actual policy and the resulting action evaluations.
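
The following small numerical check of the identity quoted in note 3 is not from the paper; it simply truncates the infinite sum at an illustrative cutoff k_max and compares the result with 2d + 1.

    from math import comb

    def f(k, d):
        # f(k) = 2^(1-k) * sum_{l=0}^{d-1} C(k-1, l), as quoted from Wendel [17].
        return 2.0 ** (1 - k) * sum(comb(k - 1, l) for l in range(d))

    def expected_iterations(d, k_max=400):
        # E[k] = sum_{k = d+1}^{infinity} k * (f(k-1) - f(k)); f decays geometrically,
        # so truncating the sum at k_max leaves a negligible tail.
        return sum(k * (f(k - 1, d) - f(k, d)) for k in range(d + 1, k_max + 1))

    for d in range(1, 6):
        print(d, round(expected_iterations(d), 6), 2 * d + 1)   # both columns agree: 3, 5, 7, 9, 11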

References

  1. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)

  2. Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI, pp. 1641–1646. Pasadena CA (2009)

  3. Chen, L., et al.: Decision transformer: reinforcement learning via sequence modeling (2021)

  4. Clymer, J., et al.: Generalization analogies (GENIES): a testbed for generalizing AI oversight to hard-to-measure domains. arXiv preprint arXiv:2311.07723 (2023)

  5. Conitzer, V., et al.: Social choice for AI alignment: dealing with diverse human feedback. arXiv preprint arXiv:2404.10271 (2024)

  6. Dalrymple, D., et al.: Towards guaranteed safe AI: a framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624 (2024)

  7. Feinberg, E.A., Sonin, I.: Notes on equivalent stationary policies in Markov decision processes with total rewards. Math. Meth. Oper. Res. 44(2), 205–221 (1996). https://doi.org/10.1007/BF01194331

  8. Kern-Isberner, G., Spohn, W.: Inductive reasoning, conditionals, and belief dynamics. J. Appl. Log. 2631(1), 89 (2024)

  9. Miryoosefi, S., Brantley, K., Daumé, H., Dudík, M., Schapire, R.E.: Reinforcement learning with convex constraints. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019)

  10. Simon, H.A.: Rational choice and the structure of the environment. Psychol. Rev. 63(2), 129 (1956)

  11. Skalse, J.M.V., Farrugia-Roberts, M., Russell, S., Abate, A., Gleave, A.: Invariance in policy optimisation and partial identifiability in reward learning. In: International Conference on Machine Learning, pp. 32033–32058. PMLR (2023)

  12. Subramani, R., et al.: On the expressivity of objective-specification formalisms in reinforcement learning. arXiv preprint arXiv:2310.11840 (2023)

  13. Taylor, J.: Quantilizers: a safer alternative to maximizers for limited optimization (2015). https://intelligence.org/files/QuantilizersSaferAlternative.pdf

  14. Tschantz, A., et al.: Reinforcement learning through active inference (2020)

  15. Vaidya, P.: Speeding-up linear programming using fast matrix multiplication. In: 30th Annual Symposium on Foundations of Computer Science, pp. 332–337 (1989)

  16. Vamplew, P., Foale, C., Dazeley, R., Bignold, A.: Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Eng. Appl. Artif. Intell. 100, 104186 (2021)

  17. Wendel, J.G.: A problem in geometric probability. Math. Scand. 11(1), 109–111 (1962)

  18. Yen, I.E.H., Zhong, K., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Sparse linear programming via primal and dual augmented coordinate descent. In: Advances in Neural Information Processing Systems, vol. 28 (2015)

Acknowledgments

We thank the members of the SatisfIA project, AI Safety Camp, the Supervised Program for Alignment Research, and the organizers of the Virtual AI Safety Unconference.

Author information

Corresponding author

Correspondence to Joss Oliver.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare.

CRediT Author Statement

Authors are listed in alphabetical order and contributed equally. Simon Dima: Formal analysis, Writing - Original Draft, Writing - Review & Editing. Simon Fischer: Formal analysis, Writing - Original Draft, Writing - Review & Editing. Jobst Heitzig: Conceptualization, Methodology, Software, Writing - Original Draft, Writing - Review & Editing, Supervision. Joss Oliver: Formal analysis, Writing - Original Draft, Writing - Review & Editing.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 401 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland

About this paper

Cite this paper

Dima, S., Fischer, S., Heitzig, J., Oliver, J. (2025). Non-maximizing Policies that Fulfill Multi-criterion Aspirations in Expectation. In: Freeman, R., Mattei, N. (eds) Algorithmic Decision Theory. ADT 2024. Lecture Notes in Computer Science, vol 15248. Springer, Cham. https://doi.org/10.1007/978-3-031-73903-3_8

  • DOI: https://doi.org/10.1007/978-3-031-73903-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73902-6

  • Online ISBN: 978-3-031-73903-3

  • eBook Packages: Computer Science, Computer Science (R0)
