Abstract
In dynamic programming and reinforcement learning, the policy governing an agent's sequential decision-making in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense these into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often by taking extreme or nonsensical actions.
Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state–action–successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.
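As a purely illustrative sketch of the convex-combination idea behind the approach (not the paper's Algorithm 1; the vertex vectors, function names, and single-step setting below are hypothetical assumptions), an aspiration vector lying inside a simplex of achievable expected-total vectors can be written as a convex combination of the simplex vertices, and randomizing among the corresponding reference options with those weights makes the expected totals hit the aspiration exactly:

```python
import numpy as np

def mixing_weights(vertices: np.ndarray, aspiration: np.ndarray) -> np.ndarray:
    """Convex weights w (w >= 0, sum w = 1) with w @ vertices == aspiration.

    `vertices` has shape (d+1, d): one achievable expected-total vector
    (one simplex vertex) per row, for d evaluation metrics.
    """
    d = vertices.shape[1]
    # One equation forces the weights to sum to 1; the remaining d equations
    # force the mixture of vertices to reproduce the aspiration vector.
    A = np.vstack([np.ones(d + 1), vertices.T])
    b = np.concatenate(([1.0], aspiration))
    w = np.linalg.solve(A, b)
    if np.any(w < -1e-9):
        raise ValueError("aspiration lies outside the simplex")
    return np.clip(w, 0.0, None)

# Hypothetical two-metric example with three reference options.
vertices = np.array([[0.0, 0.0],
                     [4.0, 1.0],
                     [1.0, 3.0]])
aspiration = np.array([2.0, 1.5])   # desired expected totals
w = mixing_weights(vertices, aspiration)
print(w, w @ vertices)              # w @ vertices equals the aspiration
```

In the full method, such aspirations are then propagated forward through the MDP while being kept inside (approximations of) the feasibility sets, as described in the abstract.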
S. Fischer and J. Oliver—Independent Researcher
Notes
1. Nevertheless, our algorithm can also be straightforwardly adapted to learning.
2. However, Algorithm 1 remains correct even if the aspiration sets shrink to singletons.
3. According to [17], the probability that exactly \(k \ge d+1\) iterations are needed is \(f(k-1) - f(k)\), where \(f(k) = 2^{1-k} \sum_{\ell=0}^{d-1} \binom{k-1}{\ell}\). Hence \(\mathbb{E}[k] = \sum_{k=d+1}^{\infty} k\,(f(k-1) - f(k)) = (d+1)\,f(d) + \sum_{k=d+1}^{\infty} f(k)\). It is then reasonably simple to prove by induction that \(\sum_{k=d+1}^{\infty} f(k) = d\), and hence \(\mathbb{E}[k] = 2d+1\); a numerical sanity check is sketched after this list.
4. Note, however, that some safety-related action selection criteria, especially those based on information-theoretic concepts, require access to transition probabilities, which would then have to be learned in addition to the reference simplices.
5. Farsighted action selection criteria would require an additional learning pass to also learn the actual policy and the resulting action evaluations.
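As a quick numerical sanity check of note 3 (a sketch under our own truncation and naming assumptions, not code from the paper), the truncated series \(\sum_{k=d+1}^{k_{\max}} k\,(f(k-1)-f(k))\) can be compared against the closed form \(2d+1\):

```python
from math import comb

def f(k: int, d: int) -> float:
    # Wendel's formula [17]: probability that k symmetric random points all lie
    # in a common half-space, read here as "more than k iterations are needed".
    return 2 ** (1 - k) * sum(comb(k - 1, ell) for ell in range(d))

def expected_iterations(d: int, k_max: int = 400) -> float:
    # Truncation of E[k] = sum_{k >= d+1} k * (f(k-1, d) - f(k, d));
    # the terms decay geometrically, so k_max = 400 is ample for small d.
    return sum(k * (f(k - 1, d) - f(k, d)) for k in range(d + 1, k_max))

for d in range(1, 7):
    print(d, round(expected_iterations(d), 6), 2 * d + 1)   # agrees with 2d + 1
```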
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)
Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI, pp. 1641–1646. Pasadena CA (2009)
Chen, L., et al.: Decision transformer: reinforcement learning via sequence modeling (2021)
Clymer, J., et al.: Generalization analogies (GENIES): a testbed for generalizing AI oversight to hard-to-measure domains. arXiv preprint arXiv:2311.07723 (2023)
Conitzer, V., et al.: Social choice for AI alignment: dealing with diverse human feedback. arXiv preprint arXiv:2404.10271 (2024)
Dalrymple, D., et al.: Towards guaranteed safe AI: a framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624 (2024)
Feinberg, E.A., Sonin, I.: Notes on equivalent stationary policies in Markov decision processes with total rewards. Math. Meth. Oper. Res. 44(2), 205–221 (1996). https://doi.org/10.1007/BF01194331
Kern-Isberner, G., Spohn, W.: Inductive reasoning, conditionals, and belief dynamics. J. Appl. Log. 2631(1), 89 (2024)
Miryoosefi, S., Brantley, K., Daumé, H., Dudík, M., Schapire, R.E.: Reinforcement learning with convex constraints. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019)
Simon, H.A.: Rational choice and the structure of the environment. Psychol. Rev. 63(2), 129 (1956)
Skalse, J.M.V., Farrugia-Roberts, M., Russell, S., Abate, A., Gleave, A.: Invariance in policy optimisation and partial identifiability in reward learning. In: International Conference on Machine Learning, pp. 32033–32058. PMLR (2023)
Subramani, R., et al.: On the expressivity of objective-specification formalisms in reinforcement learning. arXiv preprint arXiv:2310.11840 (2023)
Taylor, J.: Quantilizers: a safer alternative to maximizers for limited optimization (2015). https://intelligence.org/files/QuantilizersSaferAlternative.pdf
Tschantz, A., et al.: Reinforcement learning through active inference (2020)
Vaidya, P.: Speeding-up linear programming using fast matrix multiplication. In: 30th Annual Symposium on Foundations of Computer Science, pp. 332–337 (1989)
Vamplew, P., Foale, C., Dazeley, R., Bignold, A.: Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Eng. Appl. Artif. Intell. 100, 104186 (2021)
Wendel, J.G.: A problem in geometric probability. Math. Scand. 11(1), 109–111 (1962)
Yen, I.E.H., Zhong, K., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Sparse linear programming via primal and dual augmented coordinate descent. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Acknowledgments
We thank the members of the SatisfIA project, AI Safety Camp, the Supervised Program for Alignment Research, and the organizers of the Virtual AI Safety Unconference.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare.
CRediT Author Statement
Authors are listed in alphabetical order and contributed equally. Simon Dima: Formal analysis, Writing - Original Draft, Writing - Review & Editing. Simon Fischer: Formal analysis, Writing - Original Draft, Writing - Review & Editing. Jobst Heitzig: Conceptualization, Methodology, Software, Writing - Original Draft, Writing - Review & Editing, Supervision. Joss Oliver: Formal analysis, Writing - Original Draft, Writing - Review & Editing.
Electronic supplementary material
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland
About this paper
Cite this paper
Dima, S., Fischer, S., Heitzig, J., Oliver, J. (2025). Non-maximizing Policies that Fulfill Multi-criterion Aspirations in Expectation. In: Freeman, R., Mattei, N. (eds) Algorithmic Decision Theory. ADT 2024. Lecture Notes in Computer Science, vol. 15248. Springer, Cham. https://doi.org/10.1007/978-3-031-73903-3_8
DOI: https://doi.org/10.1007/978-3-031-73903-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73902-6
Online ISBN: 978-3-031-73903-3
eBook Packages: Computer Science, Computer Science (R0)