Abstract
In dynamic programming and reinforcement learning, the policy governing an agent's sequential decision-making in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense these into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often by taking extreme or nonsensical actions.
Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state–action–successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.
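As a purely illustrative sketch of the convex-combination idea behind the approach (not the paper's Algorithm 1; the vertex vectors, function names, and single-step setting below are hypothetical assumptions), an aspiration vector lying inside a simplex of achievable expected-total vectors can be written as a convex combination of the simplex vertices, and randomizing among the corresponding reference options with those weights makes the expected totals hit the aspiration exactly:

```python
import numpy as np

def mixing_weights(vertices: np.ndarray, aspiration: np.ndarray) -> np.ndarray:
    """Convex weights w (w >= 0, sum w = 1) with w @ vertices == aspiration.

    `vertices` has shape (d+1, d): one achievable expected-total vector
    (one simplex vertex) per row, for d evaluation metrics.
    """
    d = vertices.shape[1]
    # One equation forces the weights to sum to 1; the remaining d equations
    # force the mixture of vertices to reproduce the aspiration vector.
    A = np.vstack([np.ones(d + 1), vertices.T])
    b = np.concatenate(([1.0], aspiration))
    w = np.linalg.solve(A, b)
    if np.any(w < -1e-9):
        raise ValueError("aspiration lies outside the simplex")
    return np.clip(w, 0.0, None)

# Hypothetical two-metric example with three reference options.
vertices = np.array([[0.0, 0.0],
                     [4.0, 1.0],
                     [1.0, 3.0]])
aspiration = np.array([2.0, 1.5])   # desired expected totals
w = mixing_weights(vertices, aspiration)
print(w, w @ vertices)              # w @ vertices equals the aspiration
```

In the full method, such aspirations are then propagated forward through the MDP while being kept inside (approximations of) the feasibility sets, as described in the abstract.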
S. Fischer and J. Oliver—Independent Researcher
Notes
1. Nevertheless, our algorithm can also be straightforwardly adapted to learning.
2. However, Algorithm 1 remains correct even if the aspiration sets shrink to singletons.
3. According to [17], the probability that exactly \(k \ge d+1\) iterations are needed is \(f(k-1) - f(k)\), where \(f(k) = 2^{1-k} \sum_{\ell=0}^{d-1} \binom{k-1}{\ell}\). Hence \(\mathbb{E}[k] = \sum_{k=d+1}^{\infty} k\,(f(k-1) - f(k)) = (d+1)\,f(d) + \sum_{k=d+1}^{\infty} f(k)\). It is then reasonably simple to prove by induction that \(\sum_{k=d+1}^{\infty} f(k) = d\), and hence \(\mathbb{E}[k] = 2d+1\); a numerical sanity check is sketched after this list.
4. Note, however, that some safety-related action selection criteria, especially those based on information-theoretic concepts, require access to transition probabilities, which would then have to be learned in addition to the reference simplices.
5. Farsighted action selection criteria would require an additional learning pass to also learn the actual policy and the resulting action evaluations.
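As a quick numerical sanity check of note 3 (a sketch under our own truncation and naming assumptions, not code from the paper), the truncated series \(\sum_{k=d+1}^{k_{\max}} k\,(f(k-1)-f(k))\) can be compared against the closed form \(2d+1\):

```python
from math import comb

def f(k: int, d: int) -> float:
    # Wendel's formula [17]: probability that k symmetric random points all lie
    # in a common half-space, read here as "more than k iterations are needed".
    return 2 ** (1 - k) * sum(comb(k - 1, ell) for ell in range(d))

def expected_iterations(d: int, k_max: int = 400) -> float:
    # Truncation of E[k] = sum_{k >= d+1} k * (f(k-1, d) - f(k, d));
    # the terms decay geometrically, so k_max = 400 is ample for small d.
    return sum(k * (f(k - 1, d) - f(k, d)) for k in range(d + 1, k_max))

for d in range(1, 7):
    print(d, round(expected_iterations(d), 6), 2 * d + 1)   # agrees with 2d + 1
```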
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)
Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI, pp. 1641–1646. Pasadena CA (2009)
Chen, L., et al.: Decision transformer: reinforcement learning via sequence modeling (2021)
Clymer, J., et al.: Generalization analogies (GENIES): a testbed for generalizing AI oversight to hard-to-measure domains. arXiv preprint arXiv:2311.07723 (2023)
Conitzer, V., et al.: Social choice for AI alignment: dealing with diverse human feedback. arXiv preprint arXiv:2404.10271 (2024)
Dalrymple, D., et al.: Towards guaranteed safe AI: a framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624 (2024)
Feinberg, E.A., Sonin, I.: Notes on equivalent stationary policies in Markov decision processes with total rewards. Math. Meth. Oper. Res. 44(2), 205–221 (1996). https://doi.org/10.1007/BF01194331
Kern-Isberner, G., Spohn, W.: Inductive reasoning, conditionals, and belief dynamics. J. Appl. Log. 2631(1), 89 (2024)
Miryoosefi, S., Brantley, K., Daumé, H., Dudík, M., Schapire, R.E.: Reinforcement learning with convex constraints. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019)
Simon, H.A.: Rational choice and the structure of the environment. Psychol. Rev. 63(2), 129 (1956)
Skalse, J.M.V., Farrugia-Roberts, M., Russell, S., Abate, A., Gleave, A.: Invariance in policy optimisation and partial identifiability in reward learning. In: International Conference on Machine Learning, pp. 32033–32058. PMLR (2023)
Subramani, R., et al.: On the expressivity of objective-specification formalisms in reinforcement learning. arXiv preprint arXiv:2310.11840 (2023)
Taylor, J.: Quantilizers: a safer alternative to maximizers for limited optimization (2015). https://intelligence.org/files/QuantilizersSaferAlternative.pdf
Tschantz, A., et al.: Reinforcement learning through active inference (2020)
Vaidya, P.: Speeding-up linear programming using fast matrix multiplication. In: 30th Annual Symposium on Foundations of Computer Science, pp. 332–337 (1989)
Vamplew, P., Foale, C., Dazeley, R., Bignold, A.: Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Eng. Appl. Artif. Intell. 100, 104186 (2021)
Wendel, J.G.: A problem in geometric probability. Math. Scand. 11(1), 109–111 (1962)
Yen, I.E.H., Zhong, K., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Sparse linear programming via primal and dual augmented coordinate descent. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Acknowledgments
We thank the members of the SatisfIA project, AI Safety Camp, the Supervised Program for Alignment Research, and the organizers of the Virtual AI Safety Unconference.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare.
CRediT Author Statement
Authors are listed in alphabetical order and contributed equally. Simon Dima: Formal analysis, Writing - Original Draft, Writing - Review & Editing. Simon Fischer: Formal analysis, Writing - Original Draft, Writing - Review & Editing. Jobst Heitzig: Conceptualization, Methodology, Software, Writing - Original Draft, Writing - Review & Editing, Supervision. Joss Oliver: Formal analysis, Writing - Original Draft, Writing - Review & Editing.
Electronic supplementary material
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland
About this paper
Cite this paper
Dima, S., Fischer, S., Heitzig, J., Oliver, J. (2025). Non-maximizing Policies that Fulfill Multi-criterion Aspirations in Expectation. In: Freeman, R., Mattei, N. (eds) Algorithmic Decision Theory. ADT 2024. Lecture Notes in Computer Science, vol. 15248. Springer, Cham. https://doi.org/10.1007/978-3-031-73903-3_8
DOI: https://doi.org/10.1007/978-3-031-73903-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73902-6
Online ISBN: 978-3-031-73903-3
eBook Packages: Computer Science, Computer Science (R0)