Abstract
How can we make sure that AI systems align with human values and norms? An important step towards this goal is to develop a method for measuring value alignment in AI. Unless we can measure value alignment, we cannot adjudicate whether one AI is better aligned with human morality than another. The aim of this paper is to develop two quantitative measures of value alignment that estimate how well an AI system aligns with human values or norms. The theoretical basis of the measures we propose is the theory of conceptual spaces (Gärdenfors 1990, 2000, 2014, Douven and Gärdenfors 2020, and Strössner 2022). The key idea is to represent values and norms as geometric regions in multidimensional similarity spaces (Peterson 2017 and Verheyen & Peterson 2021). Using conceptual spaces to measure value alignment has several advantages over alternative measures based on expected utility losses, because it does not require researchers to explicitly assign utilities to moral “losses” ex ante. As proof of concept, we apply our measures to three examples: ChatGPT-3, a medical AI classifier developed by Brajer et al., and COMPAS, a controversial AI tool that assists judges in making bail and sentencing decisions. One of our findings is that ChatGPT-3 is so poorly aligned with human morality that it is pointless to apply our measures to it.
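To give a concrete, if simplified, flavor of the geometric approach, the sketch below is ours and is not the measure developed in the paper: it embeds hypothetical human and AI dissimilarity judgments in low-dimensional similarity spaces via multidimensional scaling (cf. Kruskal and Wish 1978) and quantifies their mismatch with a Procrustes disparity. All data values and function names are illustrative assumptions.

```python
# A minimal, hypothetical sketch of the conceptual-spaces idea (not the
# measure defined in this paper): embed human and AI dissimilarity
# judgments with multidimensional scaling and compare the configurations.
import numpy as np
from scipy.spatial import procrustes
from sklearn.manifold import MDS

# Hypothetical symmetric dissimilarity matrices over the same four moral
# cases: one elicited from human subjects, one probed from an AI system.
human_dissim = np.array([
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
])
ai_dissim = np.array([
    [0.0, 0.3, 0.6, 0.8],
    [0.3, 0.0, 0.7, 0.9],
    [0.6, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

def embed(dissim, dims=2, seed=0):
    """Embed a dissimilarity matrix in a low-dimensional similarity space."""
    mds = MDS(n_components=dims, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(dissim)

human_space = embed(human_dissim)
ai_space = embed(ai_dissim)

# Procrustes analysis factors out rotation, translation and scale; the
# residual disparity is used here as a crude misalignment score
# (0 = identical configurations, larger values = poorer alignment).
_, _, disparity = procrustes(human_space, ai_space)
print(f"Illustrative misalignment score: {disparity:.3f}")
```

Note that the measures proposed in the paper operate on regions representing values and norms within such similarity spaces, not on raw point configurations as in this toy comparison.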
Notes
EU High-Level Expert Group on AI.
See [36].
Jobin et al. [17] (2019: 389).
Gabriel and Ghazavi (2021: 9).
Kim et al. [18].
There exist several psychological tests for determining whether dimensions are integral or separable, see Maddox [20].
For the case of color properties, this hypothesis has been given strong support by Jäger [16].
Peterson [26] discusses the idea that conceptual spaces are useful for aligning AI with human values but offers no measure assessing the degree to which the conceptual space of an AI aligns with that of a human.
In addition, note that the alignment measure does not require any form of probability judgments.
The data reported here are for the March 2023 version of ChatGPT-3. We also tried the May 2023 version but observed the same instability and misalignment as reported here.
An asterisk means that no human similarity data are available.
If the AUC value is exactly 0.75, both principles would technically be applicable. This would literally be a borderline case for both principles. For a discussion of this type of case and how to handle it, see Peterson [25].
We here assume that the negative value of misalignment is linear. If this is not the case, the measure could easily be modified to account for this.
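For instance (the notation here is ours, not the paper's), if $m \ge 0$ denotes the degree of misalignment, the linearity assumption corresponds to a disvalue of $-m$; a convex alternative such as

$$V(m) = -m^{2}$$

would penalize large misalignments disproportionately more than small ones.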
References
Aliman, N.M., Kester, L.: Requisite variety in ethical utility functions for AI value alignment. arXiv preprint arXiv:1907.00430 (2019)
Bostrom, N.: Superintelligence: paths, dangers, strategies. Oxford University Press, Oxford (2014)
Brajer, N., Cozzi, B., Gao, M., Nichols, M., Revoir, M., Balu, S., et al.: Prospective and external evaluation of a machine learning model to predict in-hospital mortality of adults at time of admission. JAMA Netw. Open 3(2), e1920733–e1920733 (2020)
Brown, C.: Consequentialize this. Ethics 121, 749–771 (2011)
Douven, I.: Putting prototypes in place. Cognition 193, 104007 (2019)
Douven, I., Gärdenfors, P.: “What are natural concepts?” A design perspective. Mind Lang. 35, 313–334 (2020)
Dreier, J.: Structures of normative theories. Monist 76, 22–40 (1993)
EU High-Level Expert Group on AI: Ethics guidelines for trustworthy AI. Shaping Europe’s digital future (2019). https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai
Feller, A., Pierson, E., Corbett-Davies, S., Goel, S.: A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear. The Washington Post, October 17 (2016)
Fitelson, B.: A probabilistic theory of coherence. Analysis 63, 194–199 (2003)
Gabriel, I., Ghazavi, V.: The challenge of value alignment: from fairer algorithms to AI safety. In: Véliz, C. (ed.) The Oxford handbook of digital ethics. Oxford University Press, Oxford (2021)
Gärdenfors, P.: Induction, conceptual spaces and AI. Philos. Sci. 57(1), 78–95 (1990)
Gärdenfors, P.: Conceptual spaces: the geometry of thought. MIT Press, Cambridge (2004)
Gärdenfors, P.: The geometry of meaning: semantics based on conceptual spaces. MIT Press, Cambridge (2014)
IBM: Value alignment. https://www.ibm.com/design/ai/ethics/value-alignment/. Accessed 15 Sept 2022
Jäger, G.: Natural color categories are convex sets. In: Aloni, M., Bastiaanse, H., de Jager, T., Schulz, K. (eds.) Logic, language and meaning, pp. 11–20. Springer, Berlin (2010)
Jobin, A., Ienca, M., Vayena, E.: The global landscape of AI ethics guidelines. Nat. Mach. Intell. 1, 389–399 (2019)
Kim, T.W., Hooker, J., Donaldson, T.: Taking principles seriously: a hybrid approach to value alignment in artificial intelligence. J. Artif. Intell. Res. 70, 871–890 (2021)
Kruskal, J.B., Wish, M.: Multidimensional Scaling. Sage, Beverly Hills (1978). https://doi.org/10.4135/9781412985130
Maddox, W.T.: Perceptual and decisional separability. Lawrence Erlbaum Associates, Inc. (1992)
Mehrabian, A.: Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Curr. Psychol. 14, 261–292 (1996)
Ng, A.Y., Russell, S.: Algorithms for inverse reinforcement learning. In: Proceedings of the International Conference on Machine Learning (ICML), vol. 1, p. 2 (2000)
Olsson, E.J.: What is the problem of coherence and truth? J. Philos. 99(5), 246–272 (2002)
Peterson, M.: The dimensions of consequentialism: ethics, equality and risk. Cambridge University Press, Cambridge (2013)
Peterson, M.: The ethics of technology: a geometric analysis of five moral principles. Oxford University Press, Oxford (2017)
Peterson, M.: The value alignment problem: a geometric approach. Ethics Inf. Technol. 21, 19–28 (2019)
Portmore, D.W.: Consequentializing moral theories. Pac. Philos. Q. 88, 39–73 (2007)
Ross, R.T.: A statistic for circular scales. J. Educ. Psychol. 29, 384–389 (1938)
Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)
Russell, J.A.: Core affect and the psychological construction of emotion. Psychol. Rev. 110, 145 (2003)
Shah, R., Lewis, M.: Locating the neutral expression in the facial-emotion space. Vis. Cogn. 10, 549–566 (2003)
Shogenji, T.: Is coherence truth conducive? Analysis 59(4), 338–345 (1999)
Schupbach, J.N.: New hope for Shogenji’s coherence measure. Br. J. Philos. Sci. 62(1), 125–142 (2011)
Strößner, C.: Criteria for naturalness in conceptual spaces. Synthese 78, 14–36 (2022)
Verheyen, S., Peterson, M.: Can we use conceptual spaces to model moral principles? Rev. Philos. Psychol. 12, 373–395 (2021)
White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People. The White House (2022). https://www.whitehouse.gov/ostp/ai-bill-of-rights/
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Peterson, M., Gärdenfors, P. How to measure value alignment in AI. AI Ethics 4, 1493–1506 (2024). https://doi.org/10.1007/s43681-023-00357-7