
How to measure value alignment in AI

  • Original Research
  • Published in: AI and Ethics

Abstract

How can we make sure that AI systems align with human values and norms? An important step towards reaching this goal is to develop a method for measuring value alignment in AI. Unless we can measure value alignment, we cannot adjudicate whether one AI is better aligned with human morality than another. The aim of this paper is to develop two quantitative measures of value alignment that estimate how well an AI system aligns with human values or norms. The theoretical basis of the measures we propose is the theory of conceptual spaces (Gärdenfors 1990, 2000, 2014; Douven and Gärdenfors 2020; Strößner 2022). The key idea is to represent values and norms as geometric regions in multidimensional similarity spaces (Peterson 2017; Verheyen and Peterson 2021). Using conceptual spaces for measuring value alignment has several advantages over alternative measures based on expected utility losses, because our approach does not require researchers to explicitly assign utilities to moral “losses” ex ante. As proof of concept, we apply our measures to three examples: ChatGPT-3, a medical AI classifier developed by Brajer et al., and, finally, COMPAS, a controversial AI tool assisting judges in making bail and sentencing decisions. One of our findings is that ChatGPT-3 is so poorly aligned with human morality that it is pointless to apply our measures to it.
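To make the geometric idea concrete, the following is a minimal sketch, not the measure defined in the paper, of how one might score the alignment between a human concept region and an AI concept region embedded in the same similarity space. The coordinates, the centroid-based prototypes, and the name alignment_score are illustrative assumptions only.

```python
import numpy as np

def alignment_score(human_points: np.ndarray, ai_points: np.ndarray) -> float:
    """Toy alignment score in [0, 1]: 1 if the AI's exemplars of a moral
    concept coincide with the human exemplars in the similarity space, lower
    the further apart the two regions lie (relative to the data's diameter).

    Illustrative stand-in only; not the measure proposed in the paper.
    """
    # Prototype (centroid) of each region, following the conceptual-spaces
    # idea that a concept is a convex region organised around a prototype.
    human_proto = human_points.mean(axis=0)
    ai_proto = ai_points.mean(axis=0)

    # Normalise the prototype distance by the diameter of the combined data,
    # so the score does not depend on the arbitrary scale of the space.
    all_points = np.vstack([human_points, ai_points])
    pairwise = np.linalg.norm(all_points[:, None, :] - all_points[None, :, :], axis=-1)
    diameter = pairwise.max()
    if diameter == 0:
        return 1.0
    return 1.0 - np.linalg.norm(human_proto - ai_proto) / diameter

# Hypothetical 2-D similarity-space coordinates for exemplars of one norm,
# e.g. as obtained from multidimensional scaling of similarity judgments.
human = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25]])
ai = np.array([[0.60, 0.70], [0.55, 0.65]])
print(f"alignment = {alignment_score(human, ai):.2f}")
```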


Notes

  1. EU High-Level Expert Group on AI.

  2. See [36].

  3. Jobin et al. [17] (2019: 389).

  4. The problem of value alignment has of course been extensively discussed in recent years, but the quantitative measures we propose are novel. Cf. Kim et al. [18], Montes et al. [25], and Peterson [26].

  5. For useful discussions, see, e.g., Ng and Russell [22], Gabriel and Ghazavi [11].

  6. Gabriel and Ghazavi (2021: 9).

  7. Kim et al. [18].

  8. See, for instance, Aliman and Kester [1], but also Bostrom [2] and Russell et al. [35].

  9. For discussions of how to “consequentialize” nonconsequentialist moral theories, see Dreier [7], Portmore [27], Peterson [24], and Brown [4].

  10. Dreier [7], Portmore [27], Peterson [24], and Brown [4].

  11. See, e.g., Shogenji [32], Olsson [23], Fitelson [10], and Schupbach [33].

  12. There exist several psychological tests for determining whether dimensions are integral or separable; see Maddox [20].

  13. For the case of color properties, this hypothesis has been given strong support by Jäger [16].

  14. Peterson [26] discusses the idea that conceptual spaces are useful for aligning AI with human values but offers no measure assessing the degree to which the conceptual space of an AI aligns with that of a human.

  15. In addition, note that the alignment measure does not require any form of probability judgments.

  16. The data reported here are for the March 2023 version of ChatGPT-3. We also tried the May 2023 version but observed the same instability and misalignment reported here.

  17. An asterisk means that no human similarity data are available.

  18. If the AUC value is exactly 0.75, both principles would technically be applicable; this would literally be a borderline case for both principles. For a discussion of this type of case and how to handle it, see Peterson [25].
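As an illustration only, the kind of cut-off rule presupposed in this note can be sketched as follows; the 0.75 threshold comes from the note, but the principle labels and the function applicable_principles are placeholders rather than the paper's definitions.

```python
def applicable_principles(auc: float, cutoff: float = 0.75) -> list[str]:
    """Toy cut-off rule: which principle(s) apply for a given AUC value.

    At exactly `cutoff`, both clauses fire, reproducing the borderline case
    discussed in the note. Principle names are placeholders.
    """
    principles = []
    if auc <= cutoff:
        principles.append("principle for weakly discriminating classifiers")
    if auc >= cutoff:
        principles.append("principle for strongly discriminating classifiers")
    return principles

print(applicable_principles(0.74))  # one principle applies
print(applicable_principles(0.75))  # both apply: the borderline case
```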

  19. Here we assume that the negative value of misalignment is linear in the degree of misalignment. If this is not the case, the measure could easily be modified to account for it, as in the sketch below.
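A minimal sketch of that modification, under the assumption that per-item misalignments are aggregated by summation: each misalignment is passed through a disvalue function before being summed, so a nonlinear penalty can be substituted for the linear default. The function names and the convex example are hypothetical, not taken from the paper.

```python
from typing import Callable, Iterable

def aggregate_misalignment(
    misalignments: Iterable[float],
    penalty: Callable[[float], float] = lambda m: m,
) -> float:
    """Sum per-item misalignments under a disvalue function `penalty`.

    The identity default encodes the linearity assumption made in the note;
    a convex penalty makes large misalignments count disproportionately more.
    """
    return sum(penalty(m) for m in misalignments)

scores = [0.1, 0.2, 0.6]
print(round(aggregate_misalignment(scores), 2))                    # linear: 0.9
print(round(aggregate_misalignment(scores, lambda m: m ** 2), 2))  # convex: 0.41
```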

References

  1. Aliman, N.M., Kester, L.: Requisite variety in ethical utility functions for AI value alignment. arXiv preprint arXiv:1907.00430 (2019)

  2. Bostrom, N.: Superintelligence: paths, dangers, strategies. Oxford University Press, Oxford (2014)

  3. Brajer, N., Cozzi, B., Gao, M., Nichols, M., Revoir, M., Balu, S., et al.: Prospective and external evaluation of a machine learning model to predict in-hospital mortality of adults at time of admission. JAMA Netw. Open 3(2), e1920733 (2020)

  4. Brown, C.: Consequentialize this. Ethics 121, 749–771 (2011)

  5. Douven, I.: Putting prototypes in place. Cognition 193, 104007 (2019)

  6. Douven, I., Gärdenfors, P.: “What are natural concepts?” A design perspective. Mind Lang. 35, 313–334 (2020)

  7. Dreier, J.: Structures of normative theories. Monist 76, 22–40 (1993)

  8. EU High-Level Expert Group on AI: Ethics guidelines for trustworthy AI. Shaping Europe’s digital future (2019). https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai

  9. Feller, A., Pierson, E., Corbett-Davies, S., Goel, S.: A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear. The Washington Post, October 17 (2016)

  10. Fitelson, B.: A probabilistic theory of coherence. Analysis 63, 194–199 (2003)

  11. Gabriel, I., Ghazavi, V.: The challenge of value alignment: from fairer algorithms to AI safety. In: Véliz, C. (ed.) The Oxford handbook of digital ethics. Oxford University Press, Oxford (2021)

  12. Gärdenfors, P.: Induction, conceptual spaces and AI. Philos. Sci. 57(1), 78–95 (1990)

  13. Gärdenfors, P.: Conceptual spaces: the geometry of thought. MIT Press, Cambridge (2004)

  14. Gärdenfors, P.: The geometry of meaning: semantics based on conceptual spaces. MIT Press, Cambridge (2014)

  15. IBM: “Value alignment”. https://www.ibm.com/design/ai/ethics/value-alignment/. Accessed 15 Sept 2022

  16. Jäger, G.: Natural color categories are convex sets. In: Aloni, M., Bastiaanse, H., de Jager, T., Schulz, K. (eds.) Logic, language and meaning, pp. 11–20. Springer, Berlin (2010)

  17. Jobin, A., Ienca, M., Vayena, E.: The global landscape of AI ethics guidelines. Nat. Mach. Intell. 1, 389–399 (2019)

  18. Kim, T.W., Hooker, J., Donaldson, T.: Taking principles seriously: a hybrid approach to value alignment in artificial intelligence. J. Artif. Intell. Res. 70, 871–890 (2021)

  19. Kruskal, J.B., Wish, M.: Multidimensional scaling. Sage, Beverly Hills (1978). https://doi.org/10.4135/9781412985130

  20. Maddox, W.T.: Perceptual and decisional separability. Lawrence Erlbaum Associates, Inc. (1992)

  21. Mehrabian, A.: Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Curr. Psychol. 14, 261–292 (1996)

  22. Ng, A.Y., Russell, S.: Algorithms for inverse reinforcement learning. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML) (2000)

  23. Olsson, E.J.: What is the problem of coherence and truth? J. Philos. 99(5), 246–272 (2002)

  24. Peterson, M.: The dimensions of consequentialism: ethics, equality and risk. Cambridge University Press, Cambridge (2013)

  25. Peterson, M.: The ethics of technology: a geometric analysis of five moral principles. Oxford University Press, Oxford (2017)

  26. Peterson, M.: The value alignment problem: a geometric approach. Ethics Inf. Technol. 21, 19–28 (2019)

  27. Portmore, D.W.: Consequentializing moral theories. Pac. Philos. Q. 88, 39–73 (2007)

  28. Ross, R.T.: A statistic for circular scales. J. Educ. Psychol. 29, 384–389 (1938)

  29. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)

  30. Russell, J.A.: Core affect and the psychological construction of emotion. Psychol. Rev. 110, 145 (2003)

  31. Shah, R., Lewis, M.: Locating the neutral expression in the facial-emotion space. Vis. Cogn. 10, 549–566 (2003)

  32. Shogenji, T.: Is coherence truth conducive? Analysis 59(4), 338–345 (1999)

  33. Schupbach, J.N.: New hope for Shogenji’s coherence measure. Br. J. Philos. Sci. 62(1), 125–142 (2011)

  34. Strößner, C.: Criteria for naturalness in conceptual spaces. Synthese 78, 14–36 (2022)

  35. Verheyen, S., Peterson, M.: Can we use conceptual spaces to model moral principles? Rev. Philos. Psychol. 12, 373–395 (2021)

  36. White House Office of Science and Technology Policy: Blueprint for an AI Bill of Rights: making automated systems work for the American people. The White House (2022). https://www.whitehouse.gov/ostp/ai-bill-of-rights/


Author information


Corresponding author

Correspondence to Martin Peterson.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Peterson, M., Gärdenfors, P. How to measure value alignment in AI. AI Ethics 4, 1493–1506 (2024). https://doi.org/10.1007/s43681-023-00357-7

