
How to measure value alignment in AI

  • Original Research
  • Published in: AI and Ethics

Abstract

How can we make sure that AI systems align with human values and norms? An important step towards reaching this goal is to develop a method for measuring value alignment in AI. Unless we can measure value alignment, we cannot adjudicate whether one AI is better aligned with human morality than another. The aim of this paper is to develop two quantitative measures of value alignment that estimate how well an AI system aligns with human values or norms. The theoretical basis of the measures we propose is the theory of conceptual spaces (Gärdenfors 1990, 2000, 2014; Douven and Gärdenfors 2020; Strößner 2022). The key idea is to represent values and norms as geometric regions in multidimensional similarity spaces (Peterson 2017; Verheyen and Peterson 2021). Using conceptual spaces for measuring value alignment has several advantages over alternative measures based on expected utility losses, because our approach does not require researchers to explicitly assign utilities to moral “losses” ex ante. As proof of concept, we apply our measures to three examples: ChatGPT-3, a medical AI classifier developed by Brajer et al., and, finally, COMPAS, a controversial AI tool assisting judges in making bail and sentencing decisions. One of our findings is that ChatGPT-3 is so poorly aligned with human morality that it is pointless to apply our measures to it.
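To make the geometric idea concrete, the following is a minimal sketch, not the measure defined in the paper, of how one might score the alignment between a human concept region and an AI concept region embedded in the same similarity space. The coordinates, the centroid-based prototypes, and the name alignment_score are illustrative assumptions only.

```python
import numpy as np

def alignment_score(human_points: np.ndarray, ai_points: np.ndarray) -> float:
    """Toy alignment score in [0, 1]: 1 if the AI's exemplars of a moral
    concept coincide with the human exemplars in the similarity space, lower
    the further apart the two regions lie (relative to the data's diameter).

    Illustrative stand-in only; not the measure proposed in the paper.
    """
    # Prototype (centroid) of each region, following the conceptual-spaces
    # idea that a concept is a convex region organised around a prototype.
    human_proto = human_points.mean(axis=0)
    ai_proto = ai_points.mean(axis=0)

    # Normalise the prototype distance by the diameter of the combined data,
    # so the score does not depend on the arbitrary scale of the space.
    all_points = np.vstack([human_points, ai_points])
    pairwise = np.linalg.norm(all_points[:, None, :] - all_points[None, :, :], axis=-1)
    diameter = pairwise.max()
    if diameter == 0:
        return 1.0
    return 1.0 - np.linalg.norm(human_proto - ai_proto) / diameter

# Hypothetical 2-D similarity-space coordinates for exemplars of one norm,
# e.g. as obtained from multidimensional scaling of similarity judgments.
human = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25]])
ai = np.array([[0.60, 0.70], [0.55, 0.65]])
print(f"alignment = {alignment_score(human, ai):.2f}")
```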


Notes

  1. EU High-Level Expert Group on AI.

  2. See [36].

  3. Jobin et al. [17] (2019: 389).

  4. The problem of value alignment has of course been extensively discussed in recent years, but the quantitative measures we propose are novel. Cf. Kim et al. [18], Montes et al. [25], and Peterson [26].

  5. For useful discussions, see, e.g., Ng and Russell [22], Gabriel and Ghazavi [11].

  6. Gabriel and Ghazavi (2021: 9).

  7. Kim et al. [18].

  8. See, for instance, Aliman and Kester [1], but also Bostrom [2] and Russell et al. [35].

  9. For discussions of how to “consequentialize” nonconsequentialist moral theories, see Dreier [7], Portmore [27], Peterson [24], and Brown [4].

  10. Dreier [7], Portmore [27], Peterson [24], and Brown [4].

  11. See, e.g., Shogenji [32], Olsson [23], Fitelson [10], and Schupbach [33].

  12. There exist several psychological tests for determining whether dimensions are integral or separable; see Maddox [20].

  13. For the case of color properties, this hypothesis has been given strong support by Jäger [16].

  14. Peterson [26] discusses the idea that conceptual spaces are useful for aligning AI with human values but offers no measure assessing the degree to which the conceptual space of an AI aligns with that of a human.

  15. In addition, note that the alignment measure does not require any form of probability judgments.

  16. The data reported here are for the March 2023 version of ChatGPT-3. We also tried the May 2023 version but observed the same instability and misalignment reported here.

  17. An asterisk means that no human similarity data are available.

  18. If the AUC value is exactly 0.75, both principles would technically be applicable; this would literally be a borderline case for both principles. For a discussion of this type of case and how to handle it, see Peterson [25].
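As an illustration only, the kind of cut-off rule presupposed in this note can be sketched as follows; the 0.75 threshold comes from the note, but the principle labels and the function applicable_principles are placeholders rather than the paper's definitions.

```python
def applicable_principles(auc: float, cutoff: float = 0.75) -> list[str]:
    """Toy cut-off rule: which principle(s) apply for a given AUC value.

    At exactly `cutoff`, both clauses fire, reproducing the borderline case
    discussed in the note. Principle names are placeholders.
    """
    principles = []
    if auc <= cutoff:
        principles.append("principle for weakly discriminating classifiers")
    if auc >= cutoff:
        principles.append("principle for strongly discriminating classifiers")
    return principles

print(applicable_principles(0.74))  # one principle applies
print(applicable_principles(0.75))  # both apply: the borderline case
```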

  19. Here we assume that the negative value of misalignment is linear in the degree of misalignment. If this is not the case, the measure could easily be modified to account for it, as in the sketch below.
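A minimal sketch of that modification, under the assumption that per-item misalignments are aggregated by summation: each misalignment is passed through a disvalue function before being summed, so a nonlinear penalty can be substituted for the linear default. The function names and the convex example are hypothetical, not taken from the paper.

```python
from typing import Callable, Iterable

def aggregate_misalignment(
    misalignments: Iterable[float],
    penalty: Callable[[float], float] = lambda m: m,
) -> float:
    """Sum per-item misalignments under a disvalue function `penalty`.

    The identity default encodes the linearity assumption made in the note;
    a convex penalty makes large misalignments count disproportionately more.
    """
    return sum(penalty(m) for m in misalignments)

scores = [0.1, 0.2, 0.6]
print(round(aggregate_misalignment(scores), 2))                    # linear: 0.9
print(round(aggregate_misalignment(scores, lambda m: m ** 2), 2))  # convex: 0.41
```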

References

  1. Aliman, N.M., Kester, L.: Requisite variety in ethical utility functions for AI value alignment. arXiv preprint arXiv:1907.00430 (2019)

  2. Bostrom, N.: Superintelligence: paths, dangers, strategies. Oxford University Press, Oxford (2014)

  3. Brajer, N., Cozzi, B., Gao, M., Nichols, M., Revoir, M., Balu, S., et al.: Prospective and external evaluation of a machine learning model to predict in-hospital mortality of adults at time of admission. JAMA Netw. Open 3(2), e1920733 (2020)

  4. Brown, C.: Consequentialize this. Ethics 121, 749–771 (2011)

  5. Douven, I.: Putting prototypes in place. Cognition 193, 104007 (2019)

  6. Douven, I., Gärdenfors, P.: “What are natural concepts?” A design perspective. Mind Lang. 35, 313–334 (2020)

  7. Dreier, J.: Structures of normative theories. Monist 76, 22–40 (1993)

  8. EU High-Level Expert Group on AI: Ethics guidelines for trustworthy AI. Shaping Europe’s digital future (2019). https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai

  9. Feller, A., Pierson, E., Corbett-Davies, S., Goel, S.: A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear. The Washington Post, October 17 (2016)

  10. Fitelson, B.: A probabilistic theory of coherence. Analysis 63, 194–199 (2003)

  11. Gabriel, I., Ghazavi, V.: The challenge of value alignment: from fairer algorithms to AI safety. In: Véliz, C. (ed.) The Oxford handbook of digital ethics. Oxford University Press, Oxford (2021)

  12. Gärdenfors, P.: Induction, conceptual spaces and AI. Philos. Sci. 57(1), 78–95 (1990)

  13. Gärdenfors, P.: Conceptual spaces: the geometry of thought. MIT Press, Cambridge (2004)

  14. Gärdenfors, P.: The geometry of meaning: semantics based on conceptual spaces. MIT Press, Cambridge (2014)

  15. IBM: “Value alignment”. https://www.ibm.com/design/ai/ethics/value-alignment/. Accessed 15 Sept 2022

  16. Jäger, G.: Natural color categories are convex sets. In: Aloni, M., Bastiaanse, H., de Jager, T., Schulz, K. (eds.) Logic, language and meaning, pp. 11–20. Springer, Berlin (2010)

  17. Jobin, A., Ienca, M., Vayena, E.: The global landscape of AI ethics guidelines. Nat. Mach. Intell. 1, 389–399 (2019)

  18. Kim, T.W., Hooker, J., Donaldson, T.: Taking principles seriously: a hybrid approach to value alignment in artificial intelligence. J. Artif. Intell. Res. 70, 871–890 (2021)

  19. Kruskal, J.B., Wish, M.: Multidimensional scaling. Sage, Beverly Hills (1978). https://doi.org/10.4135/9781412985130

  20. Maddox, W.T.: Perceptual and decisional separability. Lawrence Erlbaum Associates, Inc. (1992)

  21. Mehrabian, A.: Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Curr. Psychol. 14, 261–292 (1996)

  22. Ng, A.Y., Russell, S.: Algorithms for inverse reinforcement learning. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML) (2000)

  23. Olsson, E.J.: What is the problem of coherence and truth? J. Philos. 99(5), 246–272 (2002)

  24. Peterson, M.: The dimensions of consequentialism: ethics, equality and risk. Cambridge University Press, Cambridge (2013)

  25. Peterson, M.: The ethics of technology: a geometric analysis of five moral principles. Oxford University Press, Oxford (2017)

  26. Peterson, M.: The value alignment problem: a geometric approach. Ethics Inf. Technol. 21, 19–28 (2019)

  27. Portmore, D.W.: Consequentializing moral theories. Pac. Philos. Q. 88, 39–73 (2007)

  28. Ross, R.T.: A statistic for circular scales. J. Educ. Psychol. 29, 384–389 (1938)

  29. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)

  30. Russell, J.A.: Core affect and the psychological construction of emotion. Psychol. Rev. 110, 145 (2003)

  31. Shah, R., Lewis, M.: Locating the neutral expression in the facial-emotion space. Vis. Cogn. 10, 549–566 (2003)

  32. Shogenji, T.: Is coherence truth conducive? Analysis 59(4), 338–345 (1999)

  33. Schupbach, J.N.: New hope for Shogenji’s coherence measure. Br. J. Philos. Sci. 62(1), 125–142 (2011)

  34. Strößner, C.: Criteria for naturalness in conceptual spaces. Synthese 78, 14–36 (2022)

  35. Verheyen, S., Peterson, M.: Can we use conceptual spaces to model moral principles? Rev. Philos. Psychol. 12, 373–395 (2021)

  36. White House Office of Science and Technology Policy: Blueprint for an AI Bill of Rights: making automated systems work for the American people. The White House (2022). https://www.whitehouse.gov/ostp/ai-bill-of-rights/


Author information


Corresponding author

Correspondence to Martin Peterson.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Peterson, M., Gärdenfors, P. How to measure value alignment in AI. AI Ethics 4, 1493–1506 (2024). https://doi.org/10.1007/s43681-023-00357-7

