Evaluating Psychological Safety of Large Language Models

Li, Xingxuan; Li, Yutong; Qiu, Lin; Joty, Shafiq; Bing, Lidong

Computer Science > Computation and Language

arXiv:2212.10529 (cs)

[Submitted on 20 Dec 2022 (v1), last revised 29 Feb 2024 (this version, v3)]

Title:Evaluating Psychological Safety of Large Language Models

Authors:Xingxuan Li, Yutong Li, Lin Qiu, Shafiq Joty, Lidong Bing

View PDF HTML (experimental)

Abstract:In this work, we designed unbiased prompts to systematically evaluate the psychological safety of large language models (LLMs). First, we tested five different LLMs by using two personality tests: Short Dark Triad (SD-3) and Big Five Inventory (BFI). All models scored higher than the human average on SD-3, suggesting a relatively darker personality pattern. Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns; these models scored higher than self-supervised GPT-3 on the Machiavellianism and narcissism traits on SD-3. Then, we evaluated the LLMs in the GPT series by using well-being tests to study the impact of fine-tuning with more training data. We observed a continuous increase in the well-being scores of GPT models. Following these observations, we showed that fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model. Based on the findings, we recommended the application of systematic and comprehensive psychological metrics to further evaluate and improve the safety of LLMs.

Comments:	Preprint. Under review
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:	arXiv:2212.10529 [cs.CL]
	(or arXiv:2212.10529v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.10529

Submission history

From: Xingxuan Li [view email]
[v1] Tue, 20 Dec 2022 18:45:07 UTC (72 KB)
[v2] Mon, 8 May 2023 16:52:43 UTC (6,582 KB)
[v3] Thu, 29 Feb 2024 13:14:37 UTC (6,922 KB)

Computer Science > Computation and Language

Title:Evaluating Psychological Safety of Large Language Models

Submission history

Access Paper:

References & Citations

1 blog link

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:Evaluating Psychological Safety of Large Language Models

Submission history

Access Paper:

References & Citations

1 blog link

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators