Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

Kumar, Shachi H; Sahay, Saurav; Mazumder, Sahisnu; Okur, Eda; Manuvinakurike, Ramesh; Beckage, Nicole; Su, Hsuan; Lee, Hung-yi; Nachman, Lama

Computer Science > Computation and Language

arXiv:2408.03907 (cs)

[Submitted on 7 Aug 2024]

Abstract: Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks in which malicious users prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards and consensus, and existing methods often rely on human-generated templates and annotations, which are expensive and labor-intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM-based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.

Comments:	6 pages paper content, 17 pages of appendix
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.03907 [cs.CL]
	(or arXiv:2408.03907v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2408.03907

Submission history

From: Shachi Hullumane Kumar
[v1] Wed, 7 Aug 2024 17:11:34 UTC (11,607 KB)
