Abstract
Large language models (LLMs) are pre-trained on extensive corpora to learn facts and aspects of human cognition, which also encode human preferences. However, this process can inadvertently lead these models to acquire the biases and stereotypes prevalent in society. Prior research has typically approached bias from a one-dimensional perspective, concentrating either on locating it or on mitigating it. This limited perspective has made it difficult for the two lines of bias research to complement and build upon one another. In this study, we integrate the processes of locating and mitigating bias within a unified framework. First, we use causal mediation analysis to trace the causal effects of the activations of different components within a large language model. Building on this, we propose LSDM (Least Square Debias Method), a knowledge-editing-based method for mitigating gender bias in occupational pronouns, and compare it against two baselines on three gender bias datasets and seven knowledge competency test datasets. The experimental results indicate that the primary contributors to gender bias are the bottom MLP modules acting on the last token of occupational pronouns and the top attention modules acting on the final word in the sentence. Furthermore, LSDM mitigates gender bias in the model more effectively than the other methods, reducing gender bias in occupational pronouns by 71.4% while fully preserving the model's capabilities in all other aspects.
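To illustrate the kind of knowledge-editing update that a method like LSDM builds on, the sketch below shows a generic rank-one least-squares edit in the style of Meng et al. [18]: a weight matrix of an MLP module is modified so that the key vector computed at the last token of an occupational pronoun maps to a new value vector (e.g., one chosen so that gendered continuations receive balanced probability), while disturbing other keys as little as possible. This is a minimal sketch under those assumptions; the function name, variable names, and the choice of the target value are illustrative, not the paper's published LSDM formulation.

```python
import numpy as np

def rank_one_least_squares_edit(W, k_star, v_star, C):
    """Hypothetical ROME-style rank-one least-squares edit (cf. Meng et al. [18]).

    W      : (d_out, d_in) MLP projection matrix to edit.
    k_star : (d_in,) key vector at the last occupational-pronoun token.
    v_star : (d_out,) desired value vector (e.g., chosen to balance "he"/"she").
    C      : (d_in, d_in) second-moment matrix E[k k^T] of keys from generic text.
    """
    # Direction of the update: C^{-1} k*, which emphasizes k_star while
    # down-weighting directions shared with other, unrelated keys.
    c_inv_k = np.linalg.solve(C, k_star)
    # How far the current output at k_star is from the desired value.
    residual = v_star - W @ k_star
    # Rank-one correction scaled so the edited matrix maps k_star exactly to v_star.
    W_new = W + np.outer(residual, c_inv_k) / (c_inv_k @ k_star)
    return W_new
```

The covariance term C is what makes the edit "least squares" in spirit: it spreads the correction along directions specific to the edited key, so outputs for unrelated inputs, and hence the model's other knowledge, are largely preserved.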
Notes
1. We set ν to be three times larger than the empirical standard deviation of embeddings. Refer to Meng et al. [18] for specifics.
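For concreteness, the hypothetical helper below shows one way this scale could be computed from a model's token-embedding table, following the convention described by Meng et al. [18]; the function name and interface are assumptions made for illustration only.

```python
import numpy as np

def corruption_noise_scale(embedding_matrix: np.ndarray, factor: float = 3.0) -> float:
    """Return nu = factor * (empirical std of the token embeddings).

    embedding_matrix is assumed to be the (vocab_size, d_model) embedding
    table; nu is the scale of the Gaussian noise used to corrupt subject
    embeddings during causal tracing.
    """
    return factor * float(embedding_matrix.std())
```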
References
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al.: PIQA: Reasoning about physical commonsense in natural language. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 7432–7439 (2020)
Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Adv. Neural Inf. Process. Syst. 29 (2016)
Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from language corpora contain human-like biases. Science 356(6334), 183–186 (2017)
Cheng, S., et al.: Can we edit multimodal large language models? (2023). arXiv:2310.08475
Cheng, S., et al.: Editing language model-based knowledge graph embeddings (2023). arXiv:2301.10405
Choi, J.H., Hickman, K.E., Monahan, A., Schwarcz, D.: ChatGPT goes to law school (2023)
Cohen, D., et al.: Dynamic planning in open-ended dialogue using reinforcement learning (2022). arXiv:2208.02294
Dai, D., Dong, L., Hao, Y., Sui, Z., Wei, F.: Knowledge neurons in pretrained transformers (2021). arXiv:2104.08696
Ferrara, E.: Should ChatGPT be biased? Challenges and risks of bias in large language models (2023). arXiv:2304.03738
Gandikota, R., Materzynska, J., Fiotto-Kaufman, J., Bau, D.: Erasing concepts from diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2426–2436 (2023)
Garg, N., Schiebinger, L., Jurafsky, D., Zou, J.: Word embeddings quantify 100 years of gender and ethnic stereotypes. Proc. Natl. Acad. Sci. U.S.A. 115(16), E3635–E3644 (2018)
Geva, M., Bastings, J., Filippova, K., Globerson, A.: Dissecting recall of factual associations in auto-regressive language models (2023). arXiv:2304.14767
Geva, M., Caciularu, A., Wang, K., Goldberg, Y.: Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space (2022). arXiv:2203.14680
Gilson, A., et al.: How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 9(1), e45312 (2023)
Guo, Y., Yang, Y., Abbasi, A.: Auto-debias: debiasing masked language models with automated biased prompts. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1012–1023 (2022)
Kohonen, T.: Correlation matrix memories. IEEE Trans. Comput. C-21, 353–359 (1972)
May, C., Wang, A., Bordia, S., Bowman, S.R., Rudinger, R.: On measuring social biases in sentence encoders (2019). arXiv:1903.10561
Meng, K., Bau, D., Andonian, A., Belinkov, Y.: Locating and editing factual associations in GPT. Adv. Neural. Inf. Process. Syst. 35, 17359–17372 (2022)
Meng, K., Sharma, A.S., Andonian, A., Belinkov, Y., Bau, D.: Mass-editing memory in a transformer (2022). arXiv:2210.07229
Paperno, D., et al.: The LAMBADA dataset: word prediction requiring a broad discourse context (2016). arXiv:1606.06031
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019). https://api.semanticscholar.org/CorpusID:160025533
Ramamurthy, R., et al.: Is reinforcement learning (not) for natural language processing? Benchmarks, baselines, and building blocks for natural language policy optimization (2022). arXiv:2210.01241
Roemmele, M., Bejan, C.A., Gordon, A.S.: Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In: 2011 AAAI Spring Symposium Series (2011)
Rudinger, R., Naradowsky, J., Leonard, B., Van Durme, B.: Gender bias in coreference resolution (2018). arXiv:1804.09301
Sap, M., Rashkin, H., Chen, D., LeBras, R., Choi, Y.: SocialIQA: commonsense reasoning about social interactions (2019). arXiv:1904.09728
Sun, T., et al.: Mitigating gender bias in natural language processing: literature review (2019). arXiv:1906.08976
Talmor, A., Herzig, J., Lourie, N., Berant, J.: CommonsenseQA: a question answering challenge targeting commonsense knowledge (2018). arXiv:1811.00937
Touvron, H., et al.: LLaMA: open and efficient foundation language models (2023). arXiv:2302.13971
Vig, J., et al.: Causal mediation analysis for interpreting neural NLP: the case of gender bias (2020). arXiv:2004.12265
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding (2018). arXiv:1804.07461
Wang, B., Komatsuzaki, A.: GPT-J-6B: a 6 billion parameter autoregressive language model (2021)
Webster, K., et al.: Measuring and reducing gendered correlations in pre-trained models (2020). arXiv:2010.06032
Yang, A., et al.: Baichuan 2: open large-scale language models (2023). arXiv:2309.10305
Yao, Y., et al.: Editing large language models: problems, methods, and opportunities (2023). arXiv:2305.13172
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: can a machine really finish your sentence? (2019). arXiv:1905.07830
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.W.: Gender bias in coreference resolution: evaluation and debiasing methods (2018). arXiv:1804.06876
Ziegler, D.M., et al.: Fine-tuning language models from human preferences (2019). arXiv:1909.08593
Zmigrod, R., Mielke, S.J., Wallach, H., Cotterell, R.: Counterfactual data augmentation for mitigating gender stereotypes in languages with rich morphology. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1651–1661 (2019)
Cai, Y., Cao, D., Guo, R., Wen, Y., Liu, G., Chen, E.: Locating and mitigating gender bias in large language models (2024). arXiv:2403.14409
Cai, Y., Cao, D., Guo, R., Wen, Y., Liu, G., Chen, E.: Editing knowledge representation of language model via rephrased prefix prompts (2024). arXiv:2403.14381
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Cai, Y., Cao, D., Guo, R., Wen, Y., Liu, G., Chen, E. (2024). Locating and Mitigating Gender Bias in Large Language Models. In: Huang, D.S., Si, Z., Zhang, C. (eds.) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol. 14878. Springer, Singapore. https://doi.org/10.1007/978-981-97-5672-8_40
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5671-1
Online ISBN: 978-981-97-5672-8