Weakly Supervised Detection of Hallucinations in LLM Activations

Rateike, Miriam; Cintas, Celia; Wamburu, John; Akumu, Tanya; Speakman, Skyler

arXiv:2312.02798 (cs)

[Submitted on 5 Dec 2023]

Abstract: We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
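
The abstract does not spell out the scan itself. As a rough illustration only, the sketch below assumes one common form of subset scanning: per-node empirical p-values computed against the clean reference activations, then scored with a Berk-Jones nonparametric scan statistic. The function names, the two-sided p-value rule, and the `alpha_max` cap are all assumptions made for this example and should not be read as the authors' implementation.

```python
import numpy as np


def empirical_pvalues(reference, test):
    """Two-sided empirical p-values of one input's activations (assumed rule).

    reference: (n_ref, n_nodes) activations from anomaly-free inputs.
    test:      (n_nodes,) activations of the input under audit.
    """
    n_ref = reference.shape[0]
    # Share of reference activations at least as large / as small as the test value.
    right = (np.sum(reference >= test, axis=0) + 1) / (n_ref + 1)
    left = (np.sum(reference <= test, axis=0) + 1) / (n_ref + 1)
    # Small when the test activation is extreme in either direction.
    return np.minimum(1.0, 2.0 * np.minimum(left, right))


def berk_jones(n_alpha, n, alpha):
    """Berk-Jones score: n * KL(observed fraction || alpha), zero if no excess."""
    observed = n_alpha / n
    if observed <= alpha:
        return 0.0
    if observed >= 1.0:
        return float(n * np.log(1.0 / alpha))
    return float(n * (observed * np.log(observed / alpha)
                      + (1.0 - observed) * np.log((1.0 - observed) / (1.0 - alpha))))


def scan_nodes(pvalues, alpha_max=0.5):
    """Return the best score and the node subset with p-value <= best threshold."""
    n = pvalues.size
    best_score, best_alpha = 0.0, None
    # Only observed p-values need to be tried as significance thresholds.
    for alpha in np.unique(pvalues[pvalues <= alpha_max]):
        n_alpha = int(np.sum(pvalues <= alpha))
        score = berk_jones(n_alpha, n, alpha)
        if score > best_score:
            best_score, best_alpha = score, alpha
    if best_alpha is None:
        return 0.0, np.array([], dtype=int)
    return best_score, np.flatnonzero(pvalues <= best_alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 50))       # synthetic stand-in for clean activations
    sample = rng.normal(size=50)
    sample[:5] = [6.0, -6.0, 6.0, -6.0, 6.0]     # inject shifts in both directions
    score, nodes = scan_nodes(empirical_pvalues(reference, sample))
    print(score, nodes)
```

In this toy setup the data are synthetic Gaussian draws, not real LLM activations. The returned node indices play the role of the "pivotal nodes" mentioned in the abstract, and the score can be compared across inputs to rank how anomalous each one looks relative to the reference set.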

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2312.02798 [cs.LG]
	(or arXiv:2312.02798v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2312.02798

Submission history

From: Miriam Rateike
[v1] Tue, 5 Dec 2023 14:35:11 UTC (2,256 KB)
