TY - JOUR
AU - Bencke, Luciana
AU - Paula, Felipe S. F.
AU - dos Santos, Bruno G. T.
AU - P. Moreira, Viviane
PY - 2024/10/14
TI - Can we trust LLMs as relevance judges?
JF - Anais do Simpósio Brasileiro de Banco de Dados (SBBD 2024)
DO - 10.5753/sbbd.2024.243130
N2 - Evaluation is key for Information Retrieval systems and requires test collections consisting of documents, queries, and relevance judgments. Obtaining relevance judgments is the most costly step in creating test collections because it demands human intervention. A recent trend in the area is to replace humans with Large Language Models (LLMs) as the source of relevance judgments. In this paper, we investigate how reliable LLMs are in this task. We experimented with different LLMs and test collections in Portuguese. Our results show that LLMs can yield promising performance that is competitive with human annotations.
UR - https://sol.sbc.org.br/index.php/sbbd/article/view/30724