Abstract
We consider the problem of evaluating retrieval systems with incomplete relevance judgments. Recently, Buckley and Voorhees showed that standard measures of retrieval performance are not robust to incomplete judgments, and they proposed a new measure, bpref, that is much more robust to incomplete judgments. Although bpref is highly correlated with average precision when the judgments are effectively complete, the value of bpref deviates from average precision and from its own value as the judgment set degrades, especially at very low levels of assessment. In this work, we propose three new evaluation measures (induced AP, subcollection AP, and inferred AP) that are equivalent to average precision when the relevance judgments are complete and that are statistical estimates of average precision when the relevance judgments are a random subset of complete judgments. We consider natural scenarios that yield highly incomplete judgments, such as random judgment sets or very shallow depth pools, and we compare and contrast the robustness of the three proposed measures with that of bpref in both scenarios. Using TREC data, we demonstrate that these measures are more robust to incomplete relevance judgments than bpref, both in terms of how well they estimate average precision (as measured with complete relevance judgments) and how well they estimate their own values computed with complete relevance judgments. Finally, since inferred AP is the most accurate approximation to average precision and the most robust measure in the presence of incomplete judgments, we provide a detailed analysis of this measure, both in terms of its behavior in theory and its implementation in practice.
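For concreteness, the sketch below illustrates how average precision is computed from a ranked list and a set of relevance judgments, together with a simplified "judged-only" variant in the spirit of induced AP, which scores the ranking after unjudged documents are removed. This is an illustrative outline only, not the authors' reference implementation or the trec_eval code; the function names (average_precision, induced_ap) and the toy qrels/run data are placeholders, and subcollection AP and inferred AP, which require the sampling-based estimation developed in the paper, are not shown.

```python
# Illustrative sketch (not the paper's implementation).
# qrels maps doc_id -> 1 (relevant) or 0 (nonrelevant); missing keys are unjudged.

def average_precision(ranking, qrels):
    """Standard average precision over a ranked list of doc ids.
    Unjudged documents are treated as nonrelevant (the usual convention)."""
    num_relevant = sum(1 for rel in qrels.values() if rel == 1)
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant

def induced_ap(ranking, qrels):
    """Simplified judged-only variant: drop unjudged documents from the
    ranked list, then compute average precision over what remains."""
    judged_only = [doc for doc in ranking if doc in qrels]
    return average_precision(judged_only, qrels)

# Toy usage: d5 is unjudged, so the two values differ; with complete
# judgments the judged-only variant reduces to standard average precision.
qrels = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}
run = ["d1", "d5", "d3", "d2", "d4"]
print(average_precision(run, qrels), induced_ap(run, qrels))
```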
References
Allan J (2004) HARD track overview in TREC 2004: High accuracy retrieval from documents. In: Proceedings of the 13th text retrieval conference (TREC 2004)
Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm. In: Callan J, Cormack G, Clarke C, Hawking D, Smeaton A (eds) Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 393–394
Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch, pooling, and system evaluation. In: Frieder O, Hammer J, Qureshi S, Seligman L (eds) Proceedings of the 12th international conference on information and knowledge management. ACM Press, pp 484–491
Aslam JA, Pavlu V, Yilmaz E (2006) A statistical method for system evaluation using incomplete judgments. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, pp 541–548
Buckley C (2006) ‘trec_eval’. http://trec.nist.gov/trec_eval/trec_eval.8.1.tar.gz
Buckley C, Voorhees EM (2004) Retrieval evaluation with incomplete information. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 25–32
Büttcher S, Clarke CLA, Soboroff I (2006) The TREC 2006 terabyte track. In: Proceedings of the 15th Text REtrieval Conference (TREC 2006)
Carterette B, Allan J, Sitaraman R (2006) Minimal test collections for retrieval evaluation. In: SIGIR ’06: proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 268–275
Chen SF, Goodman J (1996) An empirical study of smoothing techniques for language modeling. In: Proceedings of the 34th annual meeting of the association for computational linguistics. Morgan Kaufmann Publishers, San Francisco, pp 310–318
Clarke CLA, Scholer F, Soboroff I (2005) The TREC 2005 terabyte track. In: Proceedings of the 14th Text REtrieval conference (TREC 2005)
Cormack GV, Palmer CR, Clarke CLA (1998) Efficient construction of large test collections. In: Croft et al. (1998), pp 282–289
Croft WB, Moffat A, van Rijsbergen CJ, Wilkinson R, Zobel J (eds) (1998) Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York
Harman D (1995) Overview of the third Text REtrieval Conference (TREC-3). In: Harman D (ed) Overview of the 3rd Text REtrieval Conference (TREC-3), pp 1–19. US Government Printing Office, Washington D.C., Gaithersburg
Hawking D, Robertson S (2003) On collection size and retrieval effectiveness. Inf Retr 6(1): 99–105
Kagolovsky Y, Moehr JR (2003) Current status of the evaluation of information retrieval. J Med Syst 27(5): 409–424
Kraaij W, Over P, Smeaton A (2006) TRECVID 2006—an introduction. In: TREC video retrieval evaluation online proceedings
Kukar M (2006) Quality assessment of individual classifications in machine learning and data mining. Knowl Inf Syst 9(3): 364–384
Raghavan V, Bollmann P, Jung GS (1989) A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans Inf Syst 7(3): 205–229
Tombros A, van Rijsbergen CJ (2004) Query-sensitive similarity measures for information retrieval. Knowl Inf Syst 6(5): 617–642
Voorhees EM (2001) Evaluation by highly relevant documents. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 74–82
Voorhees EM (2002) The philosophy of information retrieval evaluation. In: CLEF ’01: revised papers from the 2nd workshop of the cross-language evaluation forum on evaluation of cross-language information retrieval systems. Springer, London, pp 355–370
Voorhees EM, Harman D (1999) Overview of the 7th Text REtrieval Conference (TREC-7). In: Proceedings of the 7th Text REtrieval Conference (TREC-7), pp 1–24
Yilmaz E, Aslam JA (2006) Estimating average precision with incomplete and imperfect judgments. In: Proceedings of the 15th ACM international conference on information and knowledge management, ACM Press, New York
Zobel J (1998) How reliable are the results of large-scale retrieval experiments? In: Croft et al. (1998), pp 307–314
Additional information
We gratefully acknowledge the support provided by NSF grants CCF-0418390 and IIS-0534482.
About this article
Cite this article
Yilmaz, E., Aslam, J.A. Estimating average precision when judgments are incomplete. Knowl Inf Syst 16, 173–211 (2008). https://doi.org/10.1007/s10115-007-0101-7