Abstract
We consider the problem of evaluating retrieval systems with incomplete relevance judgments. Recently, Buckley and Voorhees showed that standard measures of retrieval performance are not robust to incomplete judgments, and they proposed a new measure, bpref, that is much more robust to incomplete judgments. Although bpref is highly correlated with average precision when the judgments are effectively complete, the value of bpref deviates from average precision and from its own value as the judgment set degrades, especially at very low levels of assessment. In this work, we propose three new evaluation measures (induced AP, subcollection AP, and inferred AP) that are equivalent to average precision when the relevance judgments are complete and that are statistical estimates of average precision when the relevance judgments are a random subset of complete judgments. We consider natural scenarios that yield highly incomplete judgments, such as random judgment sets or very shallow depth pools, and we compare and contrast the robustness of the three proposed measures with that of bpref in both scenarios. Using TREC data, we demonstrate that these measures are more robust to incomplete relevance judgments than bpref, both in terms of how well they estimate average precision (as measured with complete relevance judgments) and how well they estimate their own values computed with complete relevance judgments. Finally, since inferred AP is the most accurate approximation to average precision and the most robust measure in the presence of incomplete judgments, we provide a detailed analysis of this measure, both in terms of its behavior in theory and its implementation in practice.
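For concreteness, the sketch below illustrates how average precision is computed from a ranked list and a set of relevance judgments, together with a simplified "judged-only" variant in the spirit of induced AP, which scores the ranking after unjudged documents are removed. This is an illustrative outline only, not the authors' reference implementation or the trec_eval code; the function names (average_precision, induced_ap) and the toy qrels/run data are placeholders, and subcollection AP and inferred AP, which require the sampling-based estimation developed in the paper, are not shown.

```python
# Illustrative sketch (not the paper's implementation).
# qrels maps doc_id -> 1 (relevant) or 0 (nonrelevant); missing keys are unjudged.

def average_precision(ranking, qrels):
    """Standard average precision over a ranked list of doc ids.
    Unjudged documents are treated as nonrelevant (the usual convention)."""
    num_relevant = sum(1 for rel in qrels.values() if rel == 1)
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant

def induced_ap(ranking, qrels):
    """Simplified judged-only variant: drop unjudged documents from the
    ranked list, then compute average precision over what remains."""
    judged_only = [doc for doc in ranking if doc in qrels]
    return average_precision(judged_only, qrels)

# Toy usage: d5 is unjudged, so the two values differ; with complete
# judgments the judged-only variant reduces to standard average precision.
qrels = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}
run = ["d1", "d5", "d3", "d2", "d4"]
print(average_precision(run, qrels), induced_ap(run, qrels))
```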
References
Allan J (2004) HARD track overview in TREC 2004: High accuracy retrieval from documents. In: Proceedings of the 13th text retrieval conference (TREC 2004)
Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm. In: Callan J, Cormack G, Clarke C, Hawking D, Smeaton A (eds) Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 393–394
Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch, pooling, and system evaluation. In: Frieder O, Hammer J, Qureshi S, Seligman L (eds) Proceedings of the 12th international conference on information and knowledge management. ACM Press, pp 484–491
Aslam JA, Pavlu V, Yilmaz E (2006) A statistical method for system evaluation using incomplete judgments. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, pp 541–548
Buckley C (2006) ‘trec_eval’. http://trec.nist.gov/trec_eval/trec_eval.8.1.tar.gz
Buckley C, Voorhees EM (2004) Retrieval evaluation with incomplete information. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 25–32
Büttcher S, Clarke CLA, Soboroff I (2006) The TREC 2006 terabyte track. In: Proceedings of the 15th Text REtrieval Conference (TREC 2006)
Carterette B, Allan J, Sitaraman R (2006) Minimal test collections for retrieval evaluation. In: SIGIR ’06: proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 268–275
Chen SF, Goodman J (1996) An empirical study of smoothing techniques for language modeling. In: Proceedings of the 34th annual meeting of the association for computational linguistics. Morgan Kaufmann Publishers, San Francisco, pp 310–318
Clarke CLA, Scholer F, Soboroff I (2005) The TREC 2005 terabyte track. In: Proceedings of the 14th Text REtrieval conference (TREC 2005)
Cormack GV, Palmer CR, Clarke CLA (1998) Efficient construction of large test collections. In: Croft et al. (1998), pp 282–289
Croft WB, Moffat A, van Rijsbergen CJ, Wilkinson R, Zobel J (eds) (1998) Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York
Harman D (1995) Overview of the third Text REtrieval Conference (TREC-3). In: Harman D (ed) Overview of the 3rd Text REtrieval Conference (TREC-3), pp 1–19. US Government Printing Office, Washington D.C., Gaithersburg
Hawking D, Robertson S (2003) On collection size and retrieval effectiveness. Inf Retr 6(1): 99–105
Kagolovsky Y, Moehr JR (2003) Current status of the evaluation of information retrieval. J Med Syst 27(5): 409–424
Kraaij W, Over P, Smeaton A (2006) TRECVID 2006—an introduction. In: TREC video retrieval evaluation online proceedings
Kukar M (2006) Quality assessment of individual classifications in machine learning and data mining. Knowl Inf Syst 9(3): 364–384
Raghavan V, Bollmann P, Jung GS (1989) A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans Inf Syst 7(3): 205–229
Tombros A, van Rijsbergen CJ (2004) Query-sensitive similarity measures for information retrieval. Knowl Inf Syst 6(5): 617–642
Voorhees EM (2001) Evaluation by highly relevant documents. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 74–82
Voorhees EM (2002) The philosophy of information retrieval evaluation. In: CLEF ’01: revised papers from the 2nd workshop of the cross-language evaluation forum on evaluation of cross-language information retrieval systems. Springer, London, pp 355–370
Voorhees EM, Harman D (1999) Overview of the 7th Text REtrieval Conference (TREC-7). In: Proceedings of the 7th Text REtrieval Conference (TREC-7), pp 1–24
Yilmaz E, Aslam JA (2006) Estimating average precision with incomplete and imperfect judgments. In: Proceedings of the 15th ACM international conference on information and knowledge management, ACM Press, New York
Zobel J (1998) How reliable are the results of large-scale retrieval experiments? In: Croft et al. (1998), pp 307–314
Additional information
We gratefully acknowledge the support provided by NSF grants CCF-0418390 and IIS-0534482.
About this article
Cite this article
Yilmaz, E., Aslam, J.A. Estimating average precision when judgments are incomplete. Knowl Inf Syst 16, 173–211 (2008). https://doi.org/10.1007/s10115-007-0101-7