Abstract
Despite the availability of benchmark machine learning (ML) repositories (e.g., UCI, OpenML), there is still no standard evaluation strategy capable of indicating which set of datasets should serve as a gold standard for testing different ML algorithms. In recent studies, Item Response Theory (IRT) has emerged as a new approach to elucidate what constitutes a good ML benchmark. This work applied IRT to explore the well-known OpenML-CC18 benchmark and identify how suitable it is for evaluating classifiers. Several classifiers, ranging from classical models to ensembles, were evaluated using IRT models, which simultaneously estimate dataset difficulty and classifier ability. The Glicko-2 rating system was then applied on top of the IRT results to summarize the innate ability and aptitude of the classifiers. It was observed that not all datasets in OpenML-CC18 are truly useful for evaluating classifiers. Most of the datasets evaluated in this work (84%) are composed mostly of easy instances (only around 10% of their instances are difficult). Furthermore, in half of the benchmark, 80% of the instances are highly discriminating, which is of great use for pairwise algorithm comparison but does little to challenge classifiers' abilities. This paper presents this new IRT-based evaluation methodology, as well as decodIRT, a tool developed to guide IRT estimation over ML benchmarks.
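For reference, the logistic IRT model underlying this kind of analysis is commonly written in its three-parameter (3PL) form, where each test instance i is an "item" with difficulty b_i, discrimination a_i, and guessing parameter c_i, and each classifier j is a "respondent" with ability θ_j (the exact model variant fitted by decodIRT may differ):

```latex
% Three-parameter logistic (3PL) IRT model:
% probability that classifier j answers instance i correctly
P(U_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta_j - b_i)}}
```

Below is a minimal sketch of how the binary response matrix consumed by IRT estimation might be assembled with scikit-learn; the dataset and classifiers here are illustrative assumptions, not the paper's experimental setup:

```python
# Sketch (not the decodIRT implementation): build the classifier-by-instance
# response matrix that IRT estimation takes as input. Rows are classifiers
# ("respondents"), columns are test instances ("items"); an entry of 1 means
# the classifier labeled that instance correctly.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classifiers = [
    GaussianNB(),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
]

# response_matrix[j, i] = 1 if classifier j predicted instance i correctly
response_matrix = np.array(
    [clf.fit(X_tr, y_tr).predict(X_te) == y_te for clf in classifiers],
    dtype=int,
)
print(response_matrix.shape)  # (n_classifiers, n_test_instances)
```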
Notes
1. Link to access OpenML-CC18: https://www.openml.org/s/99.
2. Link to the source code: https://github.com/LucasFerraroCardoso/IRT_OpenML.
3. All classification results can be obtained at https://github.com/LucasFerraroCardoso/IRT_OpenML/tree/master/benchmarking.
4. All data generated can be accessed at https://github.com/LucasFerraroCardoso/IRT_OpenML/tree/master/BRACIS.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Cardoso, L.F.F., Santos, V.C.A., Francês, R.S.K., Prudêncio, R.B.C., Alves, R.C.O. (2020). Decoding Machine Learning Benchmarks. In: Cerri, R., Prati, R.C. (eds.) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol. 12320. Springer, Cham. https://doi.org/10.1007/978-3-030-61380-8_28
Print ISBN: 978-3-030-61379-2
Online ISBN: 978-3-030-61380-8