Abstract
The task of learning from imbalanced datasets has been widely investigated in binary, multi-class and multi-label classification scenarios. Although this problem also affects hierarchical datasets, few works in the literature address it. Meanwhile, local classifier approaches are the most widely used techniques for Hierarchical Classification problems. In this paper, we present new ways to handle data imbalance in hierarchical classification problems when using local classifier approaches. We propose three different resampling schemas, one for each local classification approach: (1) Local Classifiers per Node; (2) Local Classifiers per Parent Node; and (3) Local Classifiers per Level. To quantify how imbalanced a given hierarchical dataset is, we also propose three novel imbalance metrics, one for each of the local classification approaches. The experimental evaluation on eight well-known datasets showed that the imbalance metrics can indeed measure the imbalance of the datasets and that the proposed resampling schemas improve the classification results when compared to baselines, state-of-the-art and related work approaches.
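To make the idea of resampling inside a local classifier approach concrete, the sketch below groups training samples by parent node and then balances the child classes seen by each parent's classifier. It is only an illustration under stated assumptions: it uses plain random oversampling rather than the resampling schemas proposed in the paper, it assumes a tree hierarchy with one label path per sample, and the function names (`per_parent_training_sets`, `random_oversample`, `resample_per_parent`) are hypothetical.

```python
import random
from collections import defaultdict


def per_parent_training_sets(samples):
    """Group samples by parent node (Local Classifier per Parent Node setup).

    `samples` is a list of (features, path) pairs, where `path` is a tuple of
    class labels from the top level down to the leaf, e.g. ("rock", "punk").
    Each parent is identified by its path prefix; the root is the empty tuple.
    Returns {parent_prefix: [(features, child_label), ...]}.
    """
    sets = defaultdict(list)
    for features, path in samples:
        for depth, child in enumerate(path):
            sets[path[:depth]].append((features, child))
    return dict(sets)


def random_oversample(pairs, rng):
    """Duplicate randomly chosen samples of the smaller child classes until
    every child class is as large as the largest one.

    Assumption: plain random oversampling, chosen here only for simplicity.
    """
    by_class = defaultdict(list)
    for features, label in pairs:
        by_class[label].append((features, label))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def resample_per_parent(samples, seed=0):
    """Balance the training set of each parent node independently."""
    rng = random.Random(seed)
    grouped = per_parent_training_sets(samples)
    return {parent: random_oversample(pairs, rng) for parent, pairs in grouped.items()}


# Toy data: three "rock" samples and one "jazz" sample.
samples = [
    ([0.1], ("rock", "punk")),
    ([0.2], ("rock", "punk")),
    ([0.3], ("rock", "metal")),
    ([0.4], ("jazz", "bebop")),
]
balanced = resample_per_parent(samples)
```

In this toy example the root classifier's training set is balanced between "rock" and "jazz", while the classifier under "rock" is balanced between "punk" and "metal" independently of the root. Resampling each local training set separately is the design point being illustrated; a per-node or per-level variant would change only the grouping step.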
Notes
Available at http://sites.labic.icmc.usp.br/jeanmetz/datasets.html.
Available at https://github.com/mdeff/fma.
Available at https://www.imageclef.org/2009/medanno.
Available at http://lshtc.iit.demokritos.gr/.
Available at https://dtai.cs.kuleuven.be/clus/.
Available at https://cs.gmu.edu/~mlbio/HierCost/.
Available at http://scikit-learn.org/.
Available at https://github.com/tsoumakas/mulan/.
Available at https://github.com/scikit-learn-contrib/imbalanced-learn.
Available at https://github.com/rodolfomp123/imb-mulan.
Acknowledgements
We thank the Brazilian Research Support Agencies: Coordination for the Improvement of Higher Education Personnel (CAPES), National Council for Scientific and Technological Development (CNPq) and Araucaria Foundation (FA) for their financial support. We also thank the anonymous reviewers and the Action Editor Grigorios Tsoumakas for their valuable feedback on the earlier versions of this manuscript.
Additional information
Responsible editor: Grigorios Tsoumakas.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A
This appendix presents all the tables of classification and metric results generated in the experiments of this work, which were summarized as charts in the main part of the paper. In Tables 24–27, the lines in italics represent the average ranking of the approaches. Besides the raw results, we also present the tables of the statistical tests applied to the results, which provide the statistical support for the answers given in the Analysis and Discussion section (Tables 7–32).
Cite this article
Pereira, R.M., Costa, Y.M.G. & Silla, C.N. Handling imbalance in hierarchical classification problems using local classifiers approaches. Data Min Knowl Disc 35, 1564–1621 (2021). https://doi.org/10.1007/s10618-021-00762-8
DOI: https://doi.org/10.1007/s10618-021-00762-8