Abstract
Data used in particle physics analyses are inherently imbalanced: the events of interest are rare relative to a broad background. Identifying these events within the bulk requires intensive computation and sophisticated analysis techniques. Classification algorithms from supervised machine learning (ML) can be used to interpret skewed particle datasets as an alternative to classical techniques, even for analyses involving multiple particle states. In this study, the ground state of bottomonium (\(\varUpsilon \)(1S)) and its excited states (\(\varUpsilon \)(2S) and \(\varUpsilon \)(3S)) were studied with a multiclass classification approach based on the random forest classifier (RFC), a novel example of an ML approach in particle analysis, combined with resampling techniques for preprocessing the dataset and with a modified weighting strategy. For this purpose, five widely used oversampling strategies and two hybrid strategies, which combine over- and undersampling, were applied together with the RFC. In addition, an RFC with class weights, the weighted random forest (WRF), was used in the analysis. Because of the data structure, the performance of the applied models was evaluated with metrics derived from the confusion matrix. The results show that hybrid techniques combined with the RFC are suitable for handling highly imbalanced classes. The G-mean and balanced accuracy (BAcc) scores of the upsilon states show that the model reached its highest classification performance, around 90\(\%\), with the SMOTETomek strategy, along with high sensitivity, which indicates that the approach succeeds in multiclass classification.
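The confusion-matrix-derived scores named in the abstract can be illustrated with a minimal sketch. It assumes the common one-vs-rest definitions, in which the G-mean of a class is the square root of sensitivity times specificity and BAcc is their arithmetic mean; the study itself may use a different convention. The confusion matrix, the class order and the function name `one_vs_rest_scores` are hypothetical and are not taken from the analysis.

```python
import math


def one_vs_rest_scores(cm, k):
    """Return (sensitivity, specificity, G-mean, BAcc) for class k.

    cm[i][j] counts events of true class i predicted as class j.
    Assumes class k and its complement each contain at least one
    true event, so neither denominator is zero.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]                          # class k correctly identified
    fn = sum(cm[k]) - tp                   # class k assigned elsewhere
    fp = sum(row[k] for row in cm) - tp    # other classes assigned to k
    tn = total - tp - fn - fp              # everything else
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    bacc = (sensitivity + specificity) / 2
    return sensitivity, specificity, g_mean, bacc


# Hypothetical counts (rows: true class, columns: predicted class).
labels = ["background", "Y(1S)", "Y(2S)", "Y(3S)"]
cm = [
    [900, 40, 30, 30],
    [5, 45, 0, 0],
    [4, 0, 26, 0],
    [3, 0, 0, 17],
]

scores = {name: one_vs_rest_scores(cm, k) for k, name in enumerate(labels)}
for name, (sens, spec, g_mean, bacc) in scores.items():
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f} "
          f"G-mean={g_mean:.3f} BAcc={bacc:.3f}")
```

Both scores weight the rare class and the abundant remainder equally, which is why they are preferred over plain accuracy when the background dominates the sample.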



Acknowledgements
The author acknowledges the support from the Scientific and Technological Research Council of Turkey (TUBITAK) Project No. 119F302.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Yalcin Kuzu, S. Random Forest Based Multiclass Classification Approach for Highly Skewed Particle Data. J Sci Comput 95, 21 (2023). https://doi.org/10.1007/s10915-023-02144-2
DOI: https://doi.org/10.1007/s10915-023-02144-2
Keywords
- Imbalanced dataset
- Multiclass classification
- Random forest classifier
- Resampling
- Upsilon states
- Weighted random forest classifier