Abstract
Data stream mining and learning from imbalanced data have both been active research areas in recent years. Solutions exist for each problem, but most are not designed to handle the challenges that arise when the two occur together. To the best of our knowledge, the few approaches to learning from imbalanced data streams address classification, and no work on regression has been reported yet. This paper proposes a technique that uses sampling strategies to cope with imbalanced data streams in a regression setting, where the most important cases have rare and extreme target values. Specifically, we employ under-sampling and over-sampling strategies that use the value given by Chebyshev’s inequality as a heuristic to decide whether an incoming case is frequent or rare. We evaluated our proposal by training models with four well-known regression algorithms over fourteen benchmark data sets, in a series of experiments with different setups on both synthetic and real-world data. The experimental results confirm the effectiveness of our approach: the models trained with each of the sampling strategies outperform their baseline counterparts.
Acknowledgements
This research was funded by national funds through FCT - Science and Technology Foundation, in the context of the project FailStopper (DSAIPA/DS/0086/2018).
Additional information
Responsible editor: Johannes Fürnkranz.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Experimental evaluation tables
1.1 More about the data sets
Because the main text refers to the data sets by short names, Table 8 reports the complete name, source, and number of attributes of each data set to make the data sets easier to find.
1.2 Results of the experiments: sensitivity analysis varying \(\phi \)
Tables 11, 12, 13, 14, 15 and 16 show the results of the paired comparisons between the four learners trained with our under-sampling and over-sampling strategies, ChebyUS and ChebyOS, and their baseline versions. Each cell reports the value of the error function (i.e. \(RMSE_{\phi }\) or RMSE) for the predictions of the learner named in the column header on the data set named in the corresponding row, given as the average and standard deviation over ten rounds of experiments. The symbols \(\triangleright \) and \(\triangleleft \) mark differences that are statistically significant at the 95% level: \(\triangleright \) indicates that the ChebyUS or ChebyOS method is significantly better than the baseline, and \(\triangleleft \) that it is significantly worse. Because results are reported for different levels of \(thr_\phi \), Tables 9 and 10 give the number and percentage of rare cases observed in each data set.
Chebyshev’s probability, K value and relevance function \(\phi ()\) graphs for each data set
The following figures show, for each example of each data set, the value of the relevance function \(\phi ()\), the probability assigned by Algorithm 1, and the K value assigned by Algorithm 2, together with the box plot of the data set. The red points on the graphs indicate the cases considered rare in each data set.
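To make the two plotted quantities more concrete, the following is a minimal Python sketch of how a Chebyshev-based probability and K value could be computed for a stream of target values. Algorithms 1 and 2 are not reproduced in this section, so the specific rules used here are assumptions for illustration only: a selection probability of \(1 - 1/t^2\) for \(t > 1\) (and 0 otherwise) and a replication count \(K = \lceil t \rceil \), where \(t\) is the distance of the target value from the running mean in units of the running standard deviation. Both follow from Chebyshev’s inequality, \(P(|Y - \mu | \ge t\sigma ) \le 1/t^2\). The names `RunningStats`, `chebyshev_t`, `selection_probability` and `replication_count` are hypothetical.

```python
import math


class RunningStats:
    """Incremental mean and population variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, y):
        self.n += 1
        delta = y - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (y - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / self.n) if self.n > 1 else 0.0


def chebyshev_t(y, stats):
    """Distance of y from the running mean, in standard deviations."""
    if stats.std == 0.0:
        return 0.0
    return abs(y - stats.mean) / stats.std


def selection_probability(t):
    """Assumed under-sampling rule: 1 - 1/t^2 when t > 1, otherwise 0."""
    return 1.0 - 1.0 / (t * t) if t > 1.0 else 0.0


def replication_count(t):
    """Assumed over-sampling rule: K = ceil(t), with a minimum of 1."""
    return max(1, math.ceil(t))


if __name__ == "__main__":
    stats = RunningStats()
    for y in [1.0, 2.0, 3.0, 4.0, 5.0]:
        stats.update(y)
    for y in [3.0, 10.0]:
        t = chebyshev_t(y, stats)
        print(y, round(selection_probability(t), 3), replication_count(t))
```

Under these assumed rules, a target value close to the running mean receives a probability of 0 and K = 1, while a value far from the mean receives a probability close to 1 and a larger K, which is the pattern that separates frequent cases from rare ones.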
Cite this article
Aminian, E., Ribeiro, R.P. & Gama, J. Chebyshev approaches for imbalanced data streams regression models. Data Min Knowl Disc 35, 2389–2466 (2021). https://doi.org/10.1007/s10618-021-00793-1
DOI: https://doi.org/10.1007/s10618-021-00793-1