Abstract
We present two novel theorems for estimating the weight parameter in (weighted) kNN when dealing with imbalanced data: one for G-mean and one for F1-score. The theorems assume a ‘totally random’ distribution, i.e. no dependency between the features and the class value.
These results can be used to set the default class weights for kNN-type classifiers, e.g. in imbalanced learning problems. Moreover, taken together, the theorems illustrate that without a precise specification of the performance measure we are interested in, the term ‘best classifier’ can be ambiguous or even misleading.
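As a point of reference for how a class-weight parameter p can enter kNN voting, the following is a minimal sketch, not the authors' construction: the labels (0 = majority, 1 = minority), the Euclidean metric and the weighting scheme (p for the majority class, 1 − p for the minority class) are illustrative assumptions only.

```python
# Minimal sketch of class-weighted kNN voting for a binary imbalanced problem.
# The convention that p weights the majority class (label 0) and 1 - p weights
# the minority class (label 1) is an assumption made for illustration.
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, p=0.3):
    """Predict the class of x: each of the k nearest neighbours votes with a
    weight that depends on its class label."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of k nearest neighbours
    class_weights = {0: p, 1: 1.0 - p}            # hypothetical weighting scheme
    votes = {0: 0.0, 1: 0.0}
    for i in nearest:
        label = int(y_train[i])
        votes[label] += class_weights[label]
    return max(votes, key=votes.get)
```

Under this convention, choosing p below 0.5 gives each minority neighbour a larger vote than a majority neighbour, which is the regime discussed in the notes below.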
Notes
1. It corresponds to the case of the \(F_\beta\)-measure with \(\beta = 1\).
2. \(I(C(x)=d) = 1 - L(C(x), d)\), where \(L\) is the 0–1 loss function (equal to 0 if \(C\) correctly classifies the given example \((x, d)\), and 1 otherwise).
3. Formally, \(\Pr_{(x,d)\sim \mathcal{D}}(Alg(trnSet)(x)=d)\) is a random variable, where \(trnSet\) is fixed. It can also be seen as the conditional probability on \(\mathcal{D}^{n} \times \mathcal{D}\) given a training set \(trnSet\), i.e. \(\Pr_{trnSet\sim \mathcal{D}^{n},\, (x,d)\sim \mathcal{D}}(Alg(trnSet)(x)=d \mid trnSet)\).
4. Here and later for the F1-score, we determine the average performance using the averaged values of the submeasures. One could also consider averaging these performance measures directly, which seems more appropriate; in that case, however, the analogous computations in the proofs appear to be a much harder mathematical task. The sketch after these notes contrasts the two aggregation orders.
5. In practice, such an assumption is satisfied for a wide range of values of the parameter p.
6. This assumption is important only from a technical point of view. If it is not satisfied, we should consider the percentage of the minority class coming from the borderline region (the value q from the theorem).
7. Only the mammography data set is not publicly available; it was provided by Nitesh Chawla [1].
8. It is reasonable that for imbalanced data p should not be greater than 0.5, since the minority class should have a greater weight than the majority class.
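To make the distinction in note 4 concrete, the sketch below contrasts the two aggregation orders for G-mean over cross-validation folds; the fold-wise sensitivity and specificity values are invented purely for illustration.

```python
# Two ways to aggregate G-mean across folds (note 4). The numbers are made up.
import numpy as np

sens = np.array([0.60, 0.80, 0.70])   # per-fold sensitivity (minority-class recall)
spec = np.array([0.90, 0.85, 0.95])   # per-fold specificity

# Variant used in the notes: average the submeasures first, then combine.
g_mean_from_averages = np.sqrt(sens.mean() * spec.mean())

# Alternative: compute G-mean per fold, then average the measure directly.
g_mean_averaged_directly = np.sqrt(sens * spec).mean()

print(g_mean_from_averages, g_mean_averaged_directly)  # generally not equal
```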
References
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). https://doi.org/10.1613/jair.953
Domingos, P.: Unifying instance-based and rule-based induction. Mach. Learn. 24(2), 141–168 (1996). https://doi.org/10.1007/BF00058656
Dubey, H., Pudi, V.: Class based weighted k-nearest neighbor over imbalance dataset. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 305–316. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_26
Góra, G., Skowron, A., Wojna, A.: Explainability in RIONA algorithm combining rule induction and instance-based learning. In: Ganzha, M., Maciaszek, L.A., Paprzycki, M., Ślȩzak, D. (eds.) Proceedings of the 18th Conference on Computer Science and Intelligence Systems, FedCSIS 2023, Warsaw, Poland, September 17–20, 2023. Annals of Computer Science and Information Systems, vol. 31, pp. 485–496. IEEE (2023). https://annals-csis.org/proceedings/2023/
Góra, G.: Combining instance-based learning and rule-based methods for imbalanced data. Ph.D. thesis, University of Warsaw, Warsaw (2022). https://www.mimuw.edu.pl/sites/default/files/gora_grzegorz_rozprawa_doktorska.pdf
Hamza, K.: The smallest uniform upper bound on the distance between the mean and the median of the binomial and Poisson distributions. Stat. Probab. Lett. 23(1), 21–25 (1995). https://doi.org/10.1016/0167-7152(94)00090-U
He, H., Ma, Y.: Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press, Piscataway, NJ, 1st edn. (2013). https://doi.org/10.1002/9781118646106
Lichman, M.: UCI Machine Learning Repository (2013). http://archive.ics.uci.edu/ml
Maennel, H., et al.: What do neural networks learn when trained with random labels? (2020). https://doi.org/10.48550/ARXIV.2006.10455
Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
Napierała, K.: Improving Rule Classifiers For Imbalanced Data. Ph.D. thesis, Poznań University of Technology, Poznań (2012)
Napierała, K., Stefanowski, J.: Types of minority class examples and their influence on learning classifiers from imbalanced data. J. Intell. Inf. Syst. 46(3), 563–597 (2016). https://doi.org/10.1007/s10844-015-0368-1
Raeder, T., Forman, G., Chawla, N.V.: Learning from imbalanced data: evaluation matters. In: Holmes, D.E., Jain, L.C. (eds.) Data Mining: Foundations and Intelligent Paradigms. Intelligent Systems Reference Library, vol. 23, pp. 315–331. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-23166-7_12
Wang, L., Khan, L., Thuraisingham, B.: An effective evidence theory based k-nearest neighbor (KNN) classification. In: 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, vol. 1, pp. 797–801. IEEE (2008)
Wolpert, D.H.: The supervised learning no-free-lunch theorems. In: Roy, R., Köppen, M., Ovaska, S., Furuhashi, T., Hoffmann, F. (eds.) Soft Computing and Industry, pp. 25–42. Springer, London (2002). https://doi.org/10.1007/978-1-4471-0123-9_3
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Góra, G., Skowron, A. (2023). On kNN Class Weights for Optimising G-Mean and F1-Score. In: Campagner, A., Urs Lenz, O., Xia, S., Ślęzak, D., Wąs, J., Yao, J. (eds) Rough Sets. IJCRS 2023. Lecture Notes in Computer Science, vol 14481. Springer, Cham. https://doi.org/10.1007/978-3-031-50959-9_29
DOI: https://doi.org/10.1007/978-3-031-50959-9_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-50958-2
Online ISBN: 978-3-031-50959-9