Abstract
Angular Minkowski p-distance is a dissimilarity measure obtained by replacing Euclidean distance in the definition of cosine dissimilarity with other Minkowski p-distances. Cosine dissimilarity is frequently used with datasets containing token frequencies, and angular Minkowski p-distance may be an even better choice for certain tasks. In a case study based on the 20-newsgroups dataset, we evaluate classification performance for classical weighted nearest neighbours as well as fuzzy rough nearest neighbours. In addition, we analyse the relationship between the hyperparameter p, the dimensionality m of the dataset, the number of neighbours k, the choice of weights and the choice of classifier. We conclude that, with suitable values for p, angular Minkowski p-distance can achieve substantially higher classification performance than classical cosine dissimilarity.
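As an illustration of the construction described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes one plausible reading of the definition: each vector is rescaled to unit p-norm, and the Minkowski p-distance is then taken between the rescaled vectors. For p = 2 the squared value equals 2(1 - cos(x, y)), so the measure orders pairs in the same way as cosine dissimilarity. The function names `p_norm` and `angular_minkowski` are hypothetical.

```python
import math


def p_norm(x, p):
    """Minkowski p-norm of a sequence of numbers (p > 0)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def angular_minkowski(x, y, p=2.0):
    """Assumed form of angular Minkowski p-distance.

    Both vectors are rescaled to unit p-norm, then the Minkowski
    p-distance between the rescaled vectors is returned. This is an
    assumption based on the abstract, not a confirmed definition.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    nx, ny = p_norm(x, p), p_norm(y, p)
    if nx == 0 or ny == 0:
        raise ValueError("the zero vector has no direction")
    return p_norm([a / nx - b / ny for a, b in zip(x, y)], p)


# Two toy token-frequency vectors (illustrative values only).
doc_a = [3, 0, 1]
doc_b = [1, 2, 0]

# Compare a few values of the hyperparameter p.
for p in (0.5, 1.0, 2.0):
    print(p, angular_minkowski(doc_a, doc_b, p))
```

Because the vectors are rescaled before the distance is taken, multiplying either vector by a positive constant leaves the result unchanged, which is the property that makes angular measures suited to token frequencies from documents of different lengths.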
Acknowledgements
The research reported in this paper was conducted with the financial support of the Odysseus programme of the Research Foundation – Flanders (FWO).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lenz, O.U., Cornelis, C. (2023). Classifying Token Frequencies Using Angular Minkowski p-Distance. In: Campagner, A., Urs Lenz, O., Xia, S., Ślęzak, D., Wąs, J., Yao, J. (eds) Rough Sets. IJCRS 2023. Lecture Notes in Computer Science, vol 14481. Springer, Cham. https://doi.org/10.1007/978-3-031-50959-9_28
DOI: https://doi.org/10.1007/978-3-031-50959-9_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-50958-2
Online ISBN: 978-3-031-50959-9
eBook Packages: Computer Science (R0)