Abstract
In previous research, the authors found that the spectrogram of eigenvalues of the combinatorial Laplacian of the document similarity matrix is relevant for tasks such as graph spectral classification and clustering. This paper investigates the hypothesis that this property can be attributed to a specific “style” of writing, that is, to the distribution of words in the documents belonging to a given category. The investigation is performed by generating artificial documents from a predefined parameterized word distribution. The document similarity matrices are computed and the spectrum of the corresponding combinatorial Laplacian is examined. The parameters are varied, and we present their impact on the shape of the spectrogram.
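The experimental pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the word distribution, the similarity measure, or any parameter values, so everything below is an assumption made for illustration: a Zipf-like distribution with a single exponent parameter, cosine similarity on bag-of-words counts with the diagonal set to zero, and the hypothetical helper names `zipf_probabilities`, `generate_documents`, `cosine_similarity_matrix` and `laplacian_spectrum`. The combinatorial Laplacian is taken in its standard form L = D - S, where D is the diagonal matrix of row sums of the similarity matrix S.

```python
import numpy as np


def zipf_probabilities(vocab_size, exponent):
    """Zipf-like word probabilities, p(rank) proportional to 1 / rank**exponent.

    Assumption: the paper's parameterized distribution may differ.
    """
    ranks = np.arange(1, vocab_size + 1)
    weights = 1.0 / ranks ** exponent
    return weights / weights.sum()


def generate_documents(n_docs, doc_length, probs, rng):
    """Draw artificial documents as bag-of-words count vectors (one row each)."""
    return rng.multinomial(doc_length, probs, size=n_docs)


def cosine_similarity_matrix(counts):
    """Cosine similarity between documents; diagonal zeroed (assumed: no self-loops)."""
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    unit = counts / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    return sim


def laplacian_spectrum(sim):
    """Eigenvalues of the combinatorial Laplacian L = D - S, in ascending order."""
    degree = np.diag(sim.sum(axis=1))
    laplacian = degree - sim
    return np.linalg.eigvalsh(laplacian)


rng = np.random.default_rng(0)
# Vary the distribution parameter and inspect how the sorted eigenvalues change.
for exponent in (0.8, 1.0, 1.2):
    probs = zipf_probabilities(vocab_size=500, exponent=exponent)
    docs = generate_documents(n_docs=40, doc_length=200, probs=probs, rng=rng)
    spectrum = laplacian_spectrum(cosine_similarity_matrix(docs))
    print(f"exponent={exponent}: smallest={spectrum[0]:.3g}, largest={spectrum[-1]:.3g}")
```

Plotting each `spectrum` against the eigenvalue index gives one curve per parameter setting, which is one plausible reading of what the abstract calls the spectrogram. The smallest eigenvalue of a combinatorial Laplacian is always zero up to rounding, so the informative part is the shape of the remaining curve.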
Acknowledgments
This study was funded by the Polish Ministry of Science.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kłopotek, M.A., Wierzchoń, S.T., Starosta, B., Czerski, D., Borkowski, P. (2024). Towards Explaining the Spectrogram of Graph Spectral Clustering in Text Document Domain. In: Saeed, K., Dvorský, J. (eds) Computer Information Systems and Industrial Management. CISIM 2024. Lecture Notes in Computer Science, vol 14902. Springer, Cham. https://doi.org/10.1007/978-3-031-71115-2_26
DOI: https://doi.org/10.1007/978-3-031-71115-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-71114-5
Online ISBN: 978-3-031-71115-2
eBook Packages: Computer Science, Computer Science (R0)