Abstract
We propose a new document classification method that bridges discrepancies (the so-called semantic gap) between the training set and the application sets of textual data. The method combines a document categorization technique with a single classifier or a classifier ensemble. We demonstrate that it outperforms classical text classification approaches, including traditional classifier ensembles.
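The idea of placing a categorization step in front of a classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the term-to-category map `TERM_CATEGORIES`, the functions `categorize`, `train` and `predict`, and the toy overlap-based classifier are all hypothetical stand-ins chosen for illustration. The sketch only shows why classifying on semantic categories rather than raw words may help when training and application documents use different vocabulary.

```python
from collections import Counter

# Hypothetical term-to-category map standing in for a real document
# categorizer; the entries are illustrative assumptions only.
TERM_CATEGORIES = {
    "football": "sport", "goal": "sport",
    "basketball": "sport", "dunk": "sport",
    "election": "politics", "senate": "politics",
    "parliament": "politics", "ballot": "politics",
}

def categorize(text):
    """Map a document to a bag of semantic categories instead of raw words."""
    words = text.lower().split()
    return Counter(TERM_CATEGORIES[w] for w in words if w in TERM_CATEGORIES)

def train(docs, labels):
    """Accumulate category counts per class label (a toy profile classifier)."""
    profiles = {}
    for text, label in zip(docs, labels):
        profiles.setdefault(label, Counter()).update(categorize(text))
    return profiles

def predict(profiles, text):
    """Return the label whose category profile overlaps most with the document."""
    cats = categorize(text)

    def score(label):
        return sum(cats[c] * profiles[label][c] for c in cats)

    # sorted() makes tie-breaking deterministic
    return max(sorted(profiles), key=score)

train_docs = ["football goal scored late", "senate election results"]
train_labels = ["sports", "politics"]
model = train(train_docs, train_labels)

# The application documents share no content words with the training set,
# yet they map to the same semantic categories.
print(predict(model, "basketball dunk highlights"))
print(predict(model, "parliament ballot counted"))
```

In this toy setting a purely word-based classifier would find no overlap between the training and application documents, while the category-based representation still separates them.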
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Borkowski, P., Ciesielski, K., Kłopotek, M.A. (2020). Semantic Classifier Approach to Document Classification. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2020. Lecture Notes in Computer Science, vol 12415. Springer, Cham. https://doi.org/10.1007/978-3-030-61401-0_61
DOI: https://doi.org/10.1007/978-3-030-61401-0_61
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61400-3
Online ISBN: 978-3-030-61401-0
eBook Packages: Computer Science, Computer Science (R0)