Abstract
In this paper we present a simple approach to Vietnamese text classification that requires no word segmentation and is based on frequent subgraph mining. Documents are represented with a graph-based model instead of the traditional vector-based model. The classification model uses structural patterns (subgraphs) and the Dice similarity measure to identify the class of a document. The method is evaluated for classification accuracy on a Vietnamese data set. Results show that it can outperform the k-NN algorithm (with vector-based and hybrid document representations) in both accuracy and classification time.
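The abstract does not give the exact formulation, so the following is only a minimal sketch of how a Dice-based comparison of subgraph patterns might look. It assumes the standard Dice coefficient 2|A ∩ B| / (|A| + |B|) over sets of patterns, and it assumes each pattern is encoded as a hashable value (here, a directed edge between two syllables). The names `dice`, `classify`, and the sample data are hypothetical and not taken from the paper.

```python
from typing import Dict, Hashable, Set

Pattern = Hashable  # assumption: any hashable encoding of a subgraph


def dice(a: Set[Pattern], b: Set[Pattern]) -> float:
    """Standard Dice coefficient 2|A & B| / (|A| + |B|); 0.0 if both sets are empty."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total


def classify(doc: Set[Pattern], classes: Dict[str, Set[Pattern]]) -> str:
    """Return the label whose pattern set is most similar to the document's."""
    return max(classes, key=lambda label: dice(doc, classes[label]))


# Hypothetical patterns: single edges between adjacent syllables.
class_patterns = {
    "sports": {("bóng", "đá"), ("cầu", "thủ"), ("trận", "đấu")},
    "economy": {("thị", "trường"), ("chứng", "khoán"), ("lãi", "suất")},
}
document = {("bóng", "đá"), ("trận", "đấu"), ("thị", "trường")}

print(classify(document, class_patterns))
```

Because the comparison works on patterns built directly from syllables, a sketch like this needs no word segmentation step; how the frequent subgraphs are actually mined is outside its scope.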
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nguyen, T.A.H., Hoang, K. (2009). Frequent Subgraph-Based Approach for Classifying Vietnamese Text Documents. In: Filipe, J., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2009. Lecture Notes in Business Information Processing, vol 24. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01347-8_25
DOI: https://doi.org/10.1007/978-3-642-01347-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01346-1
Online ISBN: 978-3-642-01347-8
eBook Packages: Computer Science, Computer Science (R0)