Abstract
Multilingual document classification is often addressed with approaches that rely on language-specific resources (e.g., bilingual dictionaries and machine translation tools) to evaluate cross-lingual document similarities. However, the required transformations may alter the original document semantics, which adds to the known difficulty of obtaining high-quality labeled datasets. To overcome these issues, we propose a new framework for multilingual document classification under a transductive learning setting. We exploit a large-scale multilingual knowledge base, BabelNet, to model documents written in different languages in a common conceptual space, without requiring any translation process. We then use a state-of-the-art transductive learner to classify the documents. Results on two real-world multilingual corpora highlight the effectiveness of the proposed document model with respect to document representations commonly used in multilingual and cross-lingual analysis, and the robustness of the transductive setting for multilingual document classification.
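The two ideas in the abstract, a language-independent concept space and transductive classification, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `TOY_LEXICON` is a hypothetical hand-written stand-in for a multilingual knowledge base (it is not the BabelNet API), the concept identifiers are invented, and the simple similarity-weighted label propagation is a swapped-in technique, not the transductive learner used in the paper.

```python
from collections import Counter
import math

# Hypothetical toy lexicon: (language, word) -> language-independent concept id.
# A real system would obtain concepts from a multilingual knowledge base.
TOY_LEXICON = {
    ("en", "bank"): "c:financial_institution",
    ("it", "banca"): "c:financial_institution",
    ("en", "loan"): "c:loan",
    ("it", "prestito"): "c:loan",
    ("en", "match"): "c:sport_match",
    ("it", "partita"): "c:sport_match",
    ("en", "goal"): "c:goal",
    ("it", "gol"): "c:goal",
}


def bag_of_concepts(lang, text):
    """Map a document to concept counts; words missing from the lexicon are skipped."""
    tokens = text.lower().split()
    return Counter(TOY_LEXICON[(lang, t)] for t in tokens if (lang, t) in TOY_LEXICON)


def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propagate_labels(vectors, seed_labels, iterations=20):
    """Spread the seed labels to unlabeled documents over a similarity graph.

    seed_labels maps a document index to its known class; seeds stay clamped.
    """
    classes = sorted(set(seed_labels.values()))
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    scores = [{c: 1.0 if seed_labels.get(i) == c else 0.0 for c in classes}
              for i in range(n)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(sim[i])
            if i in seed_labels or total == 0:
                updated.append(scores[i])  # keep seeds and isolated documents as they are
                continue
            updated.append({c: sum(sim[i][j] * scores[j][c] for j in range(n)) / total
                            for c in classes})
        scores = updated
    return [max(classes, key=lambda c: scores[i][c]) for i in range(n)]


# Two labeled and two unlabeled documents, in English and Italian.
docs = [("en", "bank loan"), ("it", "partita gol"),
        ("it", "banca prestito"), ("en", "match goal")]
vectors = [bag_of_concepts(lang, text) for lang, text in docs]
predicted = propagate_labels(vectors, {0: "economy", 1: "sport"})
print(predicted)
```

In this toy setting the Italian document "banca prestito" shares concepts with the labeled English document "bank loan" although the two have no surface words in common, which is the property a common conceptual space is meant to provide. The labeled and unlabeled documents are processed together, as in a transductive setting.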
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Romeo, S., Ienco, D., Tagarelli, A. (2015). Knowledge-Based Representation for Transductive Multilingual Document Classification. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham. https://doi.org/10.1007/978-3-319-16354-3_11
DOI: https://doi.org/10.1007/978-3-319-16354-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16353-6
Online ISBN: 978-3-319-16354-3
eBook Packages: Computer Science (R0)