Abstract
Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinguishing similar languages, (2) detecting multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on developing a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task sheds light on the shortcomings of state-of-the-art language identification systems, and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study these issues in language identification within a common setting that enables results to be compared with one another.
Acknowledgments
This work has been supported by the following projects: PHEME FP7 project (Grant No. 611233), QTLeap FP7 project (Grant No. 610516), Spanish MICINN projects Tacardi (Grant No. TIN2012-38523-C02-01) and Skater (Grant No. TIN2012-38584-C06-01), Galician HPCPLN project (Grant No. EM13/041), and Celtic (Innterconecta program, Grant No. 2012-CE138).
Cite this article
Zubiaga, A., Vicente, I.S., Gamallo, P. et al. TweetLID: a benchmark for tweet language identification. Lang Resources & Evaluation 50, 729–766 (2016). https://doi.org/10.1007/s10579-015-9317-4
DOI: https://doi.org/10.1007/s10579-015-9317-4