Abstract
Manually annotated corpora for low-resource languages are usually small in size (gold) or large but distantly supervised (silver). Inspired by recent progress in injecting pre-trained language models (LMs) into many Natural Language Processing (NLP) tasks, we propose to fine-tune a language model pre-trained on a high-resource language to a low-resource language in order to improve performance in both scenarios. Our experiments show significant improvements when fine-tuning a pre-trained LM in the cross-lingual transfer setting on the small gold corpus, and competitive results on the large silver corpus compared with supervised cross-lingual transfer, which is useful when no parallel annotation for the same task is available to begin with. We compare our proposed cross-lingual transfer with a pre-trained LM against other sources of transfer, such as a monolingual LM and Part-of-Speech (POS) tagging, on the downstream task of both the large silver and the small gold NER datasets, exploiting the character-level input of a bidirectional language model.
This work was done while the author was at Kata.ai.
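The transfer recipe described in the abstract can be pictured with a minimal sketch: a character-level bidirectional LM pre-trained on a high-resource language is fine-tuned on the low-resource language, and its hidden states are then fed to an NER tagger. This is an illustrative PyTorch sketch, not the authors' implementation; module names, dimensions, tag counts, and the checkpoint path are assumptions for the example.

```python
# Illustrative sketch of cross-lingual transfer via a character-level biLM.
# Pre-train the biLM on a high-resource language, fine-tune it on the
# low-resource language, then reuse its states for NER. All sizes are toy values.
import torch
import torch.nn as nn

class CharBiLM(nn.Module):
    """Character-level bidirectional language model (simplified, ELMo-style)."""
    def __init__(self, n_chars=256, char_dim=16, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
        self.next_char = nn.Linear(2 * hidden, n_chars)  # LM head, unused for NER

    def forward(self, char_ids):                  # (batch, seq_len)
        states, _ = self.bilstm(self.embed(char_ids))
        return states                             # (batch, seq_len, 2 * hidden)

class NERTagger(nn.Module):
    """NER head on top of the (fine-tuned) biLM states; a CRF could replace softmax."""
    def __init__(self, bilm, n_tags):
        super().__init__()
        self.bilm = bilm
        self.proj = nn.Linear(256, n_tags)        # 2 * hidden from the biLM

    def forward(self, char_ids):
        return self.proj(self.bilm(char_ids))

# 1) Pre-train CharBiLM on high-resource text, then fine-tune it on raw
#    low-resource (e.g. Indonesian) text with the LM objective (not shown).
bilm = CharBiLM()
# bilm.load_state_dict(torch.load("high_resource_bilm.pt"))  # hypothetical checkpoint

# 2) Plug the fine-tuned biLM into the NER tagger and train on gold or silver NER data.
tagger = NERTagger(bilm, n_tags=9)                # e.g. BIO tags for PER/LOC/ORG/MISC
optimizer = torch.optim.Adam(tagger.parameters(), lr=1e-3)
dummy_chars = torch.randint(0, 256, (2, 40))      # toy batch: 2 sequences, 40 characters
dummy_tags = torch.randint(0, 9, (2, 40))
loss = nn.functional.cross_entropy(tagger(dummy_chars).transpose(1, 2), dummy_tags)
loss.backward()
optimizer.step()
```

In this setting the biLM weights can either be frozen (used purely as contextual features) or updated jointly with the tagger; the abstract's fine-tuning scenario corresponds to the latter.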
Acknowledgments
We would like to thank Samuel Louvan, Kemal Kurniawan, Adhiguna Kuncoro, and Rezka Aufar L. for reviewing an early version of this work. We are also grateful to Suci Brooks and Pria Purnama for their relentless support.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Ikhwantri, F. (2023). Cross-Lingual Transfer for Distantly Supervised and Low-Resources Indonesian NER. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0