Abstract
Language is our main communication tool. A deep understanding of its evolution is important for many related research areas, including history, the humanities, and the social sciences. To this end, we study the task of segmenting long-term document archives into naturally coherent periods based on evolving word semantics. Such segmentation offers several benefits, including better representation of content in long-term document collections and support for modeling and understanding semantic drift. We propose a two-step framework for learning time-aware word semantics and periodizing a document archive. We demonstrate the encouraging effectiveness of our model on the New York Times corpus spanning the years 1990 to 2016.
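The abstract describes the two-step framework only at a high level. The sketch below is a minimal, hypothetical illustration of the general idea of periodization by evolving word semantics, not the authors' actual method: it assumes word embeddings trained separately for each year in a shared space (replaced here by random toy vectors), scores semantic change between consecutive years, and places period boundaries where change is largest. Names such as `embeddings_by_year`, `change_signal`, and `segment` are illustrative only.

```python
# Illustrative sketch only (not the paper's algorithm): segment a timeline of
# per-year word embeddings into periods at the years of largest semantic change.
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def change_signal(embeddings_by_year, vocab):
    """Average cosine distance of each word's vector between consecutive years."""
    years = sorted(embeddings_by_year)
    signal = []
    for y_prev, y_next in zip(years, years[1:]):
        dists = [cosine_distance(embeddings_by_year[y_prev][w],
                                 embeddings_by_year[y_next][w])
                 for w in vocab
                 if w in embeddings_by_year[y_prev] and w in embeddings_by_year[y_next]]
        signal.append((y_next, float(np.mean(dists))))
    return signal

def segment(signal, k):
    """Place k period boundaries at the years with the largest change scores."""
    top_k = sorted(signal, key=lambda item: -item[1])[:k]
    return sorted(year for year, _ in top_k)

# Toy usage: random vectors stand in for embeddings learned from the corpus.
rng = np.random.default_rng(0)
vocab = ["market", "web", "climate"]
embeddings_by_year = {year: {w: rng.normal(size=50) for w in vocab}
                      for year in range(1990, 2017)}
print(segment(change_signal(embeddings_by_year, vocab), k=3))
```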
Notes
- 1.
The overall vocabulary V is the union of the vocabularies of the individual time units, so it is possible for some \(w \in V\) to not appear at all in certain time units. This includes emerging words and dying words, which are typical of real-world news corpora.
- 2.
These sections are Arts, Business, Fashion & Style, Health, Home & Garden, Real Estate, Science, Sports, Technology, U.S., World.
Acknowledgement
This paper is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Duan, Y., Jatowt, A., Yoshikawa, M., Liu, X., Matono, A. (2021). Diachronic Linguistic Periodization of Temporal Document Collections for Discovering Evolutionary Word Semantics. In: Ke, H.R., Lee, C.S., Sugiyama, K. (eds.) Towards Open and Trustworthy Digital Societies. ICADL 2021. Lecture Notes in Computer Science, vol. 13133. Springer, Cham. https://doi.org/10.1007/978-3-030-91669-5_1
Print ISBN: 978-3-030-91668-8
Online ISBN: 978-3-030-91669-5