Automatic Detection of Parallel Sentences from Comparable Biomedical Texts

Cardon, Rémi; Grabar, Natalia

doi:10.1007/978-3-031-24337-0_41

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13451)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

Parallel sentences convey semantically similar information while varying along a given dimension, such as language or register. Parallel sentences with register variation (for example, from expert and non-expert documents) can be exploited for automatic text simplification, whose aim is to make information easier to access and understand. In the biomedical field, simplification may help patients understand medical and health texts. Yet no such resources are currently available. We propose to exploit comparable corpora that differ in register (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and relate to the biomedical domain. Our purpose is to decide whether a given pair of specialized and simplified sentences should be aligned. Manually created reference data show an inter-annotator agreement of 0.76. We treat the task as binary classification (alignment/non-alignment) and perform experiments on balanced and imbalanced data. Results on balanced data reach up to 0.96 F-measure. On imbalanced data, the results are lower but remain competitive when using classification models trained on balanced data. Moreover, among the three datasets exploited (semantic equivalence and inclusions), the detection of equivalence pairs performs best.
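As a rough illustration of why F-measure tends to be lower on imbalanced data, the following minimal Python sketch scores a binary alignment decision on two toy label sets. The labels and the helper name `f_measure` are invented for illustration and are not taken from the paper's data or code. Both toy sets have the same recall on aligned pairs and the same false-positive rate on non-aligned pairs; only the proportion of non-aligned pairs changes.

```python
def f_measure(gold, pred, positive=1):
    """Return (precision, recall, F1) for the positive (alignment) class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = pair should be aligned, 0 = should not.
# Balanced set: 4 aligned pairs, 4 non-aligned pairs.
balanced_gold = [1, 1, 1, 1, 0, 0, 0, 0]
balanced_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 miss, 1 false alarm

# Imbalanced set: 4 aligned pairs, 16 non-aligned pairs.
# Same recall (3 of 4) and same false-positive rate (25%) as above.
imbalanced_gold = [1, 1, 1, 1] + [0] * 16
imbalanced_pred = [1, 1, 1, 0] + [1] * 4 + [0] * 12

_, _, f_balanced = f_measure(balanced_gold, balanced_pred)
_, _, f_imbalanced = f_measure(imbalanced_gold, imbalanced_pred)

print(f"balanced F1:   {f_balanced:.2f}")    # 0.75
print(f"imbalanced F1: {f_imbalanced:.2f}")  # 0.55
```

With identical per-class error rates, the larger pool of non-aligned pairs produces more false alarms in absolute terms, which lowers precision and therefore the F-measure of the alignment class.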

Acknowledgements

This work was funded by the French National Agency for Research (ANR) as part of the CLEAR project (Communication, Literacy, Education, Accessibility, Readability), ANR-17-CE19-0016-01.

Author information

Authors and Affiliations

CNRS, UMR 8163, 59000, Lille, France
Rémi Cardon & Natalia Grabar
Univ. Lille, UMR 8163 - STL - Savoirs Textes Langage, 59000, Lille, France
Rémi Cardon & Natalia Grabar

Corresponding author

Correspondence to Rémi Cardon.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Cardon, R., Grabar, N. (2023). Automatic Detection of Parallel Sentences from Comparable Biomedical Texts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_41

DOI: https://doi.org/10.1007/978-3-031-24337-0_41
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science, Computer Science (R0)
