Automatic Detection of Parallel Sentences from Comparable Biomedical Texts

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2019)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13451)

Abstract

Parallel sentences convey semantically similar information that can vary along a given dimension, such as language or register. Parallel sentences with register variation (for example, from expert and non-expert documents) can be exploited for automatic text simplification. The aim of automatic text simplification is to make given information easier to access and understand. In the biomedical field, simplification may help patients understand medical and health texts. Yet no such resources are currently available. We propose to exploit comparable corpora that are distinguished by their registers (specialized and simplified versions) in order to detect and align parallel sentences. These corpora are in French and relate to the biomedical domain. Our purpose is to decide whether a given pair of specialized and simplified sentences should be aligned or not. Manually created reference data show an inter-annotator agreement of 0.76. We treat this task as binary classification (alignment/non-alignment). We perform experiments on balanced and imbalanced data. The results on balanced data reach up to 0.96 F-measure. On imbalanced data, the results are lower but remain competitive when using classification models trained on balanced data. In addition, among the three datasets exploited (semantic equivalence and inclusions), equivalence pairs are detected most effectively.
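
As a reading aid for the evaluation terms used in the abstract, the following is a minimal sketch of how an F-measure for the alignment class can be computed, and of one generic reason why the same classifier may score lower on imbalanced data. The function name and the toy label lists are hypothetical illustrations, not the authors' code, data, or explanation.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Score the positive class (1 = aligned pair, 0 = non-aligned pair)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical balanced set: 4 aligned and 4 non-aligned pairs.
# The toy classifier misses 1 aligned pair and wrongly accepts 1 of 4 negatives.
balanced_gold = [1, 1, 1, 1, 0, 0, 0, 0]
balanced_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Hypothetical imbalanced set: same 4 aligned pairs, but 40 non-aligned pairs.
# Same per-class error rates: 1 of 4 positives missed, 10 of 40 negatives accepted.
imbalanced_gold = [1, 1, 1, 1] + [0] * 40
imbalanced_pred = [1, 1, 1, 0] + [1] * 10 + [0] * 30

_, _, f1_balanced = precision_recall_f1(balanced_gold, balanced_pred)
_, _, f1_imbalanced = precision_recall_f1(imbalanced_gold, imbalanced_pred)
print(f"balanced F1:   {f1_balanced:.3f}")
print(f"imbalanced F1: {f1_imbalanced:.3f}")
```

In this toy setting the per-class error rates are identical in both sets, yet the F-measure of the alignment class drops on the imbalanced set because the larger pool of non-aligned pairs yields more false positives and therefore lower precision.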

Notes

  1. http://natalia.grabar.free.fr/resources.php#clear.

  2. http://base-donnees-publique.medicaments.gouv.fr/.

  3. http://www.cochranelibrary.com/.

  4. https://fr.wikipedia.org.

  5. https://fr.vikidia.org.

  6. http://simple.wikipedia.org.

Acknowledgements

This work was funded by the French National Agency for Research (ANR) as part of the CLEAR project (Communication, Literacy, Education, Accessibility, Readability), ANR-17-CE19-0016-01.

Author information

Corresponding author

Correspondence to Rémi Cardon.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Cardon, R., Grabar, N. (2023). Automatic Detection of Parallel Sentences from Comparable Biomedical Texts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_41

  • DOI: https://doi.org/10.1007/978-3-031-24337-0_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24336-3

  • Online ISBN: 978-3-031-24337-0

  • eBook Packages: Computer Science, Computer Science (R0)
