Abstract
Parallel sentences convey semantically similar information that varies along a given dimension, such as language or register. Parallel sentences with register variation (for example, from expert and non-expert documents) can be exploited for automatic text simplification, whose aim is to make information easier to access and understand. In the biomedical field, simplification may help patients understand medical and health texts. Yet no such resources are currently available. We propose to exploit comparable corpora distinguished by register (specialized and simplified versions) to detect and align parallel sentences. These corpora are in French and relate to the biomedical domain. Our goal is to decide whether a given pair of specialized and simplified sentences should be aligned. Manually created reference data show an inter-annotator agreement of 0.76. We treat the task as binary classification (alignment/non-alignment) and perform experiments on balanced and imbalanced data. On balanced data, the results reach up to 0.96 F-measure. On imbalanced data, the results are lower but remain competitive when using classification models trained on balanced data. Moreover, among the three datasets exploited (semantic equivalence and inclusions), equivalence pairs are detected most effectively.
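To make the task formulation concrete, the following minimal sketch frames alignment as a binary decision over sentence pairs and scores it with the F-measure. It is an illustration under stated assumptions, not the method used in this work: the `token_overlap` feature, the `classify_pair` function and its `threshold` value are hypothetical stand-ins for a trained classifier and its features.

```python
from typing import List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased tokens (hypothetical stand-in feature)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def classify_pair(specialized: str, simplified: str, threshold: float = 0.3) -> int:
    """Return 1 (alignment) or 0 (non-alignment).

    The threshold rule is an assumption made for illustration; a real
    system would use a classifier trained on annotated pairs.
    """
    return 1 if token_overlap(specialized, simplified) >= threshold else 0


def f_measure(gold: List[int], predicted: List[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class (1 = aligned)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy gold labels and predictions, invented for illustration only.
    gold = [1, 1, 0, 0]
    predicted = [1, 0, 1, 0]
    print(f_measure(gold, predicted))
```

Because the F-measure here is computed on the alignment class only, it is sensitive to class balance: when non-aligned pairs greatly outnumber aligned ones, even a small false-positive rate lowers precision, which is consistent with lower scores on imbalanced data.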
Acknowledgements
This work was funded by the French National Agency for Research (ANR) as part of the CLEAR project (Communication, Literacy, Education, Accessibility, Readability), ANR-17-CE19-0016-01.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Cardon, R., Grabar, N. (2023). Automatic Detection of Parallel Sentences from Comparable Biomedical Texts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_41
DOI: https://doi.org/10.1007/978-3-031-24337-0_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science (R0)