Abstract
The analysis of an individual’s genetic material may uncover genetic variants, which can be classified as disease-causing (pathogenic) or benign. Identifying pathogenic variants among millions of variants relies on searching for evidence in support of or against variant pathogenicity, a process regulated by the American College of Medical Genetics and Genomics (ACMG) guidelines, which leverage data from the scientific literature. Despite recent improvements towards automation, finding evidence of pathogenicity in the literature still requires manual curation, a time-consuming process given the ever-growing number of published papers.
In this work, we built DAVI (Dataset for Automatic Variant Interpretation), a reliable, manually curated dataset comprising articles that either contain (positive) or do not contain (negative) evidence activating two opposing ACMG criteria, namely PS3 and BS3, for a pool of 41 variants. Moreover, we demonstrated that DAVI can be used to train a predictive model that automatically identifies positive (variant, article) associations.
DAVI contains 311 (variant, article) pairs: 154 positive and 157 negative associations. We used three different text representation models, each combined with logistic regression, to identify positive associations, reaching an F1-score of 0.84. The model’s performance constitutes a clear proof of concept for automatic PS3/BS3 evidence identification. DAVI represents a useful resource to train further models.
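To make the idea of a text representation feeding a logistic regression concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not name the three text representation models, so a plain bag-of-words count vector is assumed here, and the logistic regression is a small hand-written batch gradient descent. The four example texts, the helper names (`build_vocab`, `vectorize`, `train_logreg`, `predict`, `f1_score`), and the hyperparameters are all hypothetical placeholders, not DAVI entries or settings from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map every token seen in the texts to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Bag-of-words: count occurrences of each known token, ignore unknown ones."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Label a pair as positive (1) when the predicted probability is >= 0.5."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented toy texts standing in for article content; 1 = positive, 0 = negative.
texts = [
    "functional assay shows loss of enzyme activity",
    "in vitro assay confirms reduced protein function",
    "variant listed in cohort without experimental data",
    "population frequency reported for this variant",
]
labels = [1, 1, 0, 0]

vocab = build_vocab(texts)
X = [vectorize(text, vocab) for text in texts]
w, b = train_logreg(X, labels)
preds = predict(X, w, b)
train_f1 = f1_score(labels, preds)
print("predictions:", preds, "training F1:", train_f1)
```

The score printed here is computed on the same four texts used for fitting, so it only shows that the pieces fit together; it says nothing about generalization and is unrelated to the 0.84 reported above, which would require evaluation on held-out pairs.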
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Longhin, F., Guazzo, A., Longato, E., Ferro, N., Di Camillo, B. (2023). DAVI: A Dataset for Automatic Variant Interpretation. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_8
DOI: https://doi.org/10.1007/978-3-031-42448-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42447-2
Online ISBN: 978-3-031-42448-9
eBook Packages: Computer Science (R0)