DAVI: A Dataset for Automatic Variant Interpretation

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2023)

Abstract

The analysis of an individual’s genetic material may uncover genetic variants, which can be classified as disease-causing (pathogenic) or benign. Identifying pathogenic variants among millions of variants relies on searching for evidence for or against variant pathogenicity, a process governed by the American College of Medical Genetics and Genomics (ACMG) guidelines, which draw on data from the scientific literature. Despite recent progress towards automation, finding pieces of evidence for pathogenicity in the literature still requires manual curation, a time-consuming process given the ever-growing number of published papers.

In this work, we built DAVI (Dataset for Automatic Variant Interpretation), a reliable, manually curated dataset of articles that either contain (positive) or do not contain (negative) evidence activating two opposing ACMG criteria, namely PS3 and BS3, for a pool of 41 variants. Moreover, we demonstrated that DAVI can be used to train a predictive model that automatically identifies positive (variant, article) associations.

DAVI contains 311 (variant, article) pairs: 154 positive and 157 negative associations. We combined three different text representation models with logistic regression to identify positive associations, achieving an F1-score of 0.84. The model’s performance constitutes a clear proof of concept for automatic PS3/BS3 evidence identification. DAVI is a useful resource for training further models.
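As a reminder of what the reported metric measures, the following is a minimal sketch of how an F1-score is computed for the positive class. The function name `f1_score` and the toy labels are illustrative assumptions and are not taken from the paper; only the pair counts (154 positive, 157 negative, 311 in total) come from the abstract.

```python
# Minimal sketch: F1-score for the positive class, standard library only.
# The toy labels below are hypothetical and are NOT DAVI data or results.

def f1_score(y_true, y_pred, positive=1):
    """Return the harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Pair counts stated in the abstract.
N_POSITIVE = 154
N_NEGATIVE = 157
N_TOTAL = N_POSITIVE + N_NEGATIVE          # 311 (variant, article) pairs
POSITIVE_FRACTION = N_POSITIVE / N_TOTAL   # close to 0.5: nearly balanced

# Hypothetical toy example: 1 = positive association, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
toy_f1 = f1_score(y_true, y_pred)  # tp=2, fp=1, fn=1
```

Because the two classes in DAVI are nearly balanced, an F1-score on the positive class is not dominated by a majority class, which makes the reported value easier to interpret.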

Notes

  1. https://europepmc.org/

Author information

Corresponding author

Correspondence to Barbara Di Camillo.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Longhin, F., Guazzo, A., Longato, E., Ferro, N., Di Camillo, B. (2023). DAVI: A Dataset for Automatic Variant Interpretation. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_8

  • DOI: https://doi.org/10.1007/978-3-031-42448-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42447-2

  • Online ISBN: 978-3-031-42448-9

  • eBook Packages: Computer Science, Computer Science (R0)
