{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,6]],"date-time":"2024-08-06T08:30:10Z","timestamp":1722933010411},"reference-count":19,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2021,4,29]],"date-time":"2021-04-29T00:00:00Z","timestamp":1619654400000},"content-version":"vor","delay-in-days":118,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["2P41GM103504-11"],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["DP5OD017937"],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["T32GM8806"],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100007631","name":"Canadian Institute for Advanced Research","doi-asserted-by":"publisher","award":["FL-000655"],"id":[{"id":"10.13039\/100007631","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2021,4,29]]},"abstract":"High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information\u2019s Sequence Read Archive (SRA) are missing metadata across several categories. 
In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute\u2013value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus\/Species (94.85%), Condition\/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation.\nDatabase URL: https:\/\/github.com\/cartercompbio\/PredictMEE","DOI":"10.1093\/database\/baab021","type":"journal-article","created":{"date-parts":[[2021,4,17]],"date-time":"2021-04-17T00:49:18Z","timestamp":1618620558000},"source":"Crossref","is-referenced-by-count":11,"title":["Increasing metadata coverage of SRA BioSample entries using deep learning\u2013based named entity recognition"],"prefix":"10.1093","volume":"2021","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-7600-3086","authenticated-orcid":false,"given":"Adam","family":"Klie","sequence":"first","affiliation":[{"name":"Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA"},{"name":"Bioinformatics and Systems Biology Program, University of 
California San Diego, La Jolla, CA 92093, USA"}]},{"given":"Brian Y","family":"Tsui","sequence":"additional","affiliation":[{"name":"Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA"},{"name":"Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA"}]},{"given":"Shamim","family":"Mollah","sequence":"additional","affiliation":[{"name":"Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA"},{"name":"Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA"},{"name":"Department of Genetics, Washington University in St. Louis, St. Louis, MO 63130, USA"}]},{"given":"Dylan","family":"Skola","sequence":"additional","affiliation":[{"name":"Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA"},{"name":"Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA"}]},{"given":"Michelle","family":"Dow","sequence":"additional","affiliation":[{"name":"Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA"},{"name":"Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA"}]},{"given":"Chun-Nan","family":"Hsu","sequence":"additional","affiliation":[{"name":"Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA"},{"name":"Department of Neurosciences, University of California San Diego, La Jolla, CA 92093, USA"}]},{"given":"Hannah","family":"Carter","sequence":"additional","affiliation":[{"name":"Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, 
USA"}]}],"member":"286","published-online":{"date-parts":[[2021,4,29]]},"reference":[{"key":"2021070818441226500_R1","doi-asserted-by":"publisher","first-page":"207","DOI":"10.1093\/nar\/30.1.207","article-title":"Gene expression omnibus: NCBI gene expression and hybridization array data repository","volume":"30","author":"Edgar","year":"2002","journal-title":"Nucleic Acids Res."},{"key":"2021070818441226500_R2","doi-asserted-by":"publisher","first-page":"D19","DOI":"10.1093\/nar\/gkq1019","article-title":"The sequence read archive","volume":"39","author":"Leinonen","year":"2011","journal-title":"Nucleic Acids Res."},{"key":"2021070818441226500_R3","doi-asserted-by":"publisher","DOI":"10.1038\/nbt.3838","article-title":"Reproducible RNA-seq analysis using recount2","volume":"35","author":"Collado-Torres","year":"2017","journal-title":"Nat. Biotechnol."},{"key":"2021070818441226500_R4","doi-asserted-by":"publisher","DOI":"10.1038\/s41467-018-03751-6","article-title":"Massive mining of publicly available RNA-seq data from human and mouse","volume":"9","author":"Lachmann","year":"2018","journal-title":"Nat. Commun."},{"key":"2021070818441226500_R5","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2019.21","article-title":"The variable quality of metadata about biological samples used in biomedical experiments","volume":"6","author":"Gon\u00e7alves","year":"2019","journal-title":"Sci. Data"},{"key":"2021070818441226500_R6","doi-asserted-by":"publisher","first-page":"D64","DOI":"10.1093\/nar\/gkr937","article-title":"The BioSample database (BioSD) at the European Bioinformatics Institute","volume":"40","author":"Gostev","year":"2012","journal-title":"Nucleic Acids Res."},{"key":"2021070818441226500_R7","doi-asserted-by":"publisher","first-page":"420","DOI":"10.1100\/tsw.2009.57","article-title":"Minimum information about a microarray experiment (MIAME)\u2014successes, failures, challenges","volume":"9","author":"Brazma","year":"2009","journal-title":"Sci. 
World J."},{"key":"2021070818441226500_R8","doi-asserted-by":"publisher","first-page":"1274","DOI":"10.1038\/ni.3873","article-title":"Adaptive immune receptor repertoire community recommendations for sharing immune-repertoire sequencing data","volume":"18","author":"Rubelt","year":"2017","journal-title":"Nat Immunol."},{"key":"2021070818441226500_R9","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-018-2247-6","article-title":"CEDAR OnDemand: a browser extension to generate ontology-based scientific metadata","volume":"19","author":"Bukhari","year":"2018","journal-title":"BMC Bioinform."},{"key":"2021070818441226500_R10","doi-asserted-by":"publisher","first-page":"D57","DOI":"10.1093\/nar\/gkr1163","article-title":"BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata","volume":"40","author":"Barrett","year":"2012","journal-title":"Nucleic Acids Res."},{"key":"2021070818441226500_R11","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1007\/s12551-018-0490-8","article-title":"Mining data and metadata from the gene expression omnibus","volume":"11","author":"Wang","year":"2018","journal-title":"Biophys. Rev."},{"key":"2021070818441226500_R12","doi-asserted-by":"publisher","first-page":"2914","DOI":"10.1093\/bioinformatics\/btx334","article-title":"MetaSRA: normalized human sample-specific metadata for the sequence read archive","volume":"33","author":"Bernstein","year":"2017","journal-title":"Bioinformatics"},{"key":"2021070818441226500_R13","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-017-1832-4","article-title":"Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata","volume":"18","author":"Hu","year":"2017","journal-title":"BMC Bioinform."},{"key":"2021070818441226500_R14","first-page":"3111","article-title":"Distributed representations of words and phrases and their compositionality","volume":"26","author":"Mikolov","year":"2013","journal-title":"Adv. Neural Inf. 
Process. Syst."},{"key":"2021070818441226500_R15","first-page":"166","article-title":"How to train good word embeddings for biomedical NLP","author":"Chiu","year":"2016"},{"key":"2021070818441226500_R16","article-title":"BERT: pre-training of deep bidirectional transformers for language understanding","volume-title":"arXiv, arXiv:1810.04805","author":"Devlin","year":"2018"},{"key":"2021070818441226500_R17","doi-asserted-by":"crossref","first-page":"602","DOI":"10.1016\/j.neunet.2005.06.042","article-title":"Framewise phoneme classification with bidirectional LSTM and other neural network architectures","volume":"18","author":"Graves","year":"2005","journal-title":"Neural Netw."},{"key":"2021070818441226500_R18","doi-asserted-by":"publisher","first-page":"146","DOI":"10.1007\/978-3-030-21348-0_10","article-title":"Aligning biomedical metadata with ontologies using clustering and embeddings","volume":"11503","author":"Gon\u00e7alves","year":"2019","journal-title":"Semant. Web Lect. Notes Comput. Sci."},{"key":"2021070818441226500_R19","doi-asserted-by":"publisher","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR guiding principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci. 
Data"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baab021\/37578643\/baab021.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baab021\/37578643\/baab021.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,7,9]],"date-time":"2021-07-09T02:16:29Z","timestamp":1625796989000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baab021\/6259052"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,1]]},"references-count":19,"URL":"https:\/\/doi.org\/10.1093\/database\/baab021","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/414136","asserted-by":"object"}]},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021,1,1]]},"published":{"date-parts":[[2021,1,1]]}}}