{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,22]],"date-time":"2025-02-22T00:45:29Z","timestamp":1740185129472,"version":"3.37.3"},"reference-count":53,"publisher":"Oxford University Press (OUP)","issue":"14","license":[{"start":{"date-parts":[[2019,7,8]],"date-time":"2019-07-08T00:00:00Z","timestamp":1562544000000},"content-version":"vor","delay-in-days":7,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"name":"IBM Research AI through the AI Horizons Network"},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["IIS-1565862","III-1845967"],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,7,15]]},"abstract":"Motivation: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant; leading to a high-dimensional low-sample-size under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data; with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from microbiome. 
Data augmentation consists of building synthetic samples and adding them to the training data and is a technique that has proved helpful for many machine learning tasks. Results: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low-sample-size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. Availability and implementation: TADA is available at https:\/\/github.com\/tada-alg\/TADA. Supplementary information: Supplementary data are available at Bioinformatics online.","DOI":"10.1093\/bioinformatics\/btz394","type":"journal-article","created":{"date-parts":[[2019,5,23]],"date-time":"2019-05-23T11:23:07Z","timestamp":1558610587000},"page":"i31-i40","source":"Crossref","is-referenced-by-count":10,"title":["TADA: phylogenetic augmentation of microbiome samples enhances phenotype classification"],"prefix":"10.1093","volume":"35","author":[{"given":"Erfan","family":"Sayyari","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA, USA"}]},{"given":"Ban","family":"Kawas","sequence":"additional","affiliation":[{"name":"IBM Research\u2014Almaden Research Center, San Jose, CA, USA"}]},{"given":"Siavash","family":"Mirarab","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA, 
USA"}]}],"member":"286","published-online":{"date-parts":[[2019,7,5]]},"reference":[{"key":"2023062712343918000_btz394-B1","doi-asserted-by":"crossref","first-page":"e36466.","DOI":"10.1371\/journal.pone.0036466","article-title":"A metagenomic approach to characterization of the vaginal microbiome signature in pregnancy","volume":"7","author":"Aagaard","year":"2012","journal-title":"PLoS One"},{"key":"2023062712343918000_btz394-B2","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1111\/j.2517-6161.1982.tb01195.x","article-title":"The statistical analysis of compositional data","volume":"44","author":"Aitchison","year":"1982","journal-title":"J. R. Stat. Soc. Series B (Methodol.)"},{"key":"2023062712343918000_btz394-B3","doi-asserted-by":"crossref","first-page":"271","DOI":"10.1023\/A:1007529726302","article-title":"Logratio analysis and compositional distance","volume":"32","author":"Aitchison","year":"2000","journal-title":"Math. Geol"},{"key":"2023062712343918000_btz394-B4","doi-asserted-by":"crossref","first-page":"e1004186.","DOI":"10.1371\/journal.pcbi.1004186","article-title":"Explaining diversity in metagenomic datasets by\u00a0phylogenetic-based feature weighting","volume":"11","author":"Albanese","year":"2015","journal-title":"PLoS Comput. 
Biol"},{"key":"2023062712343918000_btz394-B5","doi-asserted-by":"crossref","DOI":"10.1128\/mSystems.00191-16","article-title":"Deblur rapidly resolves single-nucleotide community sequence patterns","volume":"2","author":"Amir","year":"2017","journal-title":"mSystems"},{"key":"2023062712343918000_btz394-B6","article-title":"K-means++: the advantages of careful seeding","author":"Arthur","year":"2007","journal-title":"Proceedings of ACM-SIAM Symposium on Discrete Algorithms"},{"key":"2023062712343918000_btz394-B7","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/BF01441146","article-title":"A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity","volume":"96","author":"Balding","year":"1995","journal-title":"Genetica"},{"key":"2023062712343918000_btz394-B8","doi-asserted-by":"crossref","first-page":"e87830.","DOI":"10.1371\/journal.pone.0087830","article-title":"Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics","volume":"9","author":"Beck","year":"2014","journal-title":"PLoS One"},{"key":"2023062712343918000_btz394-B29","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random Forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Machine Learning"},{"key":"2023062712343918000_btz394-B9","doi-asserted-by":"crossref","first-page":"581","DOI":"10.1038\/nmeth.3869","article-title":"DADA2: high-resolution sample inference from Illumina amplicon data","volume":"13","author":"Callahan","year":"2016","journal-title":"Nat. 
Methods"},{"key":"2023062712343918000_btz394-B10","doi-asserted-by":"crossref","first-page":"R50.","DOI":"10.1186\/gb-2011-12-5-r50","article-title":"Moving pictures of the human microbiome","volume":"12","author":"Caporaso","year":"2011","journal-title":"Genome Biol"},{"key":"2023062712343918000_btz394-B11","first-page":"875","volume-title":"Data Mining and Knowledge Discovery Handbook","author":"Chawla","year":"2010"},{"key":"2023062712343918000_btz394-B12","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1613\/jair.953","article-title":"SMOTE: synthetic minority over-sampling technique","volume":"16","author":"Chawla","year":"2002","journal-title":"J. Artif. Intell. Res"},{"key":"2023062712343918000_btz394-B13","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1016\/j.trsl.2012.05.003","article-title":"The human gut microbiome: current knowledge, challenges, and future directions","volume":"160","author":"Dave","year":"2012","journal-title":"Transl. Res"},{"key":"2023062712343918000_btz394-B14","doi-asserted-by":"crossref","first-page":"5069","DOI":"10.1128\/AEM.03006-05","article-title":"Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB","volume":"72","author":"DeSantis","year":"2006","journal-title":"Appl. Environ. 
Microbiol"},{"key":"2023062712343918000_btz394-B15","doi-asserted-by":"crossref","first-page":"2460","DOI":"10.1093\/bioinformatics\/btq461","article-title":"Search and clustering orders of magnitude faster than BLAST","volume":"26","author":"Edgar","year":"2010","journal-title":"Bioinformatics"},{"key":"2023062712343918000_btz394-B16","article-title":"UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing","author":"Edgar","year":"2016","journal-title":"bioRxiv"},{"key":"2023062712343918000_btz394-B17","doi-asserted-by":"crossref","first-page":"6528.","DOI":"10.1038\/ncomms7528","article-title":"Gut microbiome development along the colorectal adenoma-carcinoma sequence","volume":"6","author":"Feng","year":"2015","journal-title":"Nat. Commun"},{"key":"2023062712343918000_btz394-B18","doi-asserted-by":"crossref","first-page":"531.","DOI":"10.1186\/s13059-014-0531-y","article-title":"Temporal variability is a personalized feature of the human microbiome","volume":"15","author":"Flores","year":"2014","journal-title":"Genome Biol"},{"key":"2023062712343918000_btz394-B19","doi-asserted-by":"crossref","first-page":"382","DOI":"10.1016\/j.chom.2014.02.005","article-title":"The treatment-naive microbiome in new-onset Crohn\u2019s disease","volume":"15","author":"Gevers","year":"2014","journal-title":"Cell Host and Microbe"},{"key":"2023062712343918000_btz394-B20","doi-asserted-by":"crossref","first-page":"1355","DOI":"10.1126\/science.1124234","article-title":"Metagenomic analysis of the human distal gut microbiome","volume":"312","author":"Gill","year":"2006","journal-title":"Science"},{"key":"2023062712343918000_btz394-B21","doi-asserted-by":"crossref","first-page":"796","DOI":"10.1038\/s41592-018-0141-9","article-title":"Qiita: rapid, web-enabled microbiome meta-analysis","volume":"15","author":"Gonzalez","year":"2018","journal-title":"Nat. 
Methods"},{"key":"2023062712343918000_btz394-B22","first-page":"1322","article-title":"ADASYN: adaptive synthetic sampling approach for imbalanced learning","volume-title":"2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)","author":"He","year":"2008"},{"key":"2023062712343918000_btz394-B23","doi-asserted-by":"crossref","first-page":"00021","DOI":"10.1128\/mSystems.00021-18","article-title":"Phylogenetic placement of exact amplicon sequences improves associations with clinical information","volume":"3","author":"Janssen","year":"2018","journal-title":"mSystems"},{"key":"2023062712343918000_btz394-B24","doi-asserted-by":"crossref","first-page":"292","DOI":"10.1016\/j.chom.2011.09.003","article-title":"Human-associated microbial signatures: examining their predictive value","volume":"10","author":"Knights","year":"2011","journal-title":"Cell Host Microbe"},{"volume-title":"Proceedings of the 14th International Conference on Machine Learning","year":"1997","author":"Kubat","key":"2023062712343918000_btz394-B25"},{"key":"2023062712343918000_btz394-B26","doi-asserted-by":"crossref","first-page":"814","DOI":"10.1038\/nbt.2676","article-title":"Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences","volume":"31","author":"Langille","year":"2013","journal-title":"Nat. Biotechnol"},{"key":"2023062712343918000_btz394-B27","doi-asserted-by":"crossref","first-page":"733","DOI":"10.1038\/nrg2825","article-title":"Tackling the widespread and critical impact of batch effects in high-throughput data","volume":"11","author":"Leek","year":"2010","journal-title":"Nat. Rev. Genet"},{"key":"2023062712343918000_btz394-B28","first-page":"1","article-title":"Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning","volume":"18","author":"Lema\u00eetre","year":"2017","journal-title":"J. Mach. Learn. 
Res"},{"key":"2023062712343918000_btz394-B30","doi-asserted-by":"crossref","first-page":"8228","DOI":"10.1128\/AEM.71.12.8228-8235.2005","article-title":"UniFrac: a new phylogenetic method for comparing microbial communities","volume":"71","author":"Lozupone","year":"2005","journal-title":"Appl. Environ. Microbiol"},{"key":"2023062712343918000_btz394-B31","doi-asserted-by":"crossref","first-page":"1576","DOI":"10.1128\/AEM.01996-06","article-title":"Quantitative and qualitative \u03b2 diversity measures lead to different insights into factors that structure microbial communities","volume":"73","author":"Lozupone","year":"2007","journal-title":"Appl. Environ. Microbiol"},{"key":"2023062712343918000_btz394-B32","doi-asserted-by":"crossref","first-page":"e26","DOI":"10.1093\/sysbio\/syu053","article-title":"Phylogenetics and the human microbiome","volume":"64","author":"Matsen","year":"2015","journal-title":"Syst. Biol"},{"key":"2023062712343918000_btz394-B33","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1186\/2047-217X-1-7","article-title":"The biological observation matrix (BIOM) format or: how I learned to stop worrying and love the ome\u2013ome","volume":"1","author":"McDonald","year":"2012","journal-title":"Gigascience"},{"key":"2023062712343918000_btz394-B34","doi-asserted-by":"crossref","DOI":"10.1128\/mSystems.00031-18","article-title":"American gut: an open platform for citizen science microbiome research","volume":"3","author":"McDonald","year":"2018","journal-title":"mSystems"},{"key":"2023062712343918000_btz394-B35","doi-asserted-by":"crossref","first-page":"e1003531.","DOI":"10.1371\/journal.pcbi.1003531","article-title":"Waste not, want not: why rarefying microbiome data is inadmissible","volume":"10","author":"McMurdie","year":"2014","journal-title":"PLoS Comput. 
Biol"},{"key":"2023062712343918000_btz394-B36","first-page":"247","volume-title":"Biocomputing 2012","author":"Mirarab","year":"2012"},{"key":"2023062712343918000_btz394-B37","doi-asserted-by":"crossref","DOI":"10.1128\/mSystems.00162-16","article-title":"Balance trees reveal microbial niche differentiation","volume":"2","author":"Morton","year":"2017","journal-title":"mSystems"},{"volume-title":"The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet","year":"2007","key":"2023062712343918000_btz394-B38"},{"key":"2023062712343918000_btz394-B39","doi-asserted-by":"crossref","first-page":"3548","DOI":"10.1093\/bioinformatics\/btu721","article-title":"TIPP: taxonomic identification and phylogenetic profiling","volume":"30","author":"Nguyen","year":"2014","journal-title":"Bioinformatics"},{"key":"2023062712343918000_btz394-B40","doi-asserted-by":"crossref","first-page":"e1002832","DOI":"10.1371\/journal.pcbi.1002832","article-title":"Phylogenetic diversity theory sheds light on the structure of microbial communities","volume":"8","author":"O\u2019Dwyer","year":"2012","journal-title":"PLoS Comput. Biol"},{"key":"2023062712343918000_btz394-B41","doi-asserted-by":"crossref","first-page":"1200","DOI":"10.1038\/nmeth.2658","article-title":"Differential abundance analysis for microbial marker-gene surveys","volume":"10","author":"Paulson","year":"2013","journal-title":"Nat. Methods"},{"key":"2023062712343918000_btz394-B42","first-page":"2825","article-title":"Scikit-learn: machine learning in {P}ython","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res"},{"key":"2023062712343918000_btz394-B43","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"ImageNet large scale visual recognition challenge","volume":"115","author":"Russakovsky","year":"2015","journal-title":"Int. J. Comput. 
Vis"},{"key":"2023062712343918000_btz394-B44","doi-asserted-by":"crossref","first-page":"1782","DOI":"10.1053\/j.gastro.2011.06.072","article-title":"Gastrointestinal microbiome signatures of pediatric patients with irritable bowel syndrome","volume":"141","author":"Saulnier","year":"2011","journal-title":"Gastroenterology"},{"key":"2023062712343918000_btz394-B45","doi-asserted-by":"crossref","first-page":"1501","DOI":"10.1128\/AEM.71.3.1501-1506.2005","article-title":"Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness","volume":"71","author":"Schloss","year":"2005","journal-title":"Appl. Environ. Microbiol"},{"key":"2023062712343918000_btz394-B46","doi-asserted-by":"crossref","first-page":"11.","DOI":"10.1186\/2049-2618-1-11","article-title":"A comprehensive evaluation of multicategory classification methods for microbiomic data","volume":"1","author":"Statnikov","year":"2013","journal-title":"Microbiome"},{"key":"2023062712343918000_btz394-B47","doi-asserted-by":"crossref","first-page":"1569","DOI":"10.1093\/bioinformatics\/btq228","article-title":"DendroPy: a Python library for phylogenetic computing","volume":"26","author":"Sukumaran","year":"2010","journal-title":"Bioinformatics"},{"key":"2023062712343918000_btz394-B48","doi-asserted-by":"crossref","DOI":"10.1128\/mBio.01018-16","article-title":"Looking for a signal in the noise: revisiting obesity and the microbiome","volume":"7","author":"Sze","year":"2016","journal-title":"mBio"},{"key":"2023062712343918000_btz394-B49","doi-asserted-by":"crossref","first-page":"804","DOI":"10.1038\/nature06244","article-title":"The human microbiome project","volume":"449","author":"Turnbaugh","year":"2007","journal-title":"Nature"},{"key":"2023062712343918000_btz394-B50","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1126\/science.1093857","article-title":"Environmental genome shotgun sequencing of the Sargasso 
Sea","volume":"304","author":"Venter","year":"2004","journal-title":"Science"},{"key":"2023062712343918000_btz394-B51","doi-asserted-by":"crossref","first-page":"1126","DOI":"10.1126\/science.1133420","article-title":"Quantitative phylogenetic assessment of microbial communities in diverse environments","volume":"315","author":"von Mering","year":"2007","journal-title":"Science"},{"key":"2023062712343918000_btz394-B52","doi-asserted-by":"crossref","first-page":"e1002050.","DOI":"10.1371\/journal.pbio.1002050","article-title":"Where next for microbiome research?","volume":"13","author":"Waldor","year":"2015","journal-title":"PLoS Biol"},{"key":"2023062712343918000_btz394-B53","doi-asserted-by":"crossref","first-page":"564.","DOI":"10.1186\/s13059-014-0564-2","article-title":"Tracking down the sources of experimental contamination in microbiome studies","volume":"15","author":"Weiss","year":"2014","journal-title":"Genome Biol"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/14\/i31\/50720951\/bioinformatics_35_14_i31.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/35\/14\/i31\/50720951\/bioinformatics_35_14_i31.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,18]],"date-time":"2024-07-18T13:58:56Z","timestamp":1721311136000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/35\/14\/i31\/5529256"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,7]]},"references-count":53,"journal-issue":{"issue":"14","published-print":{"date-parts":[[2019,7,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btz394","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"type":"print","value":"1367-4803"},{"typ
e":"electronic","value":"1367-4811"}],"subject":[],"published-other":{"date-parts":[[2019,7]]},"published":{"date-parts":[[2019,7]]}}}