Abstract
Metagenomics is the study of genomic sequences in a heterogeneous microbial sample taken, for example, from soil, water, the human microbiome, or skin. One of the primary objectives of metagenomic studies is to assign a taxonomic identity to each read sequenced from a sample and then to estimate the abundance of the known clades. As ever-larger metagenomic datasets from high-throughput sequencing technologies become readily available, several fast and accurate methods have been developed that work with reasonable computing requirements. Here we provide an overview of state-of-the-art methods for the classification of metagenomic sequences, highlighting the theoretical factors that appear to correlate well with practical factors and could therefore be useful in choosing an existing method or developing a new one in experimental contexts. In particular, we emphasize that the information derived from the known genomes and ultimately used in the learning and classification processes may create several experimental issues, mostly related to the amount of information used in these processes and to its uniqueness, significance, and redundancy, and that some of these issues are intrinsic both to current alignment-based approaches and to compositional ones. This entails the need to develop efficient alignment-free methods that overcome such problems by combining the learning and classification processes in a single framework.
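To make the two objectives named in the abstract concrete (assigning a taxonomic identity to each read, then estimating clade abundance), the following is a minimal sketch of one common alignment-free strategy: exact matching of k-mers that are unique to a single reference clade. It is not the method of any specific tool surveyed in the chapter. The function names, the toy value of `K`, the tie-handling rule, and the sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

K = 4  # toy k-mer length (assumption); real tools use much larger values


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_unique_index(references, k=K):
    """Map each k-mer that occurs in exactly one reference clade to that clade."""
    owners = defaultdict(set)
    for clade, genome in references.items():
        for km in kmers(genome, k):
            owners[km].add(clade)
    # k-mers shared by several clades carry no discriminative signal: drop them
    return {km: next(iter(c)) for km, c in owners.items() if len(c) == 1}


def classify_read(read, index, k=K):
    """Return the clade with the most unique k-mer hits, or None on no hit or a tie."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None
    ranked = hits.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous read: left unclassified in this sketch
    return ranked[0][0]


def estimate_abundance(reads, index, k=K):
    """Relative abundance of each clade among the classified reads."""
    calls = [classify_read(r, index, k) for r in reads]
    counts = Counter(c for c in calls if c is not None)
    total = sum(counts.values())
    return {clade: n / total for clade, n in counts.items()} if total else {}


# Hypothetical toy data, invented for illustration only.
references = {
    "clade_A": "ACGTACGTTTGACCA",
    "clade_B": "GGCATTCAGGCTAAC",
}
reads = ["ACGTTTGA", "TTCAGGCT", "CGTACGTT", "NNNNNNNN"]

index = build_unique_index(references)
calls = [classify_read(r, index) for r in reads]
abundance = estimate_abundance(reads, index)
```

Even in this toy form, the sketch shows why the amount, uniqueness, and redundancy of the reference information matter: every k-mer shared between clades is discarded at index time, so closely related references leave fewer discriminative k-mers and more unclassified reads.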
H.A. is supported and M.E. and D.V. are partially supported by the ERASMUS+ KA107 project no. 2018-1-IT02-KA107-047786.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Amraoui, H., Elloumi, M., Marcelloni, F., Mhamdi, F., Verzotto, D. (2019). Theoretical and Practical Analyses in Metagenomic Sequence Classification. In: Anderst-Kotsis, G., et al. Database and Expert Systems Applications. DEXA 2019. Communications in Computer and Information Science, vol 1062. Springer, Cham. https://doi.org/10.1007/978-3-030-27684-3_5
DOI: https://doi.org/10.1007/978-3-030-27684-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27683-6
Online ISBN: 978-3-030-27684-3
eBook Packages: Computer Science, Computer Science (R0)