Abstract
In clustering, one of the most challenging aspects is the validation, whose objective is to evaluate how good a clustering solution is. Sequence binning is a clustering task on metagenomic data analysis. The sequence clustering challenge is essentially putting together sequences belonging to the same genome. As a clustering problem it requires proper use of validation criteria of the discovered partitions. In sequence binning, the concepts of precision and recall, and F-measure index (external validation) are normally used as benchmark. However, on practice, information about the (sub) optimal number of cluster is unknown, so these metrics might be biased to an overestimated “ground truth”. In the case of sequence binning analysis, where the reference information about genomes is not available, how to evaluate the quality of bins resulting from a clustering solution? To answer this question we empirically study both quantitative (internal indexes) and qualitative aspects (biological soundness) while evaluating clustering solutions on the sequence binning problem. Our experimental study indicates that the number of clusters, estimated by binning algorithms, do not have as much impact on the quality of bins by means of biological soundness of the discovered clusters. The quality of the sub-optimal bins (greater than 90%) were identified in both rich and poor clustering partitions. Qualitative validation is essential for proper evaluation of a sequence binning solution, generating bins with sub-optimal quality. Internal indexes can only be used in compliance with qualitative ones as a trade-off between the number of partitions and biological soundness of its respective bins.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Mande, S.S.: Classification of metagenomic sequences: methods and challenges. Brief. Bioinform. 13, 669–681 (2012)
Sedlar, K.: Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Comput. Struct. Biotechnol. J. 15, 48–55 (2017)
Wang, Y., et al.: MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics 28(18), i356–i362 (2012)
Vinh, L., et al.: A two-phase binning algorithm using \(l\)-mer frequency on groups of non-overlapping reads. Algorithms Mol. Biol. 10, 2 (2015). https://doi.org/10.1186/s13015-014-0030-4
Wang, Y., et al.: MBBC: an efficient approach for metagenomic binning based on clustering. BMC Bioinform. 16, 36 (2015)
Wu, Y., et al.: MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2, 26 (2014). https://doi.org/10.1186/2049-2618-2-26
Lin, H., Yu-Chieh, L.: Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes. Sci. Rep. 6, 24175 (2016)
Parks, D., et al.: CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015)
Simão, F., et al.: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 1367–4803 (2015)
Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Davies, D.L., Bouldin, D.W.: A cluster separation measure. Trans. Pattern Anal. Mach. Intell. 1(2), 224–227 (1979)
Calinski, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. 3(1), 1–27 (1974)
Li, W., et al.: Ultrafast clustering algorithms for metagenomic sequence analysis. Brief. Bioinform. 13(6), 656–668 (2012)
Kang, D., Froula, J., Egan, R., Wang, Z.: MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015)
Sieber, C., et al.: Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat. Microbiol. 3, 836–843 (2018)
Van Craenendonck, T., Blockeel, H.: Using internal validity measures to compare clustering algorithms. Benelearn (2015)
Legány, C., Juhász, S., Babos, A.: Cluster validity measurement techniques. In: Proceedings of the 5th WSEAS International Conference on Artificial Intelligence (2006)
Alves, R., Rodriguez-Baena, D.S., Aguilar-Ruiz, J.S.: Gene association analysis: a survey of frequent pattern mining from gene expression data. Brief. Bioinform. 11(2), 210–224 (2010)
Mikheenko, A., Saveliev, V., Gurevich, A.: MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32(7), 1088–1090 (2016)
Gurevich, A., Saveliev, V., Vyahhi, N., Tesler, G.: QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8), 1072–1075 (2013)
Girotto, S., Pizzi, C., Comin, M.: MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 32(17), i567–i575 (2016)
Reyes, P., Villegas, C.: An empirical comparison of EM and K-means algorithms for binning metagenomics datasets. Ingeniare. Rev. Chil. Ing. 26, 20–27 (2018)
Richter, D.C., et al.: MetaSim: a sequencing simulator for genomics and metagenomics. PLoS ONE 3, e3373 (2018)
Alneberg, J., Bjarnason, B.S., De Bruijn, I., Schirmer, M., Quick, J., Ijaz, U.Z., et al.: Binning metagenomic contigs by coverage and composition. Nat. Methods 11(11), 1144–1146 (2014)
Baridam, B.B., Ali, M.M.: An investigation of K-means clustering to high and multi-dimensional biological data. Kybernetes 42(4), 614–627 (2013)
Li, D., et al.: MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3–11 (2016)
Parks, D., et al.: Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017)
Khan, A.R., et al.: A comprehensive study of de novo genome assemblers: current challenges and future prospective. Evol. Bioinform. Online 14 (2018)
Krakauer, D.C., Plotkin, J.B.: Redundancy, antiredundancy, and the robustness of genomes. Proc. Nat. Acad. Sci. U.S.A. 99(3), 1405–1409 (2002)
Chen, H.W., et al.: Predicting genome-wide redundancy using machine learning. BMC Evol. Biol. 10, 1471–2148 (2010)
Klassen, J.L., Currie, C.R.: Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genom. 13, 14 (2012)
Poptsova, M.S., et al.: Non-random DNA fragmentation in next-generation sequencing. Sci. Rep. 4, 4532 (2014)
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D., Gurevich, A.: Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34(13), i142–i150 (2018)
Sangwan, N., Xia, F., Gilbert, J.: Recovering complete and draft population genomes from metagenome datasets. Microbiome 04(1), 2049–2618 (2016)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Oliveira, P., Padovani, K., Alves, R. (2020). On Clustering Validation in Metagenomics Sequence Binning. In: Kowada, L., de Oliveira, D. (eds) Advances in Bioinformatics and Computational Biology. BSB 2019. Lecture Notes in Computer Science(), vol 11347. Springer, Cham. https://doi.org/10.1007/978-3-030-46417-2_1
Download citation
DOI: https://doi.org/10.1007/978-3-030-46417-2_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46416-5
Online ISBN: 978-3-030-46417-2
eBook Packages: Computer ScienceComputer Science (R0)