Gene Orthology Inference via Large-Scale Rearrangements for Partially Assembled Genomes

Authors: Diego P. Rubert, Marília D. V. Braga




Author Details

Diego P. Rubert
  • Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS, Brasil
  • Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany
Marília D. V. Braga
  • Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany

Acknowledgements

We thank the anonymous reviewers for their valuable comments.

Cite As

Diego P. Rubert and Marília D. V. Braga. Gene Orthology Inference via Large-Scale Rearrangements for Partially Assembled Genomes. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 24:1-24:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.WABI.2022.24

Abstract

Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes, our method first computes all pairwise gene similarities. It then runs pairwise ILP comparisons to compute optimal gene matchings, which, taking the similarities into account, minimize the weighted rearrangement distance between the analyzed genomes (an NP-hard problem). In the final step, the gene matchings are integrated into gene families. Although the ILP is quite efficient and could conceptually analyze genomes that are not completely assembled but split into several contigs, our tool failed to complete that task. The main reason is that each pairwise ILP comparison includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, producing an exponential increase of the search space.
In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering the linear segments, based on their gene content intersections, into m ≥ 1 subsets whose ends are capped independently. Furthermore, within each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much larger than one, and although the heuristic may yield sub-optimal instead of optimal gene matchings, it works very well in practice in terms of both speed and quality of the computed solutions. Our experiments on real data show that we can now efficiently analyze fruit fly genomes with unfinished assemblies distributed in hundreds or even thousands of contigs, obtaining orthologies that are more similar to FlyBase orthologies than those computed by other inference tools. Moreover, for complete assemblies, the version with heuristic capping reports orthologies that are very similar to those computed by the optimal version of our tool. Our approach is implemented in a pipeline that incorporates the pre-computation of gene similarities.
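The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that two linear segments, one from each genome, count as "content-related" when they contain at least one pair of similar genes, and that the m subsets are the connected components of that relation. The function name `cluster_segments` and the input layout are hypothetical.

```python
from collections import defaultdict


def cluster_segments(segments_a, segments_b, similar_pairs):
    """Group linear segments of two genomes by shared gene content.

    segments_a, segments_b: dict mapping a segment name to its set of gene ids
    similar_pairs: iterable of (gene_in_a, gene_in_b) pairs deemed similar
    Returns (clusters, related), where clusters is a sorted list of sorted
    lists of (genome, segment) nodes and related is the set of
    content-related segment pairs (an assumed definition, see lead-in).
    """
    # Map every gene to the segment that contains it.
    owner_a = {g: s for s, genes in segments_a.items() for g in genes}
    owner_b = {g: s for s, genes in segments_b.items() for g in genes}

    # Union-find over all segments of both genomes.
    parent = {("A", s): ("A", s) for s in segments_a}
    parent.update({("B", s): ("B", s) for s in segments_b})

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    related = set()
    for ga, gb in similar_pairs:
        if ga in owner_a and gb in owner_b:
            sa, sb = ("A", owner_a[ga]), ("B", owner_b[gb])
            related.add((sa, sb))
            parent[find(sa)] = find(sb)  # merge the two components

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    return sorted(sorted(c) for c in clusters.values()), related


# Toy input: two contigs per genome and three similar gene pairs.
segments_a = {"a1": {"g1", "g2"}, "a2": {"g3"}}
segments_b = {"b1": {"h1"}, "b2": {"h2", "h3"}}
similar = [("g1", "h1"), ("g2", "h1"), ("g3", "h2")]
clusters, related = cluster_segments(segments_a, segments_b, similar)
print(len(clusters), clusters)
```

In this toy input the segments split into m = 2 subsets, so each subset could be capped independently, and only the pairs in `related` would be candidates for end connections under the stated assumption.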

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
Keywords
  • Comparative genomics
  • double-cut-and-join
  • indels
  • gene orthology

References

  1. Adrian M Altenhoff, Jeremy Levy, Magdalena Zarowiecki, Bartłomiej Tomiczek, Alex Warwick Vesztrocy, Daniel A Dalquen, Steven Müller, Maximilian J Telford, Natasha M Glover, David Dylus, et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res, 29(7):1152-1163, 2019.
  2. Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. J Mol Biol, 215(3):403-410, 1990.
  3. Sébastien Angibaud, Guillaume Fertin, Irena Rusu, Annelyse Thévenin, and Stéphane Vialette. On the approximability of comparing genomes with duplicates. J Graph Algo App, 13(1):19-53, 2009. URL: https://doi.org/10.7155/jgaa.00175.
  4. Anne Bergeron, Julia Mixtacki, and Jens Stoye. A unifying view of genome rearrangements. In Proc. of WABI, volume 4175 of Lecture Notes in Bioinformatics, pages 163-173, 2006. URL: https://doi.org/10.1007/11851561_16.
  5. Leonard Bohnenkämper, Marília D. V. Braga, Daniel Doerr, and Jens Stoye. Computing the rearrangement distance of natural genomes. J Comput Biol, 28(4):410-431, 2021. URL: https://doi.org/10.1089/cmb.2020.0434.
  6. Marília D. V. Braga, Cedric Chauve, Daniel Doerr, Katharina Jahn, Jens Stoye, Annelyse Thévenin, and Roland Wittler. The potential of family-free genome comparison. In C. Chauve, N. El-Mabrouk, and E. Tannier, editors, Models and Algorithms for Genome Evolution, volume 19 of Computational Biology Series, chapter 13, pages 287-307. Springer Verlag, Berlin, 2013. URL: https://doi.org/10.1007/978-1-4471-5298-9_13.
  7. Marília D. V. Braga, Eyla Willing, and Jens Stoye. Double cut and join with insertions and deletions. J Comput Biol, 18(9):1167-1184, 2011. URL: https://doi.org/10.1089/cmb.2011.0118.
  8. David Bryant. The complexity of calculating exemplar distances. In David Sankoff and Joseph H. Nadeau, editors, Comparative Genomics, volume 1 of Computational Biology Series, pages 207-211. Kluwer Academic Publishers, London, 2000. URL: https://doi.org/10.1007/978-94-011-4309-7_19.
  9. Benjamin Buchfink, Chao Xie, and Daniel H. Huson. Fast and sensitive protein alignment using DIAMOND. Nat Methods, 12:59-60, 2015.
  10. C. Dessimoz, G. Cannarozzi, M. Gil, D. Margadant, A. C. J. Roth, A. Schneider, and G. H. Gonnet. OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements. In Proc. of RECOMB-CG, volume 3678 of Lecture Notes in Bioinformatics, pages 61-72, 2005.
  11. Daniel Doerr, Pedro Feijão, and Jens Stoye. Family-free genome comparison. In João C. Setubal, Jens Stoye, and Peter F. Stadler, editors, Comparative Genomics: Methods and Protocols, volume 1704 of Methods in Molecular Biology, pages 331-342. Springer Nature, New York, 2018. URL: https://doi.org/10.1007/978-1-4939-7463-4_12.
  12. P. Hall. On representatives of subsets. Journal of the London Mathematical Society, s1-10(1):26-30, 1935.
  13. Sridhar Hannenhalli and Pavel A. Pevzner. Transforming men into mice (polynomial algorithm for genomic distance problem). In Proc. of FOCS, pages 581-592, 1995. URL: https://doi.org/10.1109/SFCS.1995.492588.
  14. Marcus Lechner, Sven Findeiß, Lydia Steiner, Manja Marz, Peter F. Stadler, and Sonja J. Prohaska. Proteinortho: Detection of (co-)orthologs in large-scale analysis. BMC Bioinform, 12(124), 2011.
  15. Marcus Lechner, Maribel Hernandez-Rosales, Daniel Doerr, Nicolas Wieseke, Annelyse Thévenin, Jens Stoye, Roland K. Hartmann, Sonja J. Prohaska, and Peter F. Stadler. Orthology detection combining clustering and synteny for very large datasets. PLoS One, 9(8:e105015), 2014.
  16. Fábio V. Martinez, Pedro Feijão, Marília D. V. Braga, and Jens Stoye. On the family-free DCJ distance and similarity. Algorithms Mol Biol, 10(13), 2015. URL: https://doi.org/10.1186/s13015-015-0041-9.
  17. Alexander C. J. Roth, Gaston H. Gonnet, and Christophe Dessimoz. Algorithm of OMA for large-scale orthology inference. BMC Bioinform, 9(518), 2008.
  18. Diego P. Rubert, Daniel Doerr, and Marília D. V. Braga. The potential of family-free rearrangements towards gene orthology inference. J Bioinform Comput Biol, 19(6):2140014, 2021. URL: https://doi.org/10.1142/S021972002140014X.
  19. Diego P. Rubert, Fábio V. Martinez, and Marília D. V. Braga. Natural Family-Free Genomic Distance. Algorithms Mol Biol, 16(4), 2021. URL: https://doi.org/10.1186/s13015-021-00183-8.
  20. David Sankoff. Genome rearrangement with gene families. Bioinformatics, 15(11):909-917, 1999. URL: https://doi.org/10.1093/bioinformatics/15.11.909.
  21. Mingfu Shao, Yu Lin, and Bernard Moret. An exact algorithm to compute the double-cut-and-join distance for genomes with duplicate genes. J Comput Biol, 22(5):425-435, 2015. URL: https://doi.org/10.1089/cmb.2014.0096.
  22. Guanqun Shi, Liqing Zhang, and Tao Jiang. MSOAR 2.0: Incorporating tandem duplications into ortholog assignment based on genome rearrangement. BMC Bioinform, 11(10), 2010.
  23. Tamir Tassa. Finding all maximally-matchable edges in a bipartite graph. Theoretical Computer Science, 423:50-58, 2012.
  24. Sophia Yancopoulos, Oliver Attie, and Richard Friedberg. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics, 21(16):3340-3346, 2005. URL: https://doi.org/10.1093/bioinformatics/bti535.