Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2



(2020) 21:741

METHODOLOGY ARTICLE

Open Access

Julie E. Hernández-Salmerón and Gabriel Moreno-Hagelsieb*

Abstract

Background: Finding orthologs remains an important bottleneck in comparative genomics analyses. While the authors of software for the quick comparison of protein sequences evaluate the speed of their software and compare their results against the most commonly used software for the task, it is not common for them to evaluate their software for more particular uses, such as finding orthologs as reciprocal best hits (RBH). Here we compared RBH results obtained using software that runs faster than blastp, namely lastal, diamond, and MMseqs2.

Results: We found that lastal required the least time to produce results. However, it yielded fewer results than any other program when comparing the proteins encoded by evolutionarily distant genomes. The program producing the most similar number of RBH to blastp was diamond run with the “ultra-sensitive” option. However, this option was diamond’s slowest, with the “very-sensitive” option offering the best balance between speed and RBH results. The speedup of the programs was much more evident when dealing with eukaryotic genomes, which code for more numerous proteins. For example, lastal took a median of approximately 1.5% of the blastp time to run with bacterial proteomes and 0.6% with eukaryotic ones, while diamond with the very-sensitive option took 7.4% and 5.2%, respectively. Though estimated error rates were very similar among the RBH obtained with all programs, RBH obtained with MMseqs2 had the lowest error rates among the programs tested.

Conclusions: The fast algorithms for pairwise protein comparison produced results very similar to blast in a fraction of the time, with diamond offering the best compromise in speed, sensitivity and quality, as long as a sensitivity option other than the default was chosen.

Keywords: Orthologs, Reciprocal best hits, Fast algorithms, Sequence comparison
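As an illustration of the reciprocal best hit idea named in the abstract, the following minimal Python sketch pairs two proteins when each is the other's top-scoring hit. It is not the authors' pipeline: the input layout (query, subject, bit score), the function names, the toy identifiers, and the tie-breaking rule (the first hit seen with the highest score wins) are all assumptions made for this example, and any coverage or E-value filters a real analysis would apply are omitted.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed input layout for this sketch: (query_id, subject_id, bit_score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """Map each query to its highest-scoring subject.

    Ties are broken by keeping the first hit encountered (an assumption).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def reciprocal_best_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> List[Tuple[str, str]]:
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for a, (b, _score) in best_ab.items():
        if b in best_ba and best_ba[b][0] == a:
            pairs.append((a, b))
    return sorted(pairs)


# Toy data (hypothetical identifiers and scores).
a_vs_b = [
    ("a1", "b1", 250.0),
    ("a1", "b2", 90.0),
    ("a2", "b2", 120.0),
    ("a3", "b1", 60.0),
]
b_vs_a = [
    ("b1", "a1", 248.0),
    ("b2", "a1", 95.0),
    ("b2", "a2", 118.0),
    ("b3", "a3", 40.0),
]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

In the toy data, a3's best hit is b1, but b1's best hit is a1, so a3 is left unpaired. Because the hit lists are plain tuples, the same logic could be applied to the tabular output of any of the compared programs once it has been parsed into this layout.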

*Correspondence: [email protected]
Wilfrid Laurier University, Department of Biology, 75 University Ave W, N2L 3C5 Waterloo ON, Canada

Background

Finding orthologs is an important step in comparative genomics and represents a central concept in evolution. Orthologs are defined as characters that diverge after a speciation event [1]. This normally means that, if the characters are genes, then they can be thought of as the same genes in different species. Because of their relationship, orthologs are expected to typically conserve their original function, an inference that has been supported by several lines of evidence [2–5]. Efforts to standardize methods for the inference of orthology remain under constant evaluation, with over forty web services available to the community [6, 7]. Few of these methods are based on phylogenetic analyses (tree-based approach), which, despite being expected to be the most accurate, tend to be computationally intensive and impractical for big datab