A generalized Robinson-Foulds distance for labeled trees

  • PDF / 2,107,512 Bytes
  • 13 Pages / 595 x 791 pts Page_size
  • 58 Downloads / 159 Views

DOWNLOAD

REPORT


RESEARCH

Open Access

A generalized Robinson-Foulds distance for labeled trees Samuel Briand1 , Christophe Dessimoz2,3,4,5,6* , Nadia El-Mabrouk1* , Manuel Lafond7 and Gabriela Lobinska3 From The 18th Asia Pacific Bioinformatics Conference Seoul, Korea. 18-20 August 2020

Abstract Background: The Robinson-Foulds (RF) distance is a well-established measure between phylogenetic trees. Despite a lack of biological justification, it has the advantages of being a proper metric and being computable in linear time. For phylogenetic applications involving genes, however, a crucial aspect of the trees ignored by the RF metric is the type of the branching event (e.g. speciation, duplication, transfer, etc). Results: We extend RF to trees with labeled internal nodes by including a node flip operation, alongside edge contractions and extensions. We explore properties of this extended RF distance in the case of a binary labeling. In particular, we show that contrary to the unlabeled case, an optimal edit path may require contracting “good” edges, i.e. edges shared between the two trees. Conclusions: We provide a 2-approximation algorithm which is shown to perform well empirically. Looking ahead, computing distances between labeled trees opens up a variety of new algorithmic directions. Implementation and simulations available at https://github.com/DessimozLab/pylabeledrf. Keywords: Edit distance, Labeled trees, Robinson-Foulds, Tree metric

Background Phylogenic trees represent the evolutionary relationship between sets of genetic elements or taxa, where the elements of a set are in one-to-one relationship with the leaves of the corresponding tree [1]. Different phylogenetic inference methods may lead to different trees, and each method, typically exploring a large space of trees, can also result in multiple equally likely solutions for the same dataset. It follows that comparing trees is an essential task for finding out how inferred trees are far from one another, or how an inferred tree is far from a simulated tree or from a gold standard tree for the same datasets. *Correspondence: [email protected]; [email protected] Department of Computational Biology, University of Lausanne, Lausanne, Switzerland 1 Computer Science Department, Université de Montréal, Montreal, Canada Full list of author information is available at the end of the article 2

Designing appropriate tree metrics is a widely explored branch of research. A variety of measures have been designed for different types of trees, rooted or unrooted, some restricted to comparing tree shapes [2], others considering multilabeled trees, i.e. trees with repeated leaf labels [3] and yet others considering information on edge length [4]. In particular, a large number of pairwise measures of similarity or dissimilarity have been developed for comparing two topologies on the same leafset. Among them are the methods based on counting the structural differences between the two trees in terms of path length, bipartitions or quartets for unrooted trees, clades or tripl