On the family-free DCJ distance and similarity

  • PDF / 625,106 Bytes
  • 10 Pages / 595 x 794 pts Page_size
  • 5 Downloads / 166 Views

DOWNLOAD

REPORT


RESEARCH

Open Access

On the family-free DCJ distance and similarity Fábio V Martinez1,2 , Pedro Feijão2 , Marília DV Braga3 and Jens Stoye2*

Abstract Structural variation in genomes can be revealed by many (dis)similarity measures. Rearrangement operations, such as the so called double-cut-and-join (DCJ), are large-scale mutations that can create complex changes and produce such variations in genomes. A basic task in comparative genomics is to find the rearrangement distance between two given genomes, i.e., the minimum number of rearragement operations that transform one given genome into another one. In a family-based setting, genes are grouped into gene families and efficient algorithms have already been presented to compute the DCJ distance between two given genomes. In this work we propose the problem of computing the DCJ distance of two given genomes without prior gene family assignment, directly using the pairwise similarities between genes. We prove that this new family-free DCJ distance problem is APX-hard and provide an integer linear program to its solution. We also study a family-free DCJ similarity and prove that its computation is NP-hard. Keywords: Genome rearrangement, DCJ, Family-free genome comparison

Background Genomes are subject to mutations or rearrangements in the course of evolution. Typical large-scale rearrangements change the number of chromosomes and/or the positions and orientations of genes. Examples of such rearrangements are inversions, translocations, fusions and fissions. A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of rearrangements required to transform a given genome into another given genome [1]. In order to study this problem, one usually adopts a high-level view of genomes, in which only “relevant” fragments of the DNA (e.g., genes) are taken into consideration. Furthermore, a pre-processing of the data is required, so that we can compare the content of the genomes. One popular method, adopted for more than 20 years, is to group the genes in both genomes into gene families, so that two genes in the same family are said to be equivalent. This setting is said to be family-based. Without gene duplications, that is, with the additional restriction that each family occurs exactly once in each genome, many polynomial models have been proposed to compute the genomic distance [2-5]. However, when gene *Correspondence: [email protected] 2 Technische Fakultät and CeBiTec, Universität Bielefeld, Universitätsstr. 25, 33615 Bielefeld, Germany Full list of author information is available at the end of the article

duplications are allowed, the problem is more intrincate and all approaches proposed so far are NP-hard, see for instance [6-10]. It is not always possible to classify each gene unambiguously into a single gene family. Due to this fact, an alternative to the family-based setting was proposed recently and consists in studying the rearrangement distance without prior family assignment. Instead of fami