Algorithms for Computational Biology First International Conference,

This book constitutes the refereed proceedings of the First International Conference, AlCoB 2014, held in July 2014 in Tarragona, Spain. The 20 revised full papers were carefully reviewed and selected from 39 submissions. The scope of AlCoB includes topics



To cite this version: Claire Lemaitre, Liviu Ciortuz, Pierre Peterlongo. Mapping-free and assembly-free discovery of inversion breakpoints from raw NGS reads. Adrian-Horia Dediu; Carlos Martín-Vide; Bianca Truthe. Algorithms for Computational Biology, Jul 2014, Tarragona, Spain. Lecture Notes in Computer Science, 8542, pp. 119-130.

HAL Id: hal-01063157
https://hal.inria.fr/hal-01063157v3
Submitted on 17 Nov 2014


Mapping-Free and Assembly-Free Discovery of Inversion Breakpoints from Raw NGS Reads

Claire Lemaitre¹, Liviu Ciortuz¹,², and Pierre Peterlongo¹

¹ INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes cedex, France
{claire.lemaitre,pierre.peterlongo}@inria.fr
² Faculty of Computer Science, Iasi, Romania
[email protected]

Abstract. We propose a formal model and an algorithm for detecting inversion breakpoints without a reference genome, directly from raw NGS data. This model is characterized by a fixed-size topological pattern in the de Bruijn graph. We describe precisely the possible sources of false positives and false negatives, and we additionally propose a sequence-based filter that gives a good trade-off between the precision and recall of the method. We implemented these ideas in a prototype called TakeABreak. Applied to simulated inversions in genomes of various complexity (from E. coli to a human chromosome dataset), TakeABreak provided promising results with a low memory footprint and a small computational time.

Keywords: structural variant, NGS, reference-free, de Bruijn graph.

1 Introduction

Structural variation is an important source of variation in genomes and can be involved in phenotypic variation, inherited diseases, evolution and speciation. The extent of structural variation in populations has only recently been acknowledged, thanks mainly to next generation sequencing (NGS). In fact, by sequencing the genomes of several human individuals, one can find more DNA involved in structural variations than in single nucleotide polymorphisms (SNPs) [8]. However, due to the small size of the reads, these variants are much more difficult to identify than SNPs. Most methods proposed so far rely on mapping the reads to a reference genome. The main approach calls structural variant breakpoints when mapped read pairs show discordant mappings with respect to the expected insert size and orientation of the reads [7]. Due mainly to repetitions in complex genomes and mapping errors, these methods suffer from high false positive rates and a small overlap between predi