In Silico Tools for Gene Discovery

As functional genomics has become one of the major focuses in molecular biology, the need for more sophisticated tools to assist in the identification of the functionality of undefined genes and the correlation of DNA variants with a particular phenotype



Tongsima et al.
B. Yu, M. Hinchcliffe (eds.), In Silico Tools for Gene Discovery, Methods in Molecular Biology 760, DOI 10.1007/978-1-61779-176-5_13, © Springer Science+Business Media, LLC 2011

1. Introduction

Before the advances in molecular biology, genes were merely abstract units of heredity, known only from the phenotypic expression of genetic variants (alleles). We now define alleles from variations in DNA sequences. The smallest unit of variation is a change of a single base, either as a substitution (single-nucleotide polymorphism, SNP) or as an insertion/deletion of a base (Indel). A number of in silico tools have been developed to assist in SNP and Indel analysis. Whatever method is used for detecting DNA variants, all putative novel variants must be unequivocally verified by DNA sequencing. Much effort is thus focused toward resequencing genomic regions among cohorts of individuals.

The availability of genome sequences has greatly facilitated the process of DNA variant discovery by the resequencing approach. Novel DNA variants within candidate regions may be rare, in which case the same region may have to be analyzed among several individuals. The “shotgun” approach using next-generation sequencing methods is not appropriate for this task, as most variants discovered would not be within the target region and the cost is still too high to be practical for this application.

The conventional (Sanger) resequencing approach for variant discovery begins with the design of overlapping PCR amplicons for the candidate genomic region from the reference genome sequence. The amplicons are limited to a few hundred base pairs each, since the maximum sequence read length is approximately 800 bp. PCR primers must be designed to amplify the target genomic region specifically and to avoid repetitive sequence (including pseudogenes), known SNPs within the primer sites, high GC content, and known copy number variation regions. PCR primer design is facilitated by “in silico PCR” tools, which are described in Chapters 6 and 18. Optimal PCR conditions for each primer pair also need to be determined empirically.

Once the optimal conditions for each amplicon are known, the amplicons are sequenced using the same primers. Sequencing is carried out by the Sanger method (1) using BigDye terminator reaction chemistry, and bases are detected with capillary-based sequencing machines (2). Fluorescence-based sequencers generate two data files for each sample read: a chromatogram trace file (e.g., .abi, .scf, .alf, .ctf, and .ztr) and a FASTA base-called sequence file. The automatic base-calling procedure used to generate the FASTA sequence translates the fluorescence intensities in the chromatogram trace file into base calls.
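To make the base-calling step concrete, the sketch below shows one simplified way that per-position fluorescence intensities could be turned into a base-called sequence, including an IUPAC ambiguity code where two channels both give a strong signal. It is an illustration only, not the algorithm used by any particular sequencer or base caller: it assumes the peak heights for the four channels have already been extracted at each calling position, and the names `call_base`, `call_sequence`, and the `secondary_ratio` threshold are hypothetical.

```python
# Toy base caller. Assumes peak heights per calling position are already
# extracted from the trace; real base callers also perform peak detection,
# spacing correction, and quality scoring.

# IUPAC codes for two-base ambiguities, keyed by the unordered base pair.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def call_base(peaks, secondary_ratio=0.5):
    """Call one position from a dict of channel peak heights.

    peaks: e.g. {"A": 900, "C": 20, "G": 850, "T": 15}
    secondary_ratio: hypothetical threshold; a second peak at or above this
        fraction of the strongest peak is treated as a mixed signal.
    """
    ranked = sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top_height), (second_base, second_height) = ranked[0], ranked[1]
    if top_height <= 0:
        return "N"  # no usable signal at this position
    if second_height >= secondary_ratio * top_height:
        # Two strong channels: report the matching two-base ambiguity code.
        return IUPAC_TWO_BASE[frozenset((top_base, second_base))]
    return top_base


def call_sequence(positions, secondary_ratio=0.5):
    """Call every position and join the results into one sequence string."""
    return "".join(call_base(p, secondary_ratio) for p in positions)


if __name__ == "__main__":
    trace = [
        {"A": 900, "C": 20, "G": 30, "T": 15},   # clean A
        {"A": 10, "C": 800, "G": 25, "T": 700},  # mixed C/T
        {"A": 450, "C": 12, "G": 500, "T": 8},   # mixed A/G
        {"A": 5, "C": 10, "G": 20, "T": 950},    # clean T
    ]
    print(call_sequence(trace))  # prints AYRT
```

The sketch only considers the two strongest channels, so positions with three or more strong signals are not handled, and the 0.5 threshold is an arbitrary illustrative value rather than a recommended setting.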
When more than one base signal is detected at a calling position, an International Union of Pure and Applied Chemistry (IUPAC) ambiguous nucleotide code is assigned to that position. Since individuals heterozygous for a variant are more common than homozygotes, variants typically manifest in chromatogram traces as mixed signals. These signals a