Evolutionary Genomics Statistical and Computational Methods, Volume

Together with early theoretical work in population genetics, the debate on sources of genetic makeup initiated by proponents of the neutral theory made a solid contribution to the spectacular growth in statistical methodologies for molecular evolution. Ev



Introduction

Protein-coding genes are the DNA sequences used as templates for the production of functional proteins. Such sequences consist of nucleotide triplets called codons. During protein production, the gene is transcribed into messenger RNA, whose codons are then translated into amino acids (AAs) according to the organism’s genetic code. In the past, selection studies on coding DNA mainly focused on the analysis of particular proteins of interest. With the availability of comparative genomic data, the emphasis has shifted from the study of individual proteins to genome-wide scans for selection. An overview of the genomic data underlying the genome-wide analysis of protein-coding genes is given in Subheading 2.
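The mapping from codons to amino acids described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1); organisms or organelles that use a different code would need a different table. The names `STANDARD_CODE`, `split_codons`, and `translate` are hypothetical helpers introduced for this illustration only and do not come from the chapter or from any particular library.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), written in the
# conventional T, C, A, G ordering of the three codon positions.
# '*' marks a stop codon. This table is an assumption of the sketch:
# other genetic codes differ in a few entries.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split an in-frame coding sequence into nucleotide triplets (codons)."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds, code=STANDARD_CODE):
    """Translate a coding sequence into its amino acid sequence."""
    return "".join(code[codon] for codon in split_codons(cds))


if __name__ == "__main__":
    reference = "ATGGCCAAATGA"      # codons: ATG GCC AAA TGA
    synonymous = "ATGGCTAAATGA"     # GCC -> GCT, amino acid unchanged
    nonsynonymous = "ATGGACAAATGA"  # GCC -> GAC, alanine -> aspartate

    print(split_codons(reference))  # ['ATG', 'GCC', 'AAA', 'TGA']
    print(translate(reference))     # MAK*
    print(translate(synonymous))    # MAK*
    print(translate(nonsynonymous)) # MDK*
```

The example shows why the same gene can be viewed as DNA, codons, or amino acids: a single nucleotide change may leave the protein untouched (a synonymous change) or alter it (a nonsynonymous change), and only the codon-level view distinguishes the two.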

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2, Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_5, © Springer Science+Business Media, LLC 2012


C. Kosiol and M. Anisimova

The analysis of coding sequences can be performed on three different levels: using DNA, AA, or codon sequences. The mutational processes at these three levels can be described by probabilistic models, which set the basis for evaluating selective pressures and for selection tests. The fundamental properties of these models are summarized in Subheading 3.1.

There is accumulating evidence that the evolutionary process varies between sites in biological sequences. Even in nonfunctional genomic regions, there appears to be variation in the mutational process. This variation is even more pronounced in active genomic segments. In protein-coding sequences, changes that impede function are unlikely to be accepted by selection (e.g., a mutation in an active site), while those altering less vital areas are under weaker selective constraints (e.g., a mutation in a nonfunctional loop region). Furthermore, systematic studies have shown that variability is not determined exclusively by selection on protein structure and function, but is also affected by the genomic position of the encoding genes, their expression patterns, their position in biological networks, and their robustness to mistranslation (see ref. 1 for a review of these factors).

In Fig. 1, we summarize the different levels of modeling selection on protein-coding sequences. The wedges represent the three data types: DNA, AA, and codons. Temporal heterogeneity is represented by the tree branches, which range from lineage-specific models to analyses considering genealogies and population properties, such as the effective population size and the distribution of selective coefficients. For example, temporal heterogeneity is included in models that detect accelerated regions in DNA, rate shifts in AA sequences, or branch-specific changes in codon models. Furthermore, the concentric layers in Fig. 1 describe different levels of modeling spatial heterogeneity in cDNA, such as phylogenetic hidden Markov models (phylo-HMMs) for DNA or branch-site models for codon sequences. Within the “Methods”

Fig. 1. A diagram illustrating the different data levels to analyze p