Site-Specific Amino Acid Distributions Follow a Universal Shape

ORIGINAL ARTICLE

Mackenzie M. Johnson1,2 · Claus O. Wilke1

Received: 5 August 2020 / Accepted: 17 November 2020 / Published online: 24 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In many applications of evolutionary inference, a model of protein evolution needs to be fitted to the amino acid variation at individual sites in a multiple sequence alignment. Most existing models fall into one of two extremes: Either they provide a coarse-grained description that lacks biophysical realism (e.g., dN/dS models), or they require a large number of parameters to be fitted (e.g., mutation–selection models). Here, we ask whether a middle ground is possible: Can we obtain a realistic description of site-specific amino acid frequencies while severely restricting the number of free parameters in the model? We show that a distribution with a single free parameter can accurately capture the variation in amino acid frequency at most sites in an alignment, as long as we are willing to restrict our analysis to predicting amino acid frequencies by rank rather than by amino acid identity. This result holds equally well both in alignments of empirical protein sequences and of sequences evolved under a biophysically realistic all-atom force field. Our analysis reveals a near universal shape of the frequency distributions of amino acids. This insight has the potential to lead to new models of evolution that have both increased realism and a limited number of free parameters.

Keywords: Amino-acid distributions · Protein site variability · Evolutionary modeling
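To make the phrase "frequencies by rank rather than by amino acid identity" concrete, the following minimal Python sketch computes rank-ordered amino acid frequencies for one alignment column. It is an illustration only, not the authors' pipeline: the function name `ranked_site_frequencies`, the toy alignment, and the choice to skip gaps and ambiguous characters are all assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ranked_site_frequencies(alignment, site):
    """Return the 20 amino acid frequencies at one column, most frequent first.

    Hypothetical helper for illustration. Gaps and non-standard characters
    are ignored (an assumption, not something stated in the article).
    """
    column = [seq[site] for seq in alignment if seq[site] in AMINO_ACIDS]
    if not column:
        raise ValueError("no standard amino acids at this site")
    counts = Counter(column)
    total = len(column)
    # Frequencies by identity, including zeros for unobserved amino acids.
    freqs = [counts.get(aa, 0) / total for aa in AMINO_ACIDS]
    # Sorting discards identity and keeps only the rank-ordered shape.
    return sorted(freqs, reverse=True)


# Toy alignment (made up): four sequences, three sites.
toy_alignment = ["MKV", "MRV", "MKI", "MKV"]
ranked = ranked_site_frequencies(toy_alignment, 1)
print(ranked[:3])
```

Two sites that prefer entirely different amino acids can yield the same ranked vector, which is why a description by rank can make do with far fewer parameters than one that tracks each amino acid by identity.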

Handling editor: David Liberles

Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s00239-020-09976-8) contains supplementary material, which is available to authorized users.

* Claus O. Wilke [email protected]

1 Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA
2 Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA

Introduction

To uncover the relationships among protein sequences, and their history, across populations and species, evolutionary biologists frequently fit mathematical models of evolution to homologous sequence alignments. Common applications of such models include phylogenetic tree reconstruction, assessment of the strength and type of selection, and evolutionary rate inference. Early models had only one or two free parameters per alignment (Jukes and Cantor 1969; Kimura 1980), but over time models have become more complex and realistic (Goldman and Yang 1994; Yang and Bielawski 2000; Halpern and Bruno 1998; Kosakovsky Pond and Frost 2005; Yang and Nielsen 2008; Arenas 2015). An important insight from work in this area has been that evolving proteins display substantial variation among individual sites (Echave and Wilke 2017), and thus, site-specific models are critical. In part to address this insig