Bioinformatics for DNA Sequence Analysis

The storage, processing, description, transmission, connection, and analysis of the waves of new genomic data have made bioinformatics skills essential for scientists working with DNA sequences. In Bioinformatics for DNA Sequence Analysis, experts in the



1. Introduction

1.1. An Introduction to Nucleotide Databases

Perhaps the central goal of genetics is to articulate the associations of phenotypes of interest with their underlying genetic components and then to understand the relationship between genetic variation and variation in the phenotype. This goal has been buoyed by the tremendous increase in our ability to obtain molecular genetic data, across both populations and species. As methods of gathering information about various aspects of biological macromolecules arose, biological information became abundant and the need to consolidate and make this information accessible became increasingly apparent. In the early 1960s, Margaret Dayhoff and colleagues at the National Biomedical Research Foundation (NBRF) began collecting information on protein sequences and structure into a volume entitled Atlas of Protein Sequence and Structure (1). Since that beginning, databases have been an important and essential part of biological and biochemical research.

David Posada (ed.), Bioinformatics for DNA Sequence Analysis, Methods in Molecular Biology 537, © Humana Press, a part of Springer Science+Business Media, LLC 2009. DOI 10.1007/978-1-59745-251-9_1


Menlove, Clement, and Crandall

By 1972, the size of the Atlas had become unwieldy, so Dr. Dayhoff, a pioneer of bioinformatics, developed a database infrastructure into which this information could be funneled. Though nucleotide information was included in the Atlas as early as 1966 (2), the Atlas consisted mainly of amino acid sequences with structural annotation.

1.2. International Nucleotide Sequence Database Collaboration: DDBJ, EMBL, and GenBank

It was not until 1982 that databases with the express purpose of storing nucleotide sequences were developed by the European Molecular Biology Laboratory (EMBL: http://www.embl.org/) in Europe and the National Institutes of Health (NIH – NCBI: http://www.ncbi.nlm.nih.gov/) in North America. Japan followed suit with the creation of its DNA Databank (DDBJ: http://www.ddbj.nig.ac.jp/) in 1986. A sizeable amount of sharing naturally occurred between these three databases and the Genome Sequence Database, also in North America, which led to their coalition in 1988 under the title International Nucleotide Sequence Database Collaboration (INSDC). They remain distinct entities, but at the 1988 meeting they established policies to govern the formatting of and stewardship over the sequences each receives. Their current policies include unrestricted access to and use of all data records, proper citation of data originators, and the responsibility of submitters to verify the validity of the data and their right to submit it. The INSDC currently contains approximately 80 billion base pairs (bp) (not including whole-genome shotgun sequences) and nearly 80 million sequence entries. Including shotgun sequences (HTGS), it passed the 100-gigabase mark on August 22, 2005, and contains approximately 200 billion bp as of September 2007. For more than 10 years, the amou