Naming 'junk': Human non-protein coding RNA (ncRNA) gene nomenclature


Mathew W. Wright* and Elspeth A. Bruford
HUGO Gene Nomenclature Committee (HGNC), EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, CB10 1SD, UK
*Correspondence to: Tel: +44 (0)1223 494 444; Fax: +44 (0)1223 494 468; E-mail: [email protected]
Date received (in revised form): 4th October 2010

Abstract

Previously, the majority of the human genome was thought to be ‘junk’ DNA with no functional purpose. Over the past decade, the field of RNA research has rapidly expanded, with a concomitant increase in the number of non-protein coding RNA (ncRNA) genes identified in this ‘junk’. Many of the encoded ncRNAs have already been shown to be essential for a variety of vital functions, and this wealth of annotated human ncRNAs requires standardised naming in order to aid effective communication. The HUGO Gene Nomenclature Committee (HGNC) is the only organisation authorised to assign standardised nomenclature to human genes. Of the 30,000 approved gene symbols currently listed in the HGNC database (http://www.genenames.org/search), the majority represent protein-coding genes; however, they also include pseudogenes, phenotypic loci and some genomic features. In recent years the list has also increased to include almost 3,000 named human ncRNA genes. HGNC is actively engaging with the RNA research community in order to provide unique symbols and names for each sequence that encodes an ncRNA. Most of the classical small ncRNA genes have now been provided with a unique nomenclature, and work on naming the long (>200 nucleotides) non-coding RNAs (lncRNAs) is ongoing.

Keywords: ncRNA, RNA, nomenclature, non-protein coding

Introduction

At the beginning of this century, many geneticists were predicting that the human genome contained around 100,000 protein-coding genes, partly based on the assumption that more complex organisms would have a greater number of genes. Ten years later, with far more genomic data from a wide variety of organisms and a much better-quality, well-annotated human genome, this original expectation has been downsized to around 20,000 protein-coding genes. This means that highly complex organisms like the human have about the same number of protein-coding genes as much simpler life forms such as the roundworm, Caenorhabditis elegans. If we look to the human’s closest living relative, the chimpanzee, we see that the equivalent proteins in human and chimpanzee typically differ by only two amino acids, and approximately 29 per cent of all the orthologous proteins encoded in human and chimpanzee are identical.1 Why, then, when the protein-coding components of our genomes are so similar, are humans and chimpanzees so strikingly different? Since protein-coding genes comprise only two per cent of the human genome, the answer may lie in the large swathes of the genome previously regarded as ‘junk DNA’. Indeed, the ENCyclopedia Of DNA Elements (ENCODE) Consortium,2 which is aiming to identify all the functional elements in the human genome, su