An experimental study of information content measurement of gene ontology terms
ORIGINAL ARTICLE
Marianna Milano¹ • Giuseppe Agapito¹ • Pietro H. Guzzi¹ • Mario Cannataro¹
Received: 2 April 2015 / Accepted: 16 December 2015
© Springer-Verlag Berlin Heidelberg 2016
Abstract
The gene ontology (GO) is commonly used to store and organize information about the functions of biological molecules through a controlled vocabulary of terms (GO Terms). GO Terms are associated with biological concepts through the annotation process, and researchers use many different annotation processes. Each term has a different specificity, which is formally measured by its information content (IC). Both the structure of GO and the corpora of annotations change continuously as new experimental findings appear. This work focuses on how changes in annotations affect the IC of terms. The study confirms that, for each species, statistically significant differences occur among the annotation corpora of different years. These results indicate that changes in the annotation corpora have a high impact on IC.

Keywords Information content · Gene ontology · Semantic similarity
¹ Department of Surgical and Medical Sciences, University of Catanzaro, Catanzaro, Italy

1 Introduction
The gene ontology (GO) [1] is a large vocabulary containing the representation of biological knowledge through GO terms. Each GO term contains a concise and unambiguous description of a concept, and it is identified by a unique code. The GO consists of three taxonomies, or sub-ontologies: biological processes (BP), molecular functions (MF) and cellular components (CC). Each sub-ontology describes a particular aspect or function of a molecule. The GO structure is modeled as a directed acyclic graph (DAG). Each node is a GO Term, and each edge represents a relationship between terms (e.g. regulates, has part, part of and is a) [2].

The GO Terms are used to describe genes and gene products of different species. The annotation process enables one to associate each GO term with any number of genes, proteins or molecules. The gene ontology annotation database stores the corpus of annotations; it contains over 200 million annotations and is periodically updated [3]. Each annotation has a source and a database entry attributed to it. The source can be a literature reference, a single database reference, or computational evidence.

There exist 14 different annotation processes, each identified by an evidence code (EC). Each EC denotes the process used to associate a GO term with a concept, and thus explains the basis for the annotation. A main difference among annotation processes is the intervention of humans. Consequently, there are manual annotations, i.e. annotations that have been verified by curators, and annotations that are assigned in a fully automated way [2]. Curators on the basis of a review of the literature experimentally verify the manu
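To make concrete how the DAG and the annotation corpus described above interact, the following is a minimal Python sketch of one common annotation-based way to compute IC. It is not necessarily the measure used in this study, since the excerpt does not give a formula. It assumes IC(t) = -log p(t), where p(t) is the fraction of annotated genes that are annotated to t or to any of its descendants. The term identifiers, gene names and both corpora are hypothetical toy data, not real GO accessions; `IEA` is the GO evidence code for electronically inferred annotations.

```python
import math
from collections import defaultdict

# Toy DAG (assumption): each term maps to the set of its parent terms.
PARENTS = {
    "T:root": set(),
    "T:a": {"T:root"},
    "T:b": {"T:root"},
    "T:c": {"T:a", "T:b"},  # a term may have several parents in a DAG
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def information_content(annotations, parents):
    """Compute IC per term from (gene, term, evidence_code) tuples.

    An annotation to a term is propagated to all of its ancestors.
    Terms with no direct or inherited annotation are omitted, because
    -log(0) is undefined.
    """
    genes_per_term = defaultdict(set)
    all_genes = set()
    for gene, term, _code in annotations:
        all_genes.add(gene)
        for anc in ancestors(term, parents):
            genes_per_term[anc].add(gene)
    total = len(all_genes)
    return {t: -math.log(len(g) / total) for t, g in genes_per_term.items()}


def without_electronic(annotations):
    """Drop fully automated annotations (evidence code IEA)."""
    return [a for a in annotations if a[2] != "IEA"]


# Two hypothetical releases of the same annotation corpus.
corpus_old = [("g1", "T:a", "EXP"), ("g2", "T:b", "IEA")]
corpus_new = corpus_old + [("g3", "T:c", "IEA"), ("g4", "T:b", "IEA")]

ic_old = information_content(corpus_old, PARENTS)
ic_new = information_content(corpus_new, PARENTS)
ic_manual = information_content(without_electronic(corpus_new), PARENTS)

# Show how the IC of each term shifts between the two releases.
for term in sorted(ic_new):
    print(term, ic_old.get(term), ic_new[term])
```

In this toy example the DAG is identical in both releases, yet the IC of `T:b` drops once more genes are annotated to it or to its descendant `T:c`. That is the kind of corpus-driven shift the paper sets out to measure. Filtering by evidence code changes the gene counts as well, so the choice between manual and automated annotations also affects the resulting IC values.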