Formalization of Gene Ontology relationships with factor graph towards Biological Process prediction




1 CIFASIS-Conicet Institute, Bv. 27 de Febrero 210 Bis, Rosario, Argentina. Facultad Regional San Nicolás, Colón 332, Universidad Tecnológica Nacional, Argentina. * [email protected]

Abstract— Gene Ontology (GO) is a hierarchical controlled vocabulary for protein annotation. Combined with ensembles of automatic classification methods, it has been widely used for the prediction of protein functions. Current classification methods use only the is a relation, and a few also use part of, to generate the prediction model. In this work we formalize the GO relationships part of, regulates, negatively regulates and positively regulates through predicate logic. This formalization is incorporated into an ensemble method based on factor graphs called Factor Graph GO Annotation. The proposed model is validated on four model organisms for GO Biological Process prediction.

Keywords— Gene Ontology, Factor Graph, Automatic function prediction


Graph GO Annotation (FGGA) [8], which models GO relationships with logical factor nodes. The formalization must respect the TPG constraint that governs the structure and inference within the GO-DAG: “If the child GO-term describes the protein, then all its parent terms must also apply to that protein; and if a GO-term does not describe a protein, then all its descendant GO-terms must not describe it”. The FGGA model extended with these logical factor nodes, hereafter FGGA+, infers functional predictions for proteins using the adapted version of the sum-product algorithm [8]. This paper is organized as follows. In Section II, GO relationships are formalized through predicate logic for inclusion in FGGA+. Section III discusses the results on A. thaliana, D. melanogaster, D. rerio, and C. elegans in BP-GO. The last section presents the conclusions.
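The TPG constraint quoted above can be illustrated with a minimal Python sketch. It is not the FGGA+ factor-graph model: it only checks and repairs hard true/false annotations on a toy DAG. The term identifiers, the edges, and the function names (ancestors, tpg_violations, close_upwards) are hypothetical, and every relation type is treated the same way, which is a simplifying assumption rather than the formalization proposed in this paper.

```python
# Hypothetical toy GO-DAG: child term -> list of (relation, parent term).
# Term names and edges are illustrative only, not taken from the real ontology.
EDGES = {
    "GO:C": [("is_a", "GO:B")],
    "GO:B": [("is_a", "GO:A"), ("part_of", "GO:D")],
    "GO:D": [("is_a", "GO:A")],
    "GO:A": [],
}


def ancestors(term, edges):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for _relation, parent in edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def tpg_violations(annotation, edges):
    """List (child, parent) edges where the child is positive but the parent is not.

    Checking every edge is enough: if no edge is violated, a positive term has
    only positive ancestors and a negative term has only negative descendants.
    Assumption: all relation types propagate annotations identically.
    """
    violations = []
    for child, parents in edges.items():
        for _relation, parent in parents:
            if annotation.get(child, False) and not annotation.get(parent, False):
                violations.append((child, parent))
    return sorted(violations)


def close_upwards(positive_terms, edges):
    """Make a set of positive terms consistent by adding all their ancestors."""
    closed = set(positive_terms)
    for term in positive_terms:
        closed |= ancestors(term, edges)
    return closed


if __name__ == "__main__":
    labels = {"GO:C": True, "GO:B": True, "GO:A": True, "GO:D": False}
    print(tpg_violations(labels, EDGES))
    print(sorted(close_upwards({"GO:C"}, EDGES)))
```

In this sketch the inconsistent labelling is reported edge by edge, while close_upwards shows the simplest repair, namely propagating a positive prediction to every ancestor term.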

I. INTRODUCTION

The high throughput of sequencing technologies provides huge amounts of data, opening broad opportunities for a better understanding of the biological behavior of target organisms. Machine learning methods can provide a first approach to data analysis, focusing experiments and saving time and money. A central goal of genomic research is to establish the biological functions of proteins, a task also called annotation. Gene Ontology (GO) provides a hierarchical architecture of biological functions [1] which may guide the automatic annotation of protein function. GO is composed of three sub-ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). Each of them is a Directed Acyclic Graph (DAG), where every node represents a GO-term (a biological function) and every edge represents a relationship between two GO-terms. The commonly used relationships in GO are: is a (is a subtype of), part of, regulates, negatively regulates and positively regulates [2]. Traditional ensemble methods for automatic function prediction based on GO consider the is a relationship [3], [4], [5], and a few also consider the part of relationship [6]. In this paper, we propose the formalization of GO relationships beyond is a fo