A Structured Interface to the Object-Oriented Genomics Unified Schema for XML-Formatted Data


METHODOLOGY

© 2005 Adis Data Information BV. All rights reserved.

Terry Clark,1 Josef Jurek,2 Gregory Kettler2 and Daphne Preuss2

1 Department of Electrical Engineering and Computer Science, The University of Kansas, Lawrence, Kansas, USA
2 Department of Molecular Genetics and Cell Biology, The University of Chicago, Chicago, Illinois, USA

Abstract

Data management systems are fast becoming required components in many biology laboratories as the role of computer-based information grows. Although the need for data management systems is on the rise, their inherent complexities can deter the full and routine use of their computational capabilities. The significant undertaking to implement a capable production system can be reduced in part by adapting an established data management system. In such a way, we are leveraging the Genomics Unified Schema (GUS) developed at the Computational Biology and Informatics Laboratory at the University of Pennsylvania as a foundation for managing and analysing DNA sequence data in centromere research projects around Arabidopsis thaliana and related species. Because GUS provides a core schema that includes support for genome sequences, mRNA and its expression, and annotated chromosomes, it is ideal for synthesising a variety of parameters to analyse these repetitive and highly dynamic portions of the genome. Despite this, production-strength data management frameworks are complex, requiring dedicated efforts to adapt and maintain. The work reported in this article addresses one component of such an effort, namely the pivotal task of marshalling data from various sources into GUS. In order to harness GUS for our project, and motivated by efficiency needs, we developed a structured framework for transferring data into GUS from outside sources. This technology is embodied in a GUS object-layer processor, XMLGUS. XMLGUS facilitates incorporating data into GUS by (i) formulating an XML interface that includes relational database key constraint definitions, (ii) regularising traversal through that XML, (iii) realising automatic processing of the XML with database key constraints and (iv) allowing for special processing of input data within the framework for automated processing. 
We discuss the application of XMLGUS to production pipeline processing for a sequencing project and to loading the Arabidopsis genome into GUS. XMLGUS is available from the Flora website (http://flora.ittc.ku.edu/).
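As a rough illustration of points (i) to (iii), the following Python sketch walks a small XML document in a fixed depth-first order and fills in foreign keys from the enclosing element. It is not the XMLGUS implementation or its input format: the element names (`row`, `col`), the `table` and `fk` attributes, the table names and the `load` function are all hypothetical, chosen only to show how key constraints declared in the XML can drive automatic processing.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical input (not the XMLGUS format): each <row> is one table row,
# and its "fk" attribute names the column that references the parent row's key.
DOC = """
<row table="sequence">
  <col name="name">chr1</col>
  <row table="feature" fk="sequence_id">
    <col name="kind">centromere</col>
  </row>
  <row table="feature" fk="sequence_id">
    <col name="kind">gene</col>
  </row>
</row>
"""


def load(xml_text):
    """Walk the XML depth-first and return (table, row) pairs in insert order."""
    ids = count(1)  # stand-in for database-generated primary keys
    out = []

    def visit(elem, parent_id=None):
        row = {"id": next(ids)}
        # Resolve the declared key constraint from the enclosing row, if any.
        if parent_id is not None and "fk" in elem.attrib:
            row[elem.attrib["fk"]] = parent_id
        # Direct <col> children carry the ordinary column values.
        for col in elem.findall("col"):
            row[col.attrib["name"]] = (col.text or "").strip()
        # Emit the parent before its children so referenced keys already exist.
        out.append((elem.attrib["table"], row))
        for child in elem.findall("row"):
            visit(child, row["id"])

    visit(ET.fromstring(xml_text))
    return out


rows = load(DOC)
```

In this sketch a parent row is always emitted before its children, so a loader could insert the rows in list order without violating the key constraints. Any special-case handling of particular input data, as in point (iv), would hook into `visit`.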

The pronounced rise in computational models applied to molecular biology brings with it requirements for data management systems. Data integration from sundry sources adds to the requirement for management solutions, as shown by the number of databases of molecular biology information[1] and sequence data that are accumulating at exponential rates at central distribution hubs.[2] Although national centres provide central distribution of public domain data along with analysis services, laboratories generatin