BioDQ: Data Quality Estimation and Management for Genomics Databases

Abstract. We present BIODQ, a model for estimating and managing the quality of biological data in genomics repositories. BIODQ uses our Quality Estimation Model (QEM), which has been implemented as part of the Quality Management Architecture (QMA). The QEM consists of a set of quality dimensions and their quantitative measures. The QMA combines a series of software components that enable the integration of the QEM with existing genomics repositories. The basis of our experimental evaluation is a research study conducted among biologists. Evaluation results show that the QEM dimensions and estimations are biologically relevant and useful for discriminating high-quality from low-quality data. The most relevant capabilities of the QMA are also presented.

Keywords: Data Quality, Genomics Databases, GenBank, RefSeq, quality dimension, measure, estimation, management, classification, architecture.

1 Introduction

The rapid accumulation of biological information, together with its widespread use by scientists to carry out research, poses new challenges for determining and managing the quality of data in public genomics repositories. GenBank [1], RefSeq [2], and Swissprot [3] are prominent examples of public repositories used extensively by genomics researchers, practitioners, and biologists in general. Analyzing and processing low-quality data may waste time and resources, or may lead scientists to false conclusions, thus hampering scientific progress.

Several quality models and assessment methodologies have been proposed in the literature, but most were developed in the context of enterprise data warehousing and address quality problems found in the business domain. These methodologies do not fit naturally into the genomics context because biological data is more complex and less structured than typical business data. In addition, the increasing rates of data generation and usage limit the kinds of quality assessment that can realistically be performed. We therefore believe that there is a need for automated quality assessment techniques that provide users of genomics data sources with objective and quantitative estimates of the quality of available data.

* The author's current affiliation and address is: Microsoft Corp., One Microsoft Way, Redmond, WA 98052. Email: [email protected]

I. Măndoiu, R. Sunderraman, and A. Zelikovsky (Eds.): ISBRA 2008, LNBI 4983, pp. 469–480, 2008. © Springer-Verlag Berlin Heidelberg 2008

A. Martinez, J. Hammer, and S. Ranka

1.1 How Do Genomics Data Sources Currently Manage Quality?

To discover how public repositories of genomics data manage quality, we focused our study on the databases of the National Center for Biotechnology Information (NCBI) [4] because of their widespread use by the scientific community. The three major quality-related problems we found are described next. First, genomics data sources currently provide minimal information about the quality of the stored data. Some repositories offer base-calling scores, which are quality indicators of the