Genome data classification based on fuzzy matching


ORIGINAL RESEARCH

Nagamma Patil • Durga Toshniwal • Kumkum Garg



Received: 18 June 2012 / Accepted: 13 August 2012 / Published online: 13 October 2012 © CSI Publications 2012

Abstract Genomic data mining and knowledge extraction are important problems in bioinformatics. Some research has addressed unknown genome identification, based on exact pattern matching of n-grams. In most real-world biological problems, exact matching may not give the desired results, and n-grams suffer from exponential explosion. In this paper we propose a method for genome data classification based on approximate matching. The algorithm works by selecting random samples from the genome database. Tolerance is allowed by generating candidates of varied length to query from these sample sequences. The Levenshtein distance is then computed for each candidate to check whether the strings are k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated. These counts are then classified using the data mining techniques naive Bayes, support vector machine, back propagation, and nearest neighbor. Experimental results are provided for different tolerance levels and show that accuracy increases with tolerance. We also show the effect of sampling size on classification accuracy: accuracy increases with sampling size. Genome data of two species, Yeast and E. coli, are used to verify the proposed method.

N. Patil (✉) · D. Toshniwal
Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, India
e-mail: [email protected]; [email protected]

D. Toshniwal
e-mail: [email protected]

K. Garg
Department of Computer Science & Engineering, Manipal University, Jaipur, India
e-mail: [email protected]

Keywords Bioinformatics · Soft computing · Genome data · Data mining · Approximate pattern matching · Exact matching
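
The feature-extraction steps summarized in the abstract (random sampling, candidates of varied length, Levenshtein distance, counting k-fuzzy matches) can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names are hypothetical, and reading "candidates of varied length" as sliding windows whose length ranges from m - k to m + k (for a query of length m and tolerance k) is an assumption made here for illustration. The classification stage (naive Bayes, support vector machine, back propagation, nearest neighbor) is left out; the resulting count vectors would simply be passed to such a classifier.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def k_fuzzy_equal(a: str, b: str, k: int) -> bool:
    """Assumed definition: two strings are k-fuzzily equal if distance <= k."""
    return levenshtein(a, b) <= k


def count_fuzzy_matches(sequence: str, query: str, k: int) -> int:
    """Count windows of `sequence` that are k-fuzzily equal to `query`.

    Assumption: candidate windows have lengths len(query) - k .. len(query) + k.
    """
    m = len(query)
    count = 0
    for length in range(max(1, m - k), m + k + 1):
        for start in range(len(sequence) - length + 1):
            if k_fuzzy_equal(sequence[start:start + length], query, k):
                count += 1
    return count


def sample_queries(database, n_samples: int, length: int, seed: int = 0):
    """Draw random substrings of a fixed length from the sequence database.

    Raises IndexError if no sequence is at least `length` characters long.
    """
    rng = random.Random(seed)
    eligible = [s for s in database if len(s) >= length]
    queries = []
    for _ in range(n_samples):
        seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - length + 1)
        queries.append(seq[start:start + length])
    return queries


def feature_vector(sequence: str, queries, k: int):
    """One fuzzy-match count per sampled query; input to a classifier."""
    return [count_fuzzy_matches(sequence, q, k) for q in queries]
```

With k = 0 the count reduces to exact matching. Raising k can only add matches, because every window accepted at a lower tolerance is still accepted at a higher one. The brute-force scan is quadratic per window and is meant only to make the steps concrete, not to scale to full genomes.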

1 Introduction

Bioinformatics [1–3] has emerged as a forefront research area in the recent past, since biological data is accumulating at an accelerated rate. In particular, the number and sizes of genome databases have grown rapidly over the last few years. One of the most important problems is automatically determining the group to which a previously unseen genome sequence belongs [4]. Classifying organisms from their genomic data into groups within a taxonomical hierarchy has several applications, including specific identification of an unknown organism, study of evolutionary characteristics, and study of the mutual relationships existing between organisms [5]. Currently more than a million organisms have been discovered, but a large number are yet to be discovered. A systematic study of an organism can be done only once it has been identified as belonging to a particular group. Thus genome identification finds wide application in evolutionary studies of organisms. Classification and species identification have also been associated with practical app