Genome data classification based on fuzzy matching


ORIGINAL RESEARCH

Nagamma Patil • Durga Toshniwal • Kumkum Garg

Received: 18 June 2012 / Accepted: 13 August 2012 / Published online: 13 October 2012
© CSI Publications 2012

Abstract Genomic data mining and knowledge extraction are important problems in bioinformatics. Existing work on unknown genome identification is based on exact pattern matching of n-grams. In most real-world biological problems, exact matching may not give the desired results, and the number of possible n-grams grows exponentially with n. In this paper we propose a method for genome data classification based on approximate matching. The algorithm selects random samples from the genome database. Tolerance is allowed by generating query candidates of varied length from these sample sequences. The Levenshtein distance is then computed for each candidate to determine whether the strings are k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated, and the result is classified using the data mining techniques naive Bayes, support vector machine, back propagation, and nearest neighbor. Experimental results are provided for different tolerance levels and show that accuracy increases with tolerance. We also examine the effect of sampling size and observe that classification accuracy increases with it. Genome data of two species, Yeast and E. coli, are used to verify the proposed method.

N. Patil (✉) • D. Toshniwal
Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, India
e-mail: [email protected]; [email protected]

D. Toshniwal
e-mail: [email protected]

K. Garg
Department of Computer Science & Engineering, Manipal University, Jaipur, India
e-mail: [email protected]

Keywords Bioinformatics • Soft computing • Genome data • Data mining • Approximate pattern matching • Exact matching
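
The matching step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`levenshtein`, `fuzzy_equal`, `sample_patterns`, `count_fuzzy_matches`) are hypothetical, and it assumes that "k-fuzzily equal" means a Levenshtein distance of at most k and that "candidates of varied length" are windows whose length differs from the sampled pattern by at most k.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_equal(a: str, b: str, k: int) -> bool:
    """Assumed definition: two strings are k-fuzzily equal if distance <= k."""
    return levenshtein(a, b) <= k


def sample_patterns(sequences, n_samples, length, seed=0):
    """Draw random substrings of a fixed length from the given sequences."""
    rng = random.Random(seed)
    eligible = [s for s in sequences if len(s) >= length]
    patterns = []
    for _ in range(n_samples):
        seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - length + 1)
        patterns.append(seq[start:start + length])
    return patterns


def count_fuzzy_matches(sequence: str, pattern: str, k: int) -> int:
    """Count start positions with a window within distance k of the pattern.

    Candidate windows have lengths len(pattern) - k .. len(pattern) + k
    (an assumption about how varied-length candidates are generated).
    """
    m = len(pattern)
    count = 0
    for start in range(len(sequence)):
        for length in range(max(1, m - k), m + k + 1):
            if start + length > len(sequence):
                break  # longer windows would run past the end
            if fuzzy_equal(sequence[start:start + length], pattern, k):
                count += 1
                break  # count each start position at most once
    return count
```

Under these assumptions, the vector of match counts for one genome sequence against all sampled patterns could serve as the feature vector passed to a classifier. With k = 0 the count reduces to exact matching, so raising k can only keep or increase the number of matches.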

1 Introduction

Bioinformatics [1–3] has emerged as a forefront research area in the recent past, since biological data is accumulating at an accelerated rate. In particular, the number and size of genome databases have grown rapidly over the last few years. One of the most important problems is automatically determining the group to which a previously unseen genome sequence belongs [4]. Classifying organisms from their genomic data into groups within a taxonomical hierarchy has several applications, including specific identification of an unknown organism, study of evolutionary characteristics, and study of the mutual relationships existing between organisms [5]. Currently more than a million organisms have been discovered, but a large number are yet to be discovered. A systematic study of an organism can be done only once it has been identified as belonging to a particular group. Thus genome identification finds wide application in evolutionary studies of organisms. Classification and species identification have also been associated with practical app
