Modified Differential Evolution for Biochemical Name Recognizer
Abstract. In this paper we propose modified differential evolution (MDE) based feature selection and ensemble learning algorithms for a biochemical entity recognizer. Identification and classification of chemical entities are more complex and challenging than other related tasks. As chemical entities we focus on IUPAC and IUPAC-related entities. The algorithm performs feature selection within the framework of a robust machine learning algorithm, namely Conditional Random Field (CRF). Features are identified and implemented mostly without using any domain-specific knowledge and/or resources. We modify traditional differential evolution to perform two tasks: determining the relevant set of features and determining proper voting weights for constructing an ensemble. The feature selection technique produces a set of potential solutions in the final population. We build multiple CRF models using these feature combinations. To further improve performance, the outputs of these classifiers are combined using a classifier ensemble technique based on modified DE. Our experiments with the benchmark datasets yield recall, precision and F-measure values of 82.34%, 88.26% and 85.20%, respectively.
Keywords: Modified Differential Evolution (MDE), Conditional Random Field (CRF), Feature Selection, Ensemble, Biochemical Named Entity.
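As an illustration of the feature selection idea, the following is a minimal sketch of classic differential evolution (DE/rand/1/bin) searching over feature subsets. It is not the modified DE proposed in the paper, whose details are not given in this excerpt. The names `de_feature_selection` and `toy_fitness`, the 0.5 decoding threshold, and all parameter values are assumptions made for the example; in particular, `toy_fitness` is a hypothetical stand-in for training a CRF on the selected features and returning its F-measure.

```python
import random

def de_feature_selection(fitness, n_features, pop_size=12, generations=30,
                         f=0.5, cr=0.9, seed=0):
    """Classic DE/rand/1/bin over real vectors in [0, 1]; a feature counts as
    selected when its gene is >= 0.5. Returns (best_mask, best_score)."""
    rng = random.Random(seed)

    def decode(vec):
        return [v >= 0.5 for v in vec]

    pop = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    scores = [fitness(decode(ind)) for ind in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(n_features)  # forces at least one mutated gene
            trial = []
            for j in range(n_features):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    v = min(1.0, max(0.0, v))  # clip back into [0, 1]
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_score = fitness(decode(trial))
            # Greedy selection, applied in place: keep the trial if no worse.
            if trial_score >= scores[i]:
                pop[i], scores[i] = trial, trial_score

    best = max(range(pop_size), key=lambda k: scores[k])
    return decode(pop[best]), scores[best]


# Hypothetical stand-in for "train a CRF on these features, return F-measure":
# features 0, 2 and 5 help; every other selected feature costs a little.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(1 for j, on in enumerate(mask) if on and j in USEFUL)
    extras = sum(1 for j, on in enumerate(mask) if on and j not in USEFUL)
    return hits - 0.1 * extras

best_mask, best_score = de_feature_selection(toy_fitness, n_features=8)
```

The same loop could in principle search over real-valued voting weights instead of feature masks by dropping the decoding step and scoring the weighted ensemble directly; how the paper's MDE handles either task may differ from this sketch.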
U.K. Sikdar, A. Ekbal, and S. Saha
A. Gelbukh (Ed.): CICLing 2014, Part I, LNCS 8403, pp. 225–236, 2014. © Springer-Verlag Berlin Heidelberg 2014
1 Introduction
In recent times, information extraction has attracted considerable attention from practitioners and researchers. A large amount of online information is unorganized, and a large number of documents are added to it daily, so organizing, finding and extracting relevant information from such a huge amount of data is an important challenge in day-to-day life. In life science publications and patents, chemical compounds such as small signal molecules or other biologically active chemical substances are important entity classes. There exist many representations and nomenclatures for chemical names. Some examples are SMILES, InChI and IUPAC, of which the first two allow a direct structure search, but IUPAC-like names are more frequent in biochemical texts. Trivial chemical names can be easily found using a dictionary-based approach and can subsequently be mapped to their corresponding structures. In contrast, it is not feasible to enumerate all IUPAC-like names. Automatic identification of mentions of chemical compounds in text is of interest for a variety of reasons. It has potential applications in different text mining tasks that include, but are not limited to, the prediction of drug-drug/protein-protein interactions, finding relations to adverse reactions of chemical compounds and their associations with toxicological endpoints, and the extraction of pathway and metabolic reaction relations. It helps in semantic search by enabling the search engine to return documents containing elements of the e