NestMSA: a new multiple sequence alignment algorithm
- PDF / 2,692,077 Bytes
- 21 Pages / 439.37 x 666.142 pts Page_size
- 91 Downloads / 255 Views
NestMSA: a new multiple sequence alignment algorithm Mohammed Kayed1 · Ahmed A. Elngar1
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Multiple sequence alignment (MSA) is a core problem in many applications. Various optimization algorithms such as genetic algorithm and particle swarm optimization (PSO) have been used to solve this problem, where all of them are adapted to work in the bioinformatics domain. This paper defines the MSA problem, suggests a novel MSA algorithm called ‘NestMSA’ and evaluates it in two domains: Web data extraction and removing different URLs with similar text (DUST). The suggested algorithm is inspired by the PSO optimization algorithm. It is not a generalization of a two-sequence alignment algorithm as it processes all the sequences at the same time. Therefore, it looks globally at the same time on all sequences. Different from other PSO-based alignment algorithms, swarm particles in the proposed NestMSA algorithm are nested inside the sequences and communicated together to align them. Therefore, global maximum is guaranteed in our algorithm. Furthermore, this work suggests a new objective function which both maximizes the number of matched characters and minimizes the number of gaps inserted in the sequences. The running time complexity and the efficiency of NestMSA are addressed in this paper. The experiments show an encouraging result as it outperforms the two approaches DCA and TEX in the Web data extraction domain (95% and 96% of recall and precision, respectively). Furthermore, it gives a high-performance result in the DUST domain (95%, 93% and 92% of recall, precision and SPS score, respectively). Keywords Web mining · Particle swarm optimization · Sequence alignment
* Ahmed A. Elngar [email protected] Mohammed Kayed [email protected] 1
Faculty of Computers and Artificial Intelligence, Beni-Suef University, Beni‑Suef, Egypt
13
Vol.:(0123456789)
M. Kayed, A. A. Elngar
1 Introduction Sequence (string) alignment is a common and a core problem in many fields such as information extraction, removing DUST, bioinformatics, NLP, financial data analysis and others. Aligning of characters in multiple data sequences is a challenge because these characters have many variations. The first variation is called missed characters/patterns, in which a character (a pattern of different consecutive characters) may be missed (has no occurrences) in a sequence. Gaps (referred by—in this paper) are inserted among the sequence characters in place of these missed characters. In the second variation, called repetitive characters/patterns, a character (a pattern) may have more than one occurrence in a sequence. The third variation is called multi-order characters/patterns. Characters (patterns) in this variation may have multiple ordering in different sequences. The fourth and last variation is the existence of what are called disjunctive (mutative) characters, in which a character (a pattern) in a sequence may appear as an alternative to another character
Data Loading...