NestMSA: a new multiple sequence alignment algorithm

PDF / 2,692,077 Bytes
21 Pages / 439.37 x 666.142 pts Page_size
91 Downloads / 337 Views

NestMSA: a new multiple sequence alignment algorithm Mohammed Kayed1 · Ahmed A. Elngar1

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Multiple sequence alignment (MSA) is a core problem in many applications. Various optimization algorithms such as genetic algorithm and particle swarm optimization (PSO) have been used to solve this problem, where all of them are adapted to work in the bioinformatics domain. This paper defines the MSA problem, suggests a novel MSA algorithm called ‘NestMSA’ and evaluates it in two domains: Web data extraction and removing different URLs with similar text (DUST). The suggested algorithm is inspired by the PSO optimization algorithm. It is not a generalization of a two-sequence alignment algorithm as it processes all the sequences at the same time. Therefore, it looks globally at the same time on all sequences. Different from other PSO-based alignment algorithms, swarm particles in the proposed NestMSA algorithm are nested inside the sequences and communicated together to align them. Therefore, global maximum is guaranteed in our algorithm. Furthermore, this work suggests a new objective function which both maximizes the number of matched characters and minimizes the number of gaps inserted in the sequences. The running time complexity and the efficiency of NestMSA are addressed in this paper. The experiments show an encouraging result as it outperforms the two approaches DCA and TEX in the Web data extraction domain (95% and 96% of recall and precision, respectively). Furthermore, it gives a high-performance result in the DUST domain (95%, 93% and 92% of recall, precision and SPS score, respectively). Keywords Web mining · Particle swarm optimization · Sequence alignment

* Ahmed A. Elngar [email protected] Mohammed Kayed [email protected] 1

Faculty of Computers and Artificial Intelligence, Beni-Suef University, Beni‑Suef, Egypt

13

Vol.:(0123456789)

M. Kayed, A. A. Elngar

1 Introduction Sequence (string) alignment is a common and a core problem in many fields such as information extraction, removing DUST, bioinformatics, NLP, financial data analysis and others. Aligning of characters in multiple data sequences is a challenge because these characters have many variations. The first variation is called missed characters/patterns, in which a character (a pattern of different consecutive characters) may be missed (has no occurrences) in a sequence. Gaps (referred by—in this paper) are inserted among the sequence characters in place of these missed characters. In the second variation, called repetitive characters/patterns, a character (a pattern) may have more than one occurrence in a sequence. The third variation is called multi-order characters/patterns. Characters (patterns) in this variation may have multiple ordering in different sequences. The fourth and last variation is the existence of what are called disjunctive (mutative) characters, in which a character (a pattern) in a sequence may appear as an alternative to another character

Data Loading...

NestMSA: a new multiple sequence alignment algorithm

Recommend Documents

Multiple Sequence Alignment

Multiple Sequence Alignment Methods

Multiple Sequence Alignment

An ant colony algorithm for multiple sequence alignment in bioinformatics

Revisiting Evaluation of Multiple Sequence Alignment Methods

Multiple Sequence Alignment Methods and Protocols

A Novel Multiple Sequence Alignment Algorithm Based on Artificial Bee Colony and Particle Swarm Optimization

A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm

Sequence Alignment

Multiple Sequence Alignment Computation Using the T-Coffee Regressive Algorithm Implementation

Fast and Accurate Multiple Sequence Alignment with MSAProbs-MPI

lamassemble: Multiple Alignment and Consensus Sequence of Long Reads