Distribution on Contingency of Alignment of Two Literal Sequences Under Constrains
Lorentz Jäntschi • Sorana D. Bolboacă
Received: 1 July 2014 / Accepted: 5 December 2014 / Published online: 19 December 2014
© Springer Science+Business Media Dordrecht 2014
Abstract The case of ungapped alignment of two literal sequences under constraints is considered. The analysis led to general formulas for the probability mass function and the cumulative distribution function for the general case of an alphabet with a chosen number of letters (e.g. 4 for deoxyribonucleic acid sequences) used to express the literal sequences. Formulas for three statistics (mean, mode, and standard deviation) were obtained. Distributions are depicted for three important particular cases: alignment of binary sequences, alignment of trinomial series (such as those coming from the generalized Kronecker delta), and alignment of genetic sequences (with four literals in the alphabet). The particular case in which each letter of the alphabet occurs at least once in both sequences has also been analyzed, and some statistics for this restricted case are given.
L. Jäntschi
Department of Physics and Chemistry, Technical University of Cluj-Napoca, 103-105 Muncii Bvd., 400641 Cluj-Napoca, Romania
e-mail: [email protected]

L. Jäntschi
Institute for Doctoral Studies, Babeș-Bolyai University, 1st Mihail Kogalniceanu Street, 400084 Cluj-Napoca, Romania

L. Jäntschi, S. D. Bolboacă (✉)
University of Agricultural Science and Veterinary Medicine Cluj-Napoca, 3-5 Calea Mănăștur, 400372 Cluj-Napoca, Romania
e-mail: [email protected]

L. Jäntschi
Department of Chemistry, The University of Oradea, 1st Universității Street, 410087 Oradea, Romania

S. D. Bolboacă
Department of Medical Informatics and Biostatistics, Iuliu Hațieganu University of Medicine and Pharmacy, 6 Louis Pasteur, 400349 Cluj-Napoca, Romania
Keywords Alignment · Contingency matrix · Probability mass function (PMF) · Cumulative distribution function (CDF)
1 Introduction
Research on sequence alignment is common because of the huge number of DNA (deoxyribonucleic acid), RNA (ribonucleic acid), and protein sequences already identified (Pruitt et al. 2012). Sequence alignment is defined as a way of arranging DNA, RNA (Allali et al. 2012), or amino acid (Mongiovì and Sharan 2013) sequences to identify similar regions that could reflect functional, structural, or evolutionary relationships between the sequences (Mount 2004). Several algorithms have been developed and implemented for global (Rahrig et al. 2013; Szalkowski and Anisimova 2013) or local alignment (Phuong et al. 2006; Tabei and Asai 2009; Frith et al. 2010), each with certain advantages and disadvantages. For example, the approach proposed by Szalkowski and Anisimova (2013) detects insertions and deletions of TR (tandem repeat) units not restricted to TR unit boundaries and performed better (~10 %) than other aligners for cases with diverge