WBTC: a new approach for efficient storage of genomic data
ORIGINAL RESEARCH
Sanjeev Kumar1 · Suneeta Agarwal1 · Ranvijay1

Received: 21 February 2019 / Accepted: 9 May 2020
© Bharati Vidyapeeth's Institute of Computer Applications and Management 2020

Corresponding author: Sanjeev Kumar, [email protected]

1 Department of Computer Science and Engineering, MNNIT Allahabad, Allahabad, India
Abstract With the improvement in high-throughput genome sequencing technology, huge amounts of genomic data are generated every day. These data are used in numerous applications such as sequence alignment, drug discovery and personalized medicine. To handle genomic data efficiently for storage, processing and transmission, a specialized genomic data compression approach is needed. In this paper, a hybrid approach, WBTC (Word Based Compression Technique), based on a statistical and substitution model, is proposed for genome compression. WBTC supports genomic data in raw form as well as in Fasta/Multi-Fasta file formats. WBTC is a lossless genome compression algorithm in which searching is possible without full decompression. Experiments show that the proposed algorithm, WBTC, outperforms other state-of-the-art algorithms with respect to compression ratio, compression time, decompression time, compression memory and decompression memory.

Keywords Genome compression · Fasta · Multi-Fasta · Encoding · Decoding
1 Introduction

Next-generation sequencing technology produces large amounts of sequenced genomic data every day [1], and the size of these data is huge. Genomic data also have unique characteristics, such as a high degree of repetitiveness and a small number of distinct bases (A/C/G/T/U)
[2]. To store, transfer and process these data efficiently, a compression technique is required, as compression reduces these costs drastically. General-purpose compression algorithms are not well suited to this type of data, as they do not exploit the characteristics of biological sequences, such as a small alphabet size, a large number of repeats and palindromic repeats [3]. Therefore, there is a need to develop specialized genome compression algorithms that exploit these characteristics of biological sequences.

Genomic data compression algorithms are categorized into naive bit encoding, dictionary-based, statistical and referential encoding [16]. In naive bit encoding, a fixed-length code is used for each symbol [14, 15]; a minimal sketch of this idea is given at the end of this section. These algorithms are fast, but the compression ratio is not very good, and searching is not possible without decompression. In dictionary-based algorithms, a dictionary is prepared for repeated sequences and encoding is then done with respect to this dictionary [4, 18]. The compression ratios of these algorithms are similar to those of naive bit encoding; their disadvantage is that the dictionary is required during decoding [13]. In statistical or entropy encoding methods, a statistical model of the input text is built, which predicts the next symbol [9]. The compression ratio of these methods depends entirely on the reliability of the prediction model [17]. In referential
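To make the naive bit encoding category concrete, the following Python sketch (an illustration only, not the WBTC algorithm proposed in this paper) packs each base A/C/G/T into a fixed-length 2-bit code, giving roughly a 4:1 reduction over 8-bit ASCII; the function names and the particular base-to-code mapping are assumptions chosen for the example.

    # Illustrative naive 2-bit encoding of a DNA string (not the WBTC method).
    # Each base maps to a fixed 2-bit code; four bases are packed per byte.

    BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

    def pack(sequence: str) -> bytes:
        """Pack a DNA string containing only A/C/G/T into 2 bits per base."""
        out = bytearray()
        byte, filled = 0, 0
        for base in sequence:
            byte = (byte << 2) | BASE_TO_BITS[base]
            filled += 1
            if filled == 4:                  # four bases fill one byte
                out.append(byte)
                byte, filled = 0, 0
        if filled:                           # left-pad the final partial byte
            out.append(byte << (2 * (4 - filled)))
        return bytes(out)

    def unpack(packed: bytes, length: int) -> str:
        """Recover the first `length` bases from the packed representation."""
        bases = []
        for byte in packed:
            for shift in (6, 4, 2, 0):
                bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
        return "".join(bases[:length])

    if __name__ == "__main__":
        seq = "ACGTACGTTGCA"
        packed = pack(seq)
        assert unpack(packed, len(seq)) == seq
        print(len(seq), "bases ->", len(packed), "bytes")

A fixed 2-bit code like this cannot represent ambiguity symbols such as N, and, as noted above, it offers only a modest compression ratio and no searching without decompression, which is what motivates the more elaborate dictionary-based, statistical and referential schemes discussed here.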