WBTC: a new approach for efficient storage of genomic data
ORIGINAL RESEARCH
Sanjeev Kumar1 · Suneeta Agarwal1 · Ranvijay1

Received: 21 February 2019 / Accepted: 9 May 2020
© Bharati Vidyapeeth's Institute of Computer Applications and Management 2020

Corresponding author: Sanjeev Kumar, [email protected]

1 Department of Computer Science and Engineering, MNNIT Allahabad, Allahabad, India
Abstract With the improvement in high-throughput genome sequencing technology, huge amounts of genomic data are generated every day. These data are used in numerous applications such as sequence alignment, drug discovery and personalized medicine. To handle genomic data efficiently for storage, processing and transmission, a specialized genomic data compression approach is needed. In this paper, a hybrid approach, WBTC (Word Based Compression Technique), based on a statistical and substitution model, is proposed for genome compression. WBTC supports genomic data in raw form as well as in Fasta/Multi-Fasta file formats. WBTC is a lossless genome compression algorithm in which searching is possible without full decompression. Experiments show that the proposed algorithm, WBTC, outperforms other state-of-the-art algorithms with respect to compression ratio, compression time, decompression time, compression memory and decompression memory.

Keywords Genome compression · Fasta · Multi-Fasta · Encoding · Decoding
1 Introduction

Next-generation sequencing technology produces large amounts of sequenced genomic data every day [1], and the size of these data is huge. Genomic data also have unique characteristics, such as a high degree of repetitiveness and a small number of distinct bases (A/C/G/T/U)
[2]. To store, transfer and process these data efficiently, a compression technique is required, as compression reduces these costs drastically. General-purpose compression algorithms are not well suited to this type of data, as they do not exploit the characteristics of biological sequences, such as a small alphabet size, a large number of repeats and palindromic repeats [3]. Therefore, there is a need to develop specialized genome compression algorithms that exploit these characteristics of biological sequences.

Genomic data compression algorithms are categorized into naive bit encoding, dictionary-based, statistical and referential encoding [16]. In naive bit encoding, a fixed-length code is used for each symbol [14, 15]; a minimal sketch of this idea is given at the end of this section. These algorithms are fast, but the compression ratio is not very good, and searching is not possible without decompression. In dictionary-based algorithms, a dictionary is prepared for repeated sequences and encoding is then done with respect to this dictionary [4, 18]. The compression ratios of these algorithms are similar to those of naive bit encoding; their disadvantage is that the dictionary is required during decoding [13]. In statistical or entropy encoding methods, a statistical model of the input text is built, which predicts the next symbol [9]. The compression ratio of these methods depends entirely on the reliability of the prediction model [17]. In referential
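To make the naive bit encoding category concrete, the following Python sketch (an illustration only, not the WBTC algorithm proposed in this paper) packs each base A/C/G/T into a fixed-length 2-bit code, giving roughly a 4:1 reduction over 8-bit ASCII; the function names and the particular base-to-code mapping are assumptions chosen for the example.

    # Illustrative naive 2-bit encoding of a DNA string (not the WBTC method).
    # Each base maps to a fixed 2-bit code; four bases are packed per byte.

    BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

    def pack(sequence: str) -> bytes:
        """Pack a DNA string containing only A/C/G/T into 2 bits per base."""
        out = bytearray()
        byte, filled = 0, 0
        for base in sequence:
            byte = (byte << 2) | BASE_TO_BITS[base]
            filled += 1
            if filled == 4:                  # four bases fill one byte
                out.append(byte)
                byte, filled = 0, 0
        if filled:                           # left-pad the final partial byte
            out.append(byte << (2 * (4 - filled)))
        return bytes(out)

    def unpack(packed: bytes, length: int) -> str:
        """Recover the first `length` bases from the packed representation."""
        bases = []
        for byte in packed:
            for shift in (6, 4, 2, 0):
                bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
        return "".join(bases[:length])

    if __name__ == "__main__":
        seq = "ACGTACGTTGCA"
        packed = pack(seq)
        assert unpack(packed, len(seq)) == seq
        print(len(seq), "bases ->", len(packed), "bytes")

A fixed 2-bit code like this cannot represent ambiguity symbols such as N, and, as noted above, it offers only a modest compression ratio and no searching without decompression, which is what motivates the more elaborate dictionary-based, statistical and referential schemes discussed here.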