An Improved Algorithm for MicroRNA Profiling from Next Generation Sequencing Data


1 College of Engineering Trivandrum, Thiruvananthapuram, Kerala, India
[email protected]
2 Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
3 Computer Center, University of Kerala, Thiruvananthapuram, India

Abstract. Next Generation Sequencing (NGS) is a massively parallel, low-cost method capable of sequencing millions of DNA fragments from a sample. Consequently, a huge quantity of data is generated, and new research challenges have emerged in the storage, retrieval and processing of this bulk data. microRNAs are non-coding RNA sequences of around 18 to 24 nucleotides in length. microRNA expression profiling is a measure of the relative abundance of microRNA sequences in a sample. This paper discusses algorithms for pre-processing reads and a faster Bit Parallel Profiling (BPP) algorithm to quantify microRNAs. Experimental results show that adapter removal is accomplished with an accuracy of 91.2 %, a sensitivity of 89.5 % and a specificity of 89.5 %. In profiling, BPP outperforms an existing tool, Bowtie, in speed of operation.

1 Introduction

Nucleic acid sequencing is the process of finding the exact order of nucleotides present in a given DNA or RNA molecule. The first major endeavor in DNA sequencing was the Human Genome Project, a 13-year project completed in 2003 that employed Sanger sequencing. The demand for a faster and cheaper alternative led to the development of Next Generation Sequencing (NGS) [1]. NGS platforms are massively parallel, low-cost, high-throughput sequencing methods. Examples include the Life Technologies Ion Torrent Personal Genome Machine (PGM), Illumina HiSeq, Roche GS FLX+ or 454, and ABI SOLiD. RNA-Seq is an NGS technique developed to analyze transcripts such as mRNAs, small RNAs and non-coding RNAs [2]. Read lengths range from approximately 400 base pairs for longer reads down to 30 base pairs for shorter reads. NGS data analysis is a challenging big data task, as millions of reads must be pre-processed and aligned to a genome or assembled into a transcriptome before the required downstream analysis can be performed. The following sections discuss the steps involved in NGS data processing and the algorithms employed.

© Springer International Publishing Switzerland 2016. Y. Tan and Y. Shi (Eds.): DMBD 2016, LNCS 9714, pp. 38–47, 2016. DOI: 10.1007/978-3-319-40973-3_4

1.1 Pre-processing

During library preparation, an adapter sequence or fragments of an adapter sequence are added to a read. Adapters are not part of the biological sequence and need to be removed before further processing of reads. Otherwise, they may cause missed alignments or the discarding of genuine matches, and ultimately a wrong analysis. A sequence read in fastq format consists of four lines: (i) sequence identifier, (ii) actual sequence, (iii) quality score identifier and (iv) quality score. The quality score is a measure of error probability associated with ea