Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework



SOFTWARE

Open Access

Tanveer Ahmad1*, Nauman Ahmed1, Zaid Al-Ars1* and H. Peter Hofstee1,2

From The 18th Asia Pacific Bioinformatics Conference
Seoul, Korea. 18–20 August 2020

Abstract
Background: Immense improvements in sequencing technologies enable the production of large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need these large amounts of data to be close to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, so processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data sets in memory and process them directly from there, avoiding disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, data must be placed in memory in a suitable format and accessed at high throughput, avoiding (de)-serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory without having to access disk storage, avoiding (de)-serialization and copy overheads.
Implementation: We integrate an Apache Arrow in-memory based Sequence Alignment/Map (SAM) format and its shared memory objects store library into widely used genomics high-throughput data processing applications like BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this allows us to exploit the cache locality of tabular data and the parallel processing capabilities offered by shared memory objects.
Results: Our implementation shows that adopting an in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, a low number of memory accesses due to (Continued on next page)

*Correspondence: [email protected]; [email protected]
1 Accelerated Big Data Systems Group, Quantum & Computer Engineering Department, Delft University of Technology, Delft, The Netherlands
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,