BiobankCloud: A Platform for the Secure Storage, Sharing, and Processing of Large Biomedical Data Sets




1 KTH - Royal Institute of Technology, Stockholm, Sweden {jdowling,gholami,mahh,maism,erwinl,smkniazi}@kth.se
2 Humboldt-Universität zu Berlin, Berlin, Germany {joergen.brandt,bux,leser}@informatik.hu-berlin.de
3 Karolinska Institute, Solna, Sweden {Jan-Eric.Litton,Roxanna.Martinez}@ki.se
4 Charité, Berlin, Germany {Lora.Dimitrova,Michael.Hummel,Karin.Zimmermann}@charite.de
5 LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal {bessani,vielmo}@lasige.di.fc.ul.pt
6 Uppsala University, Uppsala, Sweden [email protected]

Abstract. Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck, as existing software systems are not scalable and secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to include support for multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine featuring a proper workflow definition language focusing on simple integration and chaining of existing tools, adaptive scheduling on Apache YARN, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running, managing, and accessing data sets behind secure two-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open-source and includes pre-defined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.

© Springer International Publishing Switzerland 2016
F. Wang et al. (Eds.): Big-O(Q) and DMAH 2015, LNCS 9579, pp. 89–105, 2016.
DOI: 10.1007/978-3-319-41576-5_7


A. Bessani et al.

1 Introduction

Biobanks store and catalog human biological material from identifiable individuals for both clinical and research purposes. Recent initiatives in personalized medicine have created a steeply increasing demand to sequence the human biological material stored in biobanks. As of 2015, such large-scale sequencing is under way in hundreds of projects around the world, with the largest single project sequencing up to 100,000 genomes.¹ Furthermore, sequencing is also becoming increasingly routine in clinical settings for improving diagnosis and therapy, especially in cancer [1]. However, software systems for biobanks have traditionally managed only the metadata associated with samples, such as pseudo-identifiers for patients, sample collection information, or study information. Such systems cannot cope with the current requirement to, alongside such metadata, also