BiobankCloud: A Platform for the Secure Storage, Sharing, and Processing of Large Biomedical Data Sets



1 KTH - Royal Institute of Technology, Stockholm, Sweden
{jdowling,gholami,mahh,maism,erwinl,smkniazi}@kth.se
2 Humboldt-Universität zu Berlin, Berlin, Germany
{joergen.brandt,bux,leser}@informatik.hu-berlin.de
3 Karolinska Institute, Solna, Sweden
{Jan-Eric.Litton,Roxanna.Martinez}@ki.se
4 Charite, Berlin, Germany
{Lora.Dimitrova,Michael.Hummel,Karin.Zimmermann}@charite.de
5 LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
{bessani,vielmo}@lasige.di.fc.ul.pt
6 Uppsala University, Uppsala, Sweden
[email protected]

Abstract. Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck, as existing software systems are not scalable and secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to include support for multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine featuring a proper workflow definition language focusing on simple integration and chaining of existing tools, adaptive scheduling on Apache Yarn, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running, managing, and accessing data sets behind a secure 2-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open-source and includes pre-defined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNASeq, and analysis of miRNA-Seq and ChIP-Seq data.

© Springer International Publishing Switzerland 2016. F. Wang et al. (Eds.): Big-O(Q) and DMAH 2015, LNCS 9579, pp. 89–105, 2016. DOI: 10.1007/978-3-319-41576-5_7

A. Bessani et al.

1 Introduction

Biobanks store and catalog human biological material from identifiable individuals for both clinical and research purposes. Recent initiatives in personalized medicine have created a steeply increasing demand to sequence the human biological material stored in biobanks. As of 2015, such large-scale sequencing is under way in hundreds of projects around the world, with the largest single project sequencing up to 100,000 genomes¹. Furthermore, sequencing is also becoming more and more routine in clinical settings for improving diagnosis and therapy, especially in cancer [1]. However, software systems for biobanks traditionally managed only metadata associated with samples, such as pseudo-identifiers for patients, sample collection information, or study information. Such systems cannot cope with the current requirement to, alongside such metadata, also
