Data-Driven Statistical Approaches for Omics Data Analysis



Abstract With the rapid development of high-throughput technology, the volume of omics data for biological systems has grown exponentially. A challenging problem for biologists is how to extract useful information from high-dimensional or ultrahigh-dimensional omics data. In this chapter, we introduce some recent progress in omics data analysis, paying special attention to the related data-driven statistical approaches. In particular, weighted gene co-expression network analysis, the genome-wide association study, general linear models, and the hidden Markov random field model will be introduced.

9.1 Backgrounds

9.1.1 Various High-Throughput Sequencing Technologies

With the rapid development of high-throughput technology, the volume of omics data for biological systems has grown exponentially [1–17]. The human genome sequence was completed in draft form in 2001 [18, 19]. Shortly thereafter, the genome sequences of several model organisms were determined [20–22]. These feats were accomplished with Sanger DNA sequencing, which was limited in throughput and high in cost. Commercially available high-throughput sequencing (HTS) platforms (Figs. 9.1 and 9.2) that improve on traditional Sanger sequencing include the following: (1) The Illumina Genome Analyzer II, released by Illumina/Solexa in 2006. Illumina has since produced a suite of sequencers (MiSeq, NextSeq 500, and the HiSeq series) optimized for a variety of throughputs and turnaround times. In early 2014, Illumina introduced the NextSeq 500 as well as the HiSeq X Ten. The NextSeq 500 is designed as a fast benchtop sequencer for individual labs, while the HiSeq X Ten is a population-scale whole-genome sequencing (WGS) system. (2) Life Technologies commercialized Ion Torrent’s semiconductor sequencing technology in 2010 in the form of the benchtop Ion PGM sequencer. The template preparation and sequencing steps are conceptually similar to those of the Roche/454 pyrosequencing platform [23]. (3) Single-molecule real-time (SMRT) sequencing

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020. J. Lü, P. Wang, Modeling and Analysis of Bio-molecular Networks, https://doi.org/10.1007/978-981-15-9144-0_9


[Figure: machine output (Mb) of high-throughput sequencing platforms, with reference levels marked for a 30x human genome and a 100x human exome. Platforms labeled in the plot: Solexa/Illumina sequence analyzer, Illumina GAII, Illumina GAIIx, Illumina MiSeq, Illumina NextSeq 500, Illumina HiSeq 2000, Illumina HiSeq 2500, Illumina HiSeq 3000, Illumina HiSeq X Ten, ABI SOLiD, ABI SOLiD 3, ABI SOLiD 5500xl, ABI SOLiD 5500xl W, Complete Genomics, Intelligent Bio-Systems MAX-Seq, Polonator G.007, Ion Torrent Ion PGM, Ion Torrent Ion Proton, Roche/454 GS FLX+, Roche/454 GS Junior, Pacific Bioscience RSII, Helicos Heliscope, Oxford Nanopore MinION.]

454 GS-20 pyroseque