Data Analysis in Rare Disease Diagnostics

REVIEW ARTICLE

© Indian Institute of Science 2020.

Vamsi Veeramachaneni*

Abstract | There are more than 8000 documented rare diseases in the world. While each disease is rare in itself, it is estimated that 1 in every 15 to 20 persons is affected by some rare disease. Most rare diseases are caused by just one or two small changes in the genome. Identifying the causative variant among the millions of variants that differentiate one person's genome from another is a challenging task. In this article, we provide an overview of the data processing that takes place during the multi-stage rare disease diagnosis process. At each stage, we describe algorithms and methods that are in use in diagnostic laboratories and also describe how machine learning in general, and deep learning in particular, are improving the process.

1 Introduction

A draft sequence covering ~95% of the human genome was first released in 2000 [1]. The sequence, commonly referred to as the human reference genome sequence, is a composite created by sequencing and painstakingly assembling DNA obtained from anonymous volunteers of diverse backgrounds. This ~3 billion nucleotide-long genome sequence has undergone several revisions over the years, and a few small regions still remain intractable. It is not an exaggeration to state that all clinical genomics applications today use the reference sequence as the basis for analysis.

In this article, we focus on rare disease diagnosis through sequencing. There are over 8600 rare disease phenotypes documented in OMIM today [2]. The molecular basis for 6200 of these diseases has been traced to 3900 genes in the reference genome. Most rare diseases are caused by just one or two variants present in the patient genome. However, identifying the exact variants among the more than 5 million small variants that distinguish any individual from the reference genome is an extremely challenging task [3].
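To give a rough sense of the scale involved, the figures quoted above can be turned into simple ratios. The following is a minimal back-of-the-envelope sketch in Python; the constant and function names are illustrative, the inputs are the approximate numbers from the text, and taking two causative variants (the upper end of "one or two") is an assumption made only for this calculation.

```python
# Approximate figures quoted in the text; not measurements from any sample.
GENOME_LENGTH = 3_000_000_000   # ~3 billion nucleotides in the reference genome
SMALL_VARIANTS = 5_000_000      # >5 million small variants per individual
CAUSATIVE_VARIANTS = 2          # assumption: upper end of "one or two"


def search_space_summary(genome_length, n_variants, n_causative):
    """Return simple ratios describing how sparse the causative variants are."""
    return {
        # Average spacing between variants along the genome.
        "bases_per_variant": genome_length / n_variants,
        # Share of an individual's variants that are causative.
        "causative_fraction": n_causative / n_variants,
    }


summary = search_space_summary(GENOME_LENGTH, SMALL_VARIANTS, CAUSATIVE_VARIANTS)
print(f"About one variant every {summary['bases_per_variant']:.0f} bases")
print(f"Causative fraction of variants: {summary['causative_fraction']:.1e}")
```

Under these assumptions, a person differs from the reference at roughly one position in every 600, and fewer than one in a million of those differences is the one being sought.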
There are four major steps in the rare disease diagnosis process: sequencing, variant detection, variant assessment, and variant prioritization. In this article, we walk through these steps, explaining the data analysis that happens at each step. We also describe emerging machine learning solutions that could have a bearing on how each step is carried out in a diagnostic setting in the future.

J. Indian Inst. Sci. | VOL xxx:x | xxx–xxx 2020 | journal.iisc.ernet.in

2 Sequencing

Sequencing is the process of analyzing the DNA extracted from a sample and generating the nucleotide sequence that corresponds to it. In clinical genomics, the main goal of sequencing is to use the sequences to identify how the sample differs from the reference genome. We use the term variants to describe these differences. The implicit assumption is that some of the variants may help explain the cause of a disease or provide clues to the right treatment for the patient.

Variants can broadly be classified into four categories.

•  Substitution where a single base i