Identification and Clinical Translation of Biomarker Signatures: Statistical Considerations

Powerful machine learning tools exist to extract biological patterns for diagnosis or prediction from high-dimensional datasets. Simultaneous advances in high-throughput profiling technologies have led to a rapid acceleration of biomarker discovery invest


1 Introduction

Advances in high-throughput technology have tremendously accelerated the search for biological patterns that have clinical utility for diagnosis and prediction. Among these are multiplexed assays that facilitate simultaneous measurement of analytes in small sample volumes, with high throughput and low variability, often comparable to the single-plex gold standard methodology. These developments have been paralleled by a tremendous increase in the application of multivariate methods to identify biological signatures within the generated high-dimensional datasets, a process that has been accelerated by the availability of complex algorithms in standard software packages. These facilitate the extraction of complex biological patterns from high-dimensional data that can already be transferred efficiently into dedicated multiplexed measurement systems. To aid in this process, computational pipelines have been developed that support translation from study design and initial biomarker

Paul C. Guest (ed.), Multiplex Biomarker Techniques: Methods and Applications, Methods in Molecular Biology, vol. 1546, DOI 10.1007/978-1-4939-6730-8_6, © Springer Science+Business Media LLC 2017

Emanuel Schwarz

screening to clinically applicable multiplexed tests [1]. However, such transferability is platform dependent, and results from high-throughput profiling within a research setting may not be easily transferred into clinically usable assay systems [2].

Despite these technological advances, the translation of biomarker candidates into clinical tests has been slow. For example, in the cancer field, many biomarker candidates that showed promise during the initial stages of the development process did not turn out to be clinically useful, and this was frequently not realized until the later stages of the process [3]. While numerous aspects affect the discovery and clinical translation of biomarker signatures, including clinical, methodological, and regulatory challenges [4–8], this chapter focuses on statistical considerations regarding the optimal identification of biomarker sets.

The computational identification of biological signatures typically falls within the realm of supervised machine learning, where a given part of the labeled data is used for “training,” to select an “optimal” combination of measurements for a predefined classification or prediction task. The algorithm is subsequently tested on a part of the data not used during training, and the accuracy of the prediction in this test data is used as an estimate of how the classifier will perform in future, independent datasets. In practice, training and testing are usually performed by splitting datasets into a training and a test set, by cross-validation, or by similar techniques. In cross-validation, the data are split randomly into a given number of chunks, and each of these chunks is used as a test set in turn until all subjects have been classified once. Estimates of classifier accuracy are then determined across the enti