Hypercluster: a flexible tool for parallelized unsupervised clustering optimization



Open Access

SOFTWARE

Lili Blumenberg1,2 and Kelly V. Ruggles1,2*

*Correspondence: kelly.ruggles@nyulangone.org
1 Institute of Systems Genetics, New York University Grossman School of Medicine, New York, NY 10016, USA
Full list of author information is available at the end of the article

Abstract

Background: Unsupervised clustering is a common and exceptionally useful tool for large biological datasets. However, clustering requires upfront algorithm and hyperparameter selection, which can introduce bias into the final clustering labels. It is therefore advisable to obtain a range of clustering results from multiple models and hyperparameters, which can be cumbersome and slow.

Results: We present hypercluster, a Python package and SnakeMake pipeline for flexible and parallelized clustering evaluation and selection. Users can efficiently evaluate a huge range of clustering results from multiple models and hyperparameters to identify an optimal model.

Conclusions: Hypercluster improves ease of use, robustness and reproducibility when applying unsupervised clustering to high-throughput biology. Hypercluster is available on pip and bioconda; installation, documentation and example workflows can be found at: https://github.com/ruggleslab/hypercluster.

Keywords: Machine learning, Unsupervised clustering, Hyperparameter optimization, Scikit-learn, Python, SnakeMake
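To make the idea of comparing clustering results across hyperparameters concrete, the following is a minimal sketch and not hypercluster's API. It assumes a single algorithm (a plain k-means written here with a deterministic farthest-point initialization), a single hyperparameter (the number of clusters k), a single evaluation metric (the mean silhouette score) and a hypothetical toy dataset of three well-separated groups. All function and variable names are illustrative.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means with deterministic farthest-point initialization (illustrative)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette score; points in singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: three tight, well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

# Sweep the hyperparameter and keep one score per setting.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this toy data
```

A real comparison would also sweep over several algorithms and several evaluation metrics. Each hyperparameter setting is scored independently of the others, which is why this kind of search can be run in parallel.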

Background

Unsupervised clustering is ubiquitously used for the interpretation of ‘omics datasets [1–7]. Clustering is a particularly central challenge in the analysis of single-cell measurement data (e.g. single-cell RNA-seq) due to its high dimensionality [8–10]. Clustering is also increasingly being used for disease subtype classification and risk stratification [11–19]. It is therefore essential that optimal clustering results are easily and robustly obtainable, without user-selected hyperparameters introducing bias and impeding rapid analysis.

Clustering is inherently under-defined [20–22]. The definition of “cluster” varies with the problem and with the desired goal of the analysis [14], and therefore it is not possible to make a single algorithm or metric that can universally identify the “best” clusters [23]. Researchers therefore often compare results from multiple algorithms and hyperparameters [7, 24–28]. Typically, the effect of hyperparameter choice on the quality of clustering results cannot be described with a convex function, meaning that

© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a