Hypercluster: a flexible tool for parallelized unsupervised clustering optimization

Open Access

SOFTWARE

Lili Blumenberg1,2 and Kelly V. Ruggles1,2*

*Correspondence: kelly.ruggles@nyulangone.org
1 Institute of Systems Genetics, New York University Grossman School of Medicine, New York, NY 10016, USA
Full list of author information is available at the end of the article

Abstract

Background: Unsupervised clustering is a common and exceptionally useful tool for large biological datasets. However, clustering requires upfront selection of an algorithm and its hyperparameters, which can introduce bias into the final clustering labels. It is therefore advisable to obtain a range of clustering results from multiple models and hyperparameters, which can be cumbersome and slow.

Results: We present hypercluster, a Python package and SnakeMake pipeline for flexible and parallelized clustering evaluation and selection. Users can efficiently evaluate a wide range of clustering results from multiple models and hyperparameters to identify an optimal model.

Conclusions: Hypercluster improves the ease of use, robustness and reproducibility of unsupervised clustering applied to high-throughput biology. Hypercluster is available on pip and bioconda; installation, documentation and example workflows can be found at: https://github.com/ruggleslab/hypercluster.

Keywords: Machine learning, Unsupervised clustering, Hyperparameter optimization, Scikit-learn, Python, SnakeMake
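The idea of evaluating clustering results across a range of hyperparameters can be illustrated with a minimal sketch. This is not hypercluster's API: the function names (`kmeans`, `silhouette`, `grid_search`), the toy data and the choice of the silhouette coefficient as the evaluation metric are all assumptions made for illustration, and the code uses only the Python standard library, where in practice a library such as scikit-learn would normally supply the algorithms and metrics.

```python
import math
import random
from itertools import product


def kmeans(points, k, seed, n_iter=20):
    """Lloyd's k-means with farthest-first seeding; returns one label per point."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treated here as the worst score
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def grid_search(points, grid):
    """Score every hyperparameter combination; return (score, params) best-first."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        labels = kmeans(points, **params)
        results.append((silhouette(points, labels), params))
    return sorted(results, key=lambda r: r[0], reverse=True)


# Toy data (hypothetical): three well-separated 2-D blobs of 20 points each.
data_rng = random.Random(0)
blobs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in blobs
    for _ in range(20)
]

grid = {"k": [2, 3, 4, 5], "seed": [0, 1, 2]}
results = grid_search(points, grid)
best_score, best_params = results[0]
print(f"best silhouette {best_score:.3f} with {best_params}")
```

Each grid cell is scored independently of the others, which is what makes this kind of search straightforward to parallelize. A single metric is used here only for brevity.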

Background

Unsupervised clustering is ubiquitously used for the interpretation of ‘omics datasets [1–7]. Clustering is a particularly central challenge in the analysis of single-cell measurement data (e.g. single-cell RNA-seq) due to its high dimensionality [8–10]. Clustering is also increasingly being used for disease subtype classification and risk stratification [11–19]. It is therefore essential that optimal clustering results are easily and robustly obtainable, without user-selected hyperparameters introducing bias and impeding rapid analysis.

Clustering is inherently under-defined [20–22]. The definition of a “cluster” varies with the problem and the desired goal of the analysis [14], and it is therefore not possible to design a single algorithm or metric that can universally identify the “best” clusters [23]. Researchers therefore often compare results from multiple algorithms and hyperparameters [7, 24–28]. Typically, the effect of hyperparameter choice on the quality of clustering results cannot be described with a convex function, meaning that

© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a
