A unified view of density-based methods for semi-supervised clustering and classification
Jadson Castro Gertrudes · Arthur Zimek · Jörg Sander · Ricardo J. G. B. Campello

Received: 19 August 2018 / Accepted: 8 August 2019
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019
Abstract

Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.

Keywords Semi-supervised classification · Semi-supervised clustering · Density-based clustering
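To make the link between density-based clustering and graph-based transductive classification more concrete, the following is a minimal illustrative sketch and not the framework proposed in the paper. It assumes the usual mutual reachability distance from the HDBSCAN* literature, max(core(a), core(b), d(a, b)), with the core distance taken as the distance to the `min_pts`-th nearest neighbor counting the point itself. It then spreads labels Prim-style from the labeled seeds over the resulting complete graph. The function names, the `min_pts` default, and the toy data are all hypothetical.

```python
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbor.

    Assumption of this sketch: the point itself counts as its first neighbor.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[min_pts - 1])
    return cores


def mutual_reachability(points, min_pts):
    """Pairwise mutual reachability distances: max(core_i, core_j, d(i, j))."""
    cores = core_distances(points, min_pts)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(cores[i], cores[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate_labels(points, labels, min_pts=2):
    """Transductive labeling sketch.

    `labels` maps point index -> class for the labeled points (at least one).
    Repeatedly attach the unlabeled point that is closest, in mutual
    reachability distance, to any already-labeled point, and copy that
    point's label (a Prim-style growth from the labeled seeds).
    """
    d = mutual_reachability(points, min_pts)
    assigned = dict(labels)
    remaining = set(range(len(points))) - set(assigned)
    while remaining:
        _, u, v = min((d[i][j], i, j) for i in remaining for j in assigned)
        assigned[u] = assigned[v]
        remaining.remove(u)
    return [assigned[i] for i in range(len(points))]


# Toy data: two well-separated groups with one labeled point in each.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
toy_labels = {0: "a", 4: "b"}
print(propagate_labels(toy_points, toy_labels))
```

On this toy input every point inherits the label of the seed in its own group, because all within-group mutual reachability distances are smaller than any distance between the groups. The quadratic search in each step keeps the sketch short; it is not meant to be efficient.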
Responsible editor: Ian Davidson.

1 Introduction

Semi-supervised learning algorithms tackle cases where a relatively small amount of labeled data yet a large amount of unlabeled data is available for training (Chapelle et al. 2006; Zhu and Goldberg 2009). We find examples of semi-supervised learning scenarios in various fields, such as email filtering, sound/speech recognition, text/webpage classification, and compound discovery, to name a few. For instance, in areas such as biology, chemistry, and medicine, domain experts and laboratory analyses may be required to label observations, so only a small collection of labeled data can usually be afforded, which may not be representative enough for supervised learning to be applied (Batista et al. 2016). Typically, semi-supervised learning algorithms are based on extensions of either supervised or unsupervised algorithms by including additional information in the form originally handled by the other learning paradigm. For instance, in semi-supervised clustering, a collection of labeled observations can be used to guide the (otherwise unsupervised) search for clustering solutions that better meet users’ prior expect