Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets


ORIGINAL RESEARCH

Adane Nega Tarekegn¹ · Krzysztof Michalak² · Mario Giacobini³

Received: 13 March 2020 / Accepted: 30 July 2020
© Springer Nature Singapore Pte Ltd 2020

Abstract

Clustering validation is one of the most important and challenging parts of clustering analysis, because there is no ground truth against which to compare the results. Until now, evaluation methods for clustering algorithms have been used to determine the optimal number of clusters in the data, to assess the quality of clustering results through various validity criteria, and to compare results with other clustering schemes, among other purposes. In practice, it is also often important to build a model on a large amount of training data and then apply it repeatedly to smaller amounts of new data, which amounts to assigning new data points to existing clusters constructed on the training set. However, very little practical guidance is available on measuring how well the constructed model predicts cluster labels for new samples. In this study, we propose an extension of the cross-validation procedure to evaluate the quality of a clustering model in predicting cluster membership for new data points. The performance score is measured in terms of the root mean squared error (RMSE), based on the information from multiple labels of the training and testing samples. Principal component analysis (PCA) followed by the k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested on three benchmark multi-label datasets and showed promising results, with an overall RMSE of less than 0.075 and a mean absolute percentage error (MAPE) of less than 12.5% across the three datasets.

Keywords  Clustering validation · Clustering analysis · Cross-validation · Multi-label data
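The general idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_pca`, `kmeans`, `assign`, `cv_cluster_rmse`), the NumPy-only PCA and k-means, the synthetic data, and in particular the scoring rule (RMSE between per-cluster label proportions in the training and test folds) are assumptions made for illustration. The exact error definition used in the study is given later in the paper and may differ.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA on X via SVD; return the training mean and principal axes."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def assign(Z, centers):
    """Assign each row of Z to its nearest center (squared Euclidean distance)."""
    d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(Z, k, rng, n_iter=50):
    """Plain Lloyd's k-means with random initial centers drawn from Z."""
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(Z, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return centers

def cv_cluster_rmse(X, Y, k=3, n_components=2, n_folds=5, seed=0):
    """Cross-validated RMSE between train and test label proportions per cluster.

    ASSUMPTION: comparing per-cluster label proportions is an illustrative
    scoring rule, not necessarily the one defined in the paper.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit PCA and k-means on the training fold only.
        mean, comps = fit_pca(X[train_idx], n_components)
        Z_train = (X[train_idx] - mean) @ comps.T
        Z_test = (X[test_idx] - mean) @ comps.T
        centers = kmeans(Z_train, k, rng)
        # Predict cluster membership for training and held-out points.
        c_train, c_test = assign(Z_train, centers), assign(Z_test, centers)
        sq_errors = []
        for j in range(k):
            tr, te = c_train == j, c_test == j
            if tr.any() and te.any():
                diff = Y[train_idx][tr].mean(axis=0) - Y[test_idx][te].mean(axis=0)
                sq_errors.append(diff ** 2)
        if sq_errors:
            scores.append(np.sqrt(np.mean(np.concatenate(sq_errors))))
    return float(np.mean(scores))

# Synthetic stand-in for a multi-label dataset (features X, binary labels Y).
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(300, 6))
Y_demo = (demo_rng.random((300, 4)) < 0.3).astype(float)
print(cv_cluster_rmse(X_demo, Y_demo))
```

The key design point is that PCA and k-means are fitted on the training fold only, and test points are assigned to the nearest existing centroid, which mirrors the "assign new data points to existing clusters" setting described in the abstract.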

* Adane Nega Tarekegn [email protected]
  Krzysztof Michalak [email protected]
  Mario Giacobini [email protected]

1 Modelling and Data Science, Department of Mathematics, University of Turin, Turin, Italy
2 Department of Information Technologies, Wroclaw University of Economics, Wroclaw, Poland
3 Data Analysis and Modeling Unit, Department of Veterinary Sciences, University of Turin, Turin, Italy

Introduction

Overview of Unsupervised Learning

Unsupervised learning aims to find the underlying structure or the distribution of data. It is an important area in the domain of machine learning, where the labels for the data examples are not necessarily required for model building.



The main tasks in unsupervised learning include cluster analysis [40, 42], building self-organizing maps (SOM) [21], representation learning [2], and density estimation [31]. Cluster analysis, the main focus of this study, is a central task for grouping heterogeneous data points into a number of more homogenous subgroups based on distance, or naturally occurring trends, patterns, and relationships in the data. The formation of homogenous or heteroge