An efficient evolutionary algorithm with a nearest neighbor search technique for clustering analysis

  • PDF / 6,853,305 Bytes
  • 26 Pages / 595.276 x 790.866 pts Page_size
  • 7 Downloads / 205 Views

DOWNLOAD

REPORT


ORIGINAL RESEARCH

An efficient evolutionary algorithm with a nearest neighbor search technique for clustering analysis Raneem Qaddoura1 · Hossam Faris2 · Ibrahim Aljarah2  Received: 4 July 2019 / Accepted: 22 September 2020 © Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Evolutionary algorithms have shown their powerful capabilities in different machine learning problems including clustering which is a growing area of research nowadays. In this paper, we propose an efficient clustering technique based on the evolution behavior of genetic algorithm and an advanced variant of nearest neighbor search technique based on assignment and election mechanisms. The goal of the proposed algorithm is to improve the quality of clustering results by finding a solution that maximizes the separation between different clusters and maximizes the cohesion between data points in the same cluster. Our proposed algorithm which we refer to as “EvoNP” is tested with 15 well-known data sets using 5 well-known external evaluation measures and is compared with 7 well-regarded clustering algorithms . The experiments are conducted in two phases: evaluation of the best fitness function for the algorithm and evaluation of the algorithm against other clustering algorithms. The results show that the proposed algorithm works well with silhouette coefficient fitness function and outperforms the other algorithms for the majority of the data sets. The source code of EvoNP is available at http://evo-ml.com/evonp​/. Keywords  Cluster analysis · Clustering · Nearest neighbor search · Evolutionary algorithms · Optimization algorithms · Nature-inspired algorithms

1 Introduction “Clustering” or “cluster analysis” is an unsupervised learning task where the true classes are not available. It aims at grouping similar data points to the same cluster and dissimilar data points to different clusters (Özbakır and Turna 2017). This results in maximizing the separation between different clusters and the cohesion between data points in the same cluster. The similarity between data points can be expressed by distance metrics. Euclidean distance is one of the distance metrics that are commonly used for this purpose (Anton 2013), which is considered in this study. * Ibrahim Aljarah [email protected] Raneem Qaddoura [email protected] Hossam Faris [email protected] 1



Information Technology, Philadelphia University, Amman, Jordan



King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan

2

Clustering is used for data analysis to gain insights about the data by splitting the data points into groups. These insights are usually used for decision making and future reasoning. For example, clustering is used in customer-oriented companies where analyzing the data generated from these customers helps get future insights for better decision making (Ozyirmidokuz et al. 2015). It is also used in a wide set of applications such as image processing (Kumar et al. 2018), information retrieval (Sharma