Improved metaheuristics for the quartet method of hierarchical clustering

  • PDF / 861,215 Bytes
  • 30 Pages / 439.37 x 666.142 pts Page_size
  • 76 Downloads / 220 Views

DOWNLOAD

REPORT


Improved metaheuristics for the quartet method of hierarchical clustering Sergio Consoli1

· Jan Korst1 · Steffen Pauws1,2 · Gijs Geleijnse3

Received: 27 December 2018 / Accepted: 30 December 2019 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract The quartet method is a novel hierarchical clustering approach where, given a set of n data objects and their pairwise dissimilarities, the aim is to construct an optimal tree from the total number of possible combinations of quartet topologies on n, where optimality means that the sum of the dissimilarities of the embedded (or consistent) quartet topologies is minimal. This corresponds to an NP-hard combinatorial optimization problem, also referred to as minimum quartet tree cost (MQTC) problem. We provide details and formulation of this challenging problem, and propose a basic greedy heuristic that is characterized by some appealing insights and findings for speeding up and simplifying the processes of solution generation and evaluation, such as the use of adjacency-like matrices to represent the topology structures of candidate solutions; fast calculation of coefficients and weights of the solution matrices; shortcuts in the enumeration of all solution permutations for a given configuration; and an iterative distance matrix reduction procedure, which greedily merges together highly connected objects which may bring lower values of the quartet cost function in a given partial solution. It will be shown that this basic greedy heuristic is able to improve consistently the performance of popular quartet clustering algorithms in the literature, namely a reduced variable neighbourhood search and a simulated annealing metaheuristic, producing novel efficient solution approaches to the MQTC problem. Keywords Combinatorial optimization · NP-hardness · Quartet trees · Hierarchical clustering · Data mining · Metaheuristics · Graphs · Minimum quartet tree cost

B

Sergio Consoli [email protected] Jan Korst [email protected] Steffen Pauws [email protected]

1

Philips Research, High Tech Campus 34, 5656 AE Eindhoven, The Netherlands

2

TiCC, Tilburg University, Warandelaan 2, 5037 AB Tilburg, The Netherlands

3

Netherlands Comprehensive Cancer Organisation (IKNL), Zernikestraat 29, 5612 HZ Eindhoven, The Netherlands

123

Journal of Global Optimization

1 Introduction and problem formulation The quartet method is a recent hierarchical clustering methodology popular in the context of biological phylogenesis data [6,8], which relies on global optimization of the constructed clustering tree without any a priori knowledge on the total number of clusters to be produced, in contrast to classical bottom-up or bottom-down hierarchical clustering approaches, such as quartet puzzling, neighborhood joining, and the like [6]. This novel hierarchical clustering approach is based on a combinatorial optimization problem formulated on graphs and referred in the literature to as minimum quartet tree cost (MQTC) problem, which was introduced for th