Comparing High-Dimensional Partitions with the Co-clustering Adjusted Rand Index



Valerie Robert¹,² · Yann Vasseur¹ · Vincent Brault³

Accepted: 7 October 2020
© The Classification Society 2020

Abstract

We consider the simultaneous clustering of rows and columns of a matrix and, more particularly, the ability to measure the agreement between two co-clustering partitions. The new criterion we develop is based on the Adjusted Rand Index and is called the Co-clustering Adjusted Rand Index (CARI). We also suggest improvements to existing criteria, such as the classification error, which counts the proportion of misclassified cells, and the Extended Normalized Mutual Information criterion, which generalizes the mutual-information-based criterion for classic classifications. We study these criteria with regard to desired properties deriving from the co-clustering context. Experiments on simulated and real observed data are proposed to compare the behavior of these criteria.

Keywords Co-clustering · Adjusted Rand Index · Mutual information · Agreement · Partition
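To make the base criterion concrete, the following is a minimal Python sketch of the classical Adjusted Rand Index that CARI builds on, together with one simplified, assumed way of applying it to co-clusterings: labeling every matrix cell by its (row-cluster, column-cluster) block and comparing the resulting cell partitions. The function names are illustrative, not from the paper.

```python
from math import comb
from collections import Counter
from itertools import product

def adjusted_rand_index(labels_a, labels_b):
    """Classical Adjusted Rand Index between two partitions of the same set."""
    n = len(labels_a)
    assert n == len(labels_b), "both partitions must label the same elements"
    # Contingency table n_ij: how many elements fall in cluster i of the
    # first partition and cluster j of the second.
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # expected index under chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def cell_block_labels(row_labels, col_labels):
    """Label each matrix cell (i, j) by its (row cluster, column cluster) block."""
    return [(z, w) for z, w in product(row_labels, col_labels)]

# Two co-clusterings that differ only by a relabeling of blocks
# should agree perfectly on the induced cell partition.
cells_a = cell_block_labels([0, 0, 1], [0, 1])
cells_b = cell_block_labels([1, 1, 0], [1, 0])
score = adjusted_rand_index(cells_a, cells_b)
```

Note that this brute-force view enumerates all n × d cells and is shown only for intuition; the CARI criterion studied in the paper is defined directly from the partition contingency tables, which avoids that enumeration.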

Valerie Robert
[email protected]

Yann Vasseur
[email protected]

Vincent Brault
[email protected]

1 Laboratoire de Mathématiques, UMR 8628, Bâtiment 425, Université Paris Saclay, F-91405, Orsay, France

2 LIM - Laboratoire d'Informatique et de Mathématiques, Université de la Réunion, 2 Rue Joseph Wetzell, 97490, Sainte-Clotilde, France

3 Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000, Grenoble, France

Journal of Classification

1 Introduction

With the advent of sparse and high-dimensional datasets in statistics, co-clustering has become a topic of interest in recent years in many fields of application. For example, in the context of text mining, Dhillon et al. (2003) aim at finding similar documents and their interplay with word clusters. In genomics, the objective of Jagalur et al. (2007) is to identify groups of genes whose expression is linked to anatomical sub-structures in the mouse brain. Shan and Banerjee (2008) and Wyse et al. (2017) used co-clustering to obtain groups of users sharing the same movie tastes and to improve recommendation systems. Finally, we can cite a new approach in pharmacovigilance proposed by Keribin et al. (2017), which uses co-clustering to simultaneously provide clusters of individuals sharing the same drug profile along with the link between drugs and non-detected adverse events.

Originally developed by Hartigan (1975), co-cluster analysis aims at reducing the data matrix to a simpler one while preserving the information contained in the initial matrix (Govaert and Nadif 2013). Co-clustering methods provide a simultaneous partition of two sets A (rows, observations, individuals) and B (columns, variables, attributes). The major advantages of this method lie in drastically reducing the dimension of the clustering problem and in the ease of dealing with sparse data (Dhillon et al. 2003). Thanks to co-clustering methods, information about the d