Semantic string operation for specializing AHC algorithm for text clustering


Taeho Jo

© Springer Nature Switzerland AG 2020

Abstract This article proposes a modified AHC (Agglomerative Hierarchical Clustering) algorithm that clusters string vectors, instead of numerical vectors, as an approach to text clustering. Two facts motivate this research: string vector based algorithms were applied successfully to text clustering in previous works, and a synergy effect is expected from combining text clustering and word clustering with each other. In this research, we define an operation on string vectors called semantic similarity, and modify the AHC algorithm by adopting it as the similarity metric. The proposed AHC algorithm is empirically validated as the better approach to clustering texts in news articles and opinions. More operations on string vectors need to be defined and characterized mathematically in order to modify more advanced machine learning algorithms.

Keywords String vector · Semantic similarity · String vector based AHC algorithm · Text clustering

Mathematics Subject Classification (2010) 68T05
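As a reading aid for the idea summarized in the abstract, the following is a minimal Python sketch of agglomerative clustering over string vectors. It is not the article's definition: the position-wise average used in `semantic_similarity`, the average-link rule in `cluster_similarity`, the `WORD_SIM` lookup table with its values, and all function names are assumptions introduced here purely for illustration.

```python
from itertools import combinations

# Hypothetical word-to-word similarities (illustrative values only,
# not taken from the article).
WORD_SIM = {
    frozenset(("bank", "finance")): 0.8,
    frozenset(("loan", "credit")): 0.7,
    frozenset(("team", "club")): 0.9,
    frozenset(("goal", "score")): 0.8,
}


def word_similarity(a, b):
    """Similarity between two words: 1.0 if identical, else table lookup (0.0 if absent)."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)


def semantic_similarity(u, v):
    """Assumed operation: average of position-wise word similarities
    between two string vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("string vectors must have the same dimension")
    return sum(word_similarity(a, b) for a, b in zip(u, v)) / len(u)


def cluster_similarity(c1, c2, vectors):
    """Average-link similarity between two clusters of vector indices (an assumption)."""
    total = sum(semantic_similarity(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def ahc(vectors, num_clusters):
    """Start from singleton clusters and repeatedly merge the most similar pair."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters


# Four toy texts, each already encoded as a 3-dimensional string vector.
texts = [
    ("bank", "loan", "rate"),
    ("finance", "credit", "rate"),
    ("team", "goal", "match"),
    ("club", "score", "match"),
]
print(ahc(texts, 2))  # [[0, 1], [2, 3]]
```

The only part that differs from a numerical-vector version is the similarity function: the merging loop is unchanged, which is why swapping the metric is enough to specialize the algorithm in this sketch.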

Taeho Jo, 190 Garosuro, Cheongju, 28168, South Korea

1 Introduction

Text clustering refers to the process of segmenting an entire collection of texts into subcollections of similar ones. To carry out the task, we prepare texts, encode them into structured forms, and define one or more similarity measures among the representations. Clusters of texts are built by computing similarities among the texts based on their contents. As subsequent tasks, we consider selecting a representative text in each cluster and naming the clusters relevantly based on their contents. The results of clustering texts depend strongly on the scheme for computing similarities among the representations of texts.

Let us mention the cases that motivate this research. In traditional systems, encoding texts into numerical vectors caused three main problems: huge dimensionality, sparse distribution, and poor transparency [18]. Previously, as a solution to these problems, we proposed encoding texts into tables, but the performance was unstable depending on the given domain [18]. Encoding texts into tables also requires optimizing the table size as a trade-off between reliability and speed [18]. Therefore, in this research, we attempt to encode texts into string vectors in response to these problems.

Let us consider what we propose in this research as solutions to the above problems. Instead of numerical vectors, texts are encoded into string vectors as their structured forms. We define the semantic similarity measure between two string vectors as the operation that corresponds to the cosine similarity between two numerical vectors. Based on the similarity measure, we modify the AHC (Agglomerative Hierarchical Clustering) algorithm i