Two-stage pruning method for gram-based categorical sequence clustering

  • PDF / 1,845,088 Bytes
  • 10 Pages / 595.276 x 790.866 pts Page_size
  • 1 Downloads / 330 Views

DOWNLOAD

REPORT


ORIGINAL ARTICLE

Two‑stage pruning method for gram‑based categorical sequence clustering Liang Yuan1 · Wenjian Wang2 · Lifei Chen3 Received: 10 February 2017 / Accepted: 7 November 2017 © Springer-Verlag GmbH Germany, part of Springer Nature 2017

Abstract Gram-based vector space model has been extensively applied to categorical sequence clustering. However, there is a general lack of an efficient method to determine the length of grams and to identify redundant and non-significant grams involved in the model. In this paper, a variable-length gram model is proposed, different from previous studies mainly focused on the fixed-length grams of sequences. The variable-length grams are obtained using a two-stage pruning method aimed at selecting the irredundant and significant subsequences from the prefix trees, created from the fixed-length grams with an initially large length. A robust partitioning algorithm is then defined for categorical sequence clustering on the normalized representation model using variable-length grams collected from the pruned trees. Experimental results on real-world sequence sets from various domains are given to demonstrate the performance of the proposed methods. Keywords  Clustering · Categorical sequence · Representation model · Variable-length gram · Pruning method

1 Introduction Sequence clustering, aimed at partitioning a set of sequences into homogeneous groups in terms of their structural similarity, has a wide range of applications. For example, in computational biology, clustering of gene sequences into different functional groups is now an essential method for biologists, used to predict functions of genes and to analyze evolutionary relationships within homogeneous sequences. The data in such applications are in the form of categorical sequences, each being a string of categories. Unlike the common attribute-value data, categorical sequences pose a unique challenge due to the lack of an efficient method for measuring the chronological dependency of categories within sequences [1, 2]. * Lifei Chen [email protected] 1



Network Operation Maintenance Center, University of Electronic Science and Technology of China, Chengdu, China

2



Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, China

3

Digit Fujian Internet‑of‑Thing Laboratory of Environment Monitoring, Fujian Normal University, Fuzhou, China



Despite numerous clustering algorithms in the literature [3–5], most of them cannot be directly applied to sequence clustering. The main difficulty lies in the definition for a meaningful measure of similarity between sequences. Currently, a few measures have been suggested and they can be roughly divided into two classes: alignment-based and alignment-free measures [6]. Those in the first class are implemented by symbols alignment in order to identify regions of similarity for sequences. Examples include edit distance [7] and symbol alignment distance [8], etc, which typically have a high time