Two-stage pruning method for gram-based categorical sequence clustering

PDF / 1,845,088 Bytes
10 Pages / 595.276 x 790.866 pts Page_size
1 Downloads / 353 Views

ORIGINAL ARTICLE

Two‑stage pruning method for gram‑based categorical sequence clustering Liang Yuan1 · Wenjian Wang2 · Lifei Chen3 Received: 10 February 2017 / Accepted: 7 November 2017 © Springer-Verlag GmbH Germany, part of Springer Nature 2017

Abstract Gram-based vector space model has been extensively applied to categorical sequence clustering. However, there is a general lack of an efficient method to determine the length of grams and to identify redundant and non-significant grams involved in the model. In this paper, a variable-length gram model is proposed, different from previous studies mainly focused on the fixed-length grams of sequences. The variable-length grams are obtained using a two-stage pruning method aimed at selecting the irredundant and significant subsequences from the prefix trees, created from the fixed-length grams with an initially large length. A robust partitioning algorithm is then defined for categorical sequence clustering on the normalized representation model using variable-length grams collected from the pruned trees. Experimental results on real-world sequence sets from various domains are given to demonstrate the performance of the proposed methods. Keywords Clustering · Categorical sequence · Representation model · Variable-length gram · Pruning method

1 Introduction Sequence clustering, aimed at partitioning a set of sequences into homogeneous groups in terms of their structural similarity, has a wide range of applications. For example, in computational biology, clustering of gene sequences into different functional groups is now an essential method for biologists, used to predict functions of genes and to analyze evolutionary relationships within homogeneous sequences. The data in such applications are in the form of categorical sequences, each being a string of categories. Unlike the common attribute-value data, categorical sequences pose a unique challenge due to the lack of an efficient method for measuring the chronological dependency of categories within sequences [1, 2]. * Lifei Chen [email protected] 1

Network Operation Maintenance Center, University of Electronic Science and Technology of China, Chengdu, China

2

Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, China

3

Digit Fujian Internet‑of‑Thing Laboratory of Environment Monitoring, Fujian Normal University, Fuzhou, China

Despite numerous clustering algorithms in the literature [3–5], most of them cannot be directly applied to sequence clustering. The main difficulty lies in the definition for a meaningful measure of similarity between sequences. Currently, a few measures have been suggested and they can be roughly divided into two classes: alignment-based and alignment-free measures [6]. Those in the first class are implemented by symbols alignment in order to identify regions of similarity for sequences. Examples include edit distance [7] and symbol alignment distance [8], etc, which typically have a high time

Data Loading...

Two-stage pruning method for gram-based categorical sequence clustering

Recommend Documents

A Novel Clustering-Based Filter Pruning Method for Efficient Deep Neural Networks

Rough subspace-based clustering ensemble for categorical data

Improved Clustering for Categorical Data with Genetic Algorithm

Ensemble Similarity Clustering Frame work for Categorical Dataset Clustering Using Swarm Intelligence

Convex clustering method for compositional data modeling

Cluster-Based Ensemble Using Distributed Clustering Approach for Large Categorical Data

Categorical data

Rough Mode: A Generalized Centroid Proposal for Clustering Categorical Data Using the Rough Set Theory

A Deep Sequence-to-Sequence Method for Aircraft Landing Speed Prediction Based on QAR Data

Pre-indexing Pruning Strategies

Advances in Sequence Analysis: Theory, Method, Applications

PRIS: A Novel Complex Sequence Construction Method