Constraint-Based Sequence Mining Using Constraint Programming
The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a gener
- PDF / 575,326 Bytes
- 18 Pages / 439.37 x 666.142 pts Page_size
- 113 Downloads / 299 Views
Abstract. The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility towards constraint-based settings and compare the approach to existing methods. Keywords: Sequential pattern mining · Sequence mining · Episode mining · Constrained pattern mining · Constraint programming · Declarative programming
1
Introduction
In AI in general and in data mining in particular, there is an increasing interest in developing general methods for data analysis. In order to be useful, such methods should be easy to extend with domain-specific knowledge. In pattern mining, the frequent sequence mining problem has already been studied in depth, but usually with a focus on efficiency and less on generality and extensibility. An important step in the development of more general approaches was the cSpade algorithm [20] which supports a variety constraints. It supports many constraints such as constraints on the length of the pattern, on the maximum gap in embeddings or on the discriminative power of the patterns between datasets. Many other constraints have been integrated into specific mining algorithms (e.g. [6,14,17,18]). However, none of these are truly generic in that adding c Springer International Publishing Switzerland 2015 L. Michel (Ed.): CPAIOR 2015, LNCS 9075, pp. 288–305, 2015. DOI: 10.1007/978-3-319-18008-3 20
Constraint-Based Sequence Mining Using Constraint Programming
289
extra constraints usually amounts to changing the data-structures used in the core of the algorithm. For itemset mining, the simplest form of pattern mining, it has been shown that constraint programming (CP) can be used as a generic framework for constraint-based mining [5] and beyond [11,15]. Recent works have also investigated the usage of CP-based approaches for mining sequences with explicit wildcards [3,7,8]. A wildcard represents the presence of exactly one arbitrary symbol in that position in the sequence. The main difference between mining itemsets, sequences with wildcards and standard sequences lies in the complexity of testing whether a pattern is included in another itemset/sequence, e.g. from the database. For itemsets, this is simply testing the subset inclusion relation which is easy to encode in
Data Loading...