An interdisciplinary comparison of sequence modeling methods for next-element prediction



SPECIAL SECTION PAPER

Niek Tax1 · Irene Teinemaa2 · Sebastiaan J. van Zelst3,4

Received: 31 October 2018 / Revised: 3 December 2019 / Accepted: 9 March 2020

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

Data of sequential nature arise in many application domains in the form of, e.g., textual data, DNA sequences, and software execution traces. Different research disciplines have developed methods to learn sequence models from such datasets: (i) in the machine learning field, methods such as (hidden) Markov models and recurrent neural networks have been developed and successfully applied to a wide range of tasks; (ii) in process mining, process discovery methods aim to generate human-interpretable descriptive models; and (iii) in the grammar inference field, the focus is on finding descriptive models in the form of formal grammars. Despite their different focuses, these fields share a common goal: learning a model that accurately captures the sequential behavior in the underlying data. Such sequence models are generative, i.e., they are able to predict which elements are likely to occur after a given incomplete sequence. So far, these fields have developed mainly in isolation from each other, and no comparison exists. This paper presents an interdisciplinary experimental evaluation that compares sequence modeling methods on the task of next-element prediction on four real-life sequence datasets. The results indicate that machine learning methods, which generally do not aim at model interpretability, tend to outperform methods from the process mining and grammar inference fields in terms of accuracy.

Keywords Process mining · Machine learning · Grammar inference · Sequence prediction
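To make the next-element prediction task concrete, the following is a minimal illustrative sketch (not taken from the paper) of one of the simplest generative sequence models mentioned in the abstract, a first-order Markov model: it estimates transition counts from a set of example sequences and predicts the most likely element to follow an incomplete sequence. The function names `train_markov` and `predict_next` and the toy event log are assumptions introduced here for illustration.

```python
from collections import Counter, defaultdict

def train_markov(sequences):
    """Estimate first-order transition counts from a list of sequences."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(transitions, prefix):
    """Predict the most likely next element after an incomplete sequence,
    conditioning only on its last element (the first-order assumption)."""
    counts = transitions.get(prefix[-1])
    if not counts:
        return None  # last element was never observed as a predecessor
    return counts.most_common(1)[0][0]

# Toy dataset: each sequence is a trace of activity labels.
log = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
model = train_markov(log)
print(predict_next(model, ["a", "b"]))  # -> 'c' ('c' follows 'b' twice, 'd' once)
```

The methods compared in the paper (RNNs, process discovery, grammar inference) differ in how much context they condition on and in how interpretable the resulting model is, but they all address this same prediction interface: given a prefix, rank the possible next elements.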

Communicated by Rainer Schmidt and Jens Gulden.

Niek Tax [email protected]
Irene Teinemaa [email protected]
Sebastiaan J. van Zelst [email protected]

1 Eindhoven University of Technology, Eindhoven, The Netherlands
2 University of Tartu, Tartu, Estonia
3 Fraunhofer Institute for Applied Information Technology FIT, Sankt Augustin, Germany
4 RWTH Aachen University, Aachen, Germany

1 Introduction

A large share of the world's data naturally occurs in sequences. Examples thereof include textual data, e.g., sequences of letters/words, DNA sequences, Web browsing behavior, and execution traces of business processes or of software systems. Several different research fields have focused on the development of tools and methods to model and describe such sequence data. However, these research fields mostly operate independently from each other, with little knowledge transfer between them. Nonetheless, the methods from the different research fields generally share the same common goal: learning a descriptive model from a dataset of sequences such that the model accurately generalizes from the sequences that are present in the dataset. The three research communities that developed sequence model