Opening the Black Box: Revealing Interpretable Sequence Motifs in Kernel-Based Learning Algorithms

1 Berlin Institute of Technology, 10587 Berlin, Germany ([email protected], {nico.goernitz,klaus-robert.mueller}@tu-berlin.de)
2 Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Republic of Korea
3 Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA ([email protected])
4 Humboldt University of Berlin, 10099 Berlin, Germany ([email protected])

Abstract. This work is in the context of kernel-based learning algorithms for sequence data. We present a probabilistic approach to automatically extract, from the output of such string-kernel-based learning algorithms, the subsequences—or motifs—truly underlying the machine’s predictions. The proposed framework views motifs as free parameters in a probabilistic model, which is solved through a global optimization approach. In contrast to prevalent approaches, the proposed method can discover even difficult, long motifs, and can be combined with any kernel-based learning algorithm that is based on an adequate sequence kernel. We show that, by using a discriminative kernel machine such as a support vector machine, the approach can reveal discriminative motifs underlying the kernel predictor. We demonstrate the efficacy of our approach through a series of experiments on synthetic and real data, including problems from handwritten digit recognition and a large-scale human splice site data set from the domain of computational biology.

1 Introduction

In view of the rapidly increasing amount of data collected in science and technology, effective automation of decisions is necessary. To this end, kernel-based methods [13,17,19,26,31,32] such as support vector machines (SVMs) [5,7] have found diverse applications owing to their distinct merits: decent computational complexity, high usability, and a solid mathematical foundation [24]. Kernel-based learning allows us to obtain more complex nonlinear learning machines from simple linear ones in a canonical way, since the learning and data-representation processes are decoupled in a modular fashion. Yet, after more than a decade of research, kernel methods are still widely regarded as black boxes, and it remains an unsolved problem to make their decisions accessible or interpretable to domain experts. This is especially pressing in the natural and life sciences, where the foremost aim is not maximum prediction accuracy but unveiling the underlying natural principles.

M.M.-C. Vidovic et al. In: A. Appice et al. (Eds.): ECML PKDD 2015, Part II, LNAI 9285, pp. 137–153. © Springer International Publishing Switzerland 2015. DOI: 10.1007/978-3-319-23525-7_9

In several important application fields, the data exhibits an inherent sequence structure. This includes DNA sequences in genomics, text data in natural language processing, and speech data in speech recognition. A state-of-the-art approach to learning from such sequence data is the weighted-degree (WD) kernel [4,27,28,31] in combination with a kernel-based learning machine such as an SVM. Given two d
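As a rough illustration of the kind of sequence kernel discussed here, the following is a minimal sketch of a standard weighted-degree kernel under the common weighting scheme β_k = 2(d − k + 1)/(d(d + 1)); the function name and interface are ours for illustration, not the authors' implementation:

```python
def wd_kernel(x, y, d, betas=None):
    """Weighted-degree (WD) string kernel (illustrative sketch).

    Sums, over substring lengths k = 1..d, a weight beta_k times the
    number of positions at which x and y carry the same k-mer.
    Both sequences must have equal length, as the WD kernel compares
    substrings position-wise.
    """
    assert len(x) == len(y), "WD kernel expects equal-length sequences"
    L = len(x)
    if betas is None:
        # Common default weighting: shorter matches contribute more.
        betas = [2.0 * (d - k + 1) / (d * (d + 1)) for k in range(1, d + 1)]
    value = 0.0
    for k in range(1, d + 1):
        # Count positions i where the k-mers of x and y agree exactly.
        matches = sum(1 for i in range(L - k + 1) if x[i:i + k] == y[i:i + k])
        value += betas[k - 1] * matches
    return value
```

For example, `wd_kernel("ACGT", "ACGT", d=2)` counts four matching 1-mers and three matching 2-mers, while two sequences with no common substrings yield a kernel value of zero. In practice such a kernel is plugged into an SVM as a precomputed Gram matrix.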