Discovering dependencies with reliable mutual information
Panagiotis Mandros 1 · Mario Boley 2 · Jilles Vreeken 3

Received: 11 March 2019 / Revised: 30 June 2020 / Accepted: 4 July 2020
© The Author(s) 2020
Abstract

We consider the task of discovering functional dependencies in data for target attributes of interest. To solve it, we have to answer two questions: How do we quantify the dependency in a model-agnostic and interpretable way, and reliably against sample size and dimensionality biases? How can we efficiently discover the exact or α-approximate top-k dependencies? We address the first question by adopting information-theoretic notions. Specifically, we consider the mutual information score, for which we propose a reliable estimator that enables robust optimization in high-dimensional data. To address the second question, we then systematically explore the algorithmic implications of using this measure for optimization. We show the problem is NP-hard and justify worst-case exponential-time as well as heuristic search methods. We propose two bounding functions for the estimator, which we use as pruning criteria in branch-and-bound search to efficiently mine dependencies with approximation guarantees. Empirical evaluation shows that the derived estimator has desirable statistical properties and that the bounding functions lead to effective exact and greedy search algorithms; when combined, qualitative experiments show the framework indeed discovers highly informative dependencies.

Keywords Information theory · Knowledge discovery · Approximate functional dependency · Pattern mining · Algorithms · Branch-and-bound
1 Introduction

Given data D from a joint distribution p(I, Y) over input variables I = {X_1, ..., X_d} and a target of interest Y, it is a fundamental problem in knowledge discovery to find subsets X ⊆ I
Panagiotis Mandros, [email protected]
Mario Boley, [email protected]
Jilles Vreeken, [email protected]

1 Max Planck Institute for Informatics and Saarland University, Saarland Informatics Campus, Saarbrücken, Germany
2 Monash University, Melbourne, Australia
3 CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
that jointly influence or (approximately) determine Y. These dependencies are essential for a variety of applications. In scientific domains, for example, the analysis often involves identifying compact sets of descriptors that capture the underlying process of the various phenomena under investigation [1,2]. The task of functional dependency discovery can be formulated as finding the top-k attribute subsets X_1*, ..., X_k* ⊆ I with

Q(X_i*; Y) = max{ Q(X; Y) : Q(X_{i-1}*; Y) ≥ Q(X; Y), X ⊆ I } ,    (1)
where Q is some function quantifying the functional dependency of Y on X. For an effective knowledge discovery procedure, Q should be able to identify any type of dependency, e.g., nonlinear or multivariate, without a priori assumptions on the underlying data-generating process p [3]. Moreover, solutions to Eq. (1), besi
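To make the formulation in Eq. (1) concrete, the following is a minimal sketch of top-k dependency discovery by exhaustive enumeration, using the naive plug-in (empirical) mutual information as Q. This is an illustrative baseline only: the paper's contribution is precisely a *reliable* estimator and branch-and-bound search that avoid this estimator's bias and exponential cost, and the function names here are hypothetical, not from the authors' implementation.

```python
from itertools import combinations
from collections import Counter
from math import log2

def plug_in_mi(rows, idxs, y):
    """Empirical (plug-in) mutual information I(X_idxs; Y) in bits.

    rows: list of tuples (one tuple per sample over all input attributes)
    idxs: indices of the attribute subset X
    y:    list of target values, aligned with rows
    """
    n = len(rows)
    xs = [tuple(r[i] for i in idxs) for r in rows]
    cx, cy, cxy = Counter(xs), Counter(y), Counter(zip(xs, y))
    return sum((c / n) * log2((c / n) / ((cx[a] / n) * (cy[b] / n)))
               for (a, b), c in cxy.items())

def top_k_dependencies(rows, y, k=3, max_size=2):
    """Brute-force Eq. (1): score every subset up to max_size, return the top k."""
    d = len(rows[0])
    scored = []
    for size in range(1, max_size + 1):
        for idxs in combinations(range(d), size):
            scored.append((plug_in_mi(rows, idxs, y), idxs))
    # Sort by descending score; ties broken lexicographically on the index tuple.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]

# Example: Y is the XOR of attributes 0 and 1, so no single attribute is
# informative but the pair {X_0, X_1} determines Y exactly (1 bit).
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
y = [0, 1, 1, 0]
print(top_k_dependencies(rows, y, k=1))
```

Note that the plug-in estimate systematically overestimates dependency for small samples and high-cardinality subsets, which is exactly the sample-size and dimensionality bias the paper's reliable estimator is designed to correct.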