Distributions of pattern statistics in sparse Markov models

  • PDF / 1,149,069 Bytes
  • 19 Pages / 439.37 x 666.142 pts Page_size
  • 74 Downloads / 229 Views

DOWNLOAD

REPORT


Distributions of pattern statistics in sparse Markov models Donald E. K. Martin1 Received: 28 July 2018 / Revised: 27 January 2019 © The Institute of Statistical Mathematics, Tokyo 2019

Abstract Markov models provide a good approximation to probabilities associated with many categorical time series, and thus they are applied extensively. However, a major drawback associated with them is that the number of model parameters grows exponentially in the order of the model, and thus only very low-order models are considered in applications. Another drawback is lack of flexibility, in that Markov models give relatively few choices for the number of model parameters. Sparse Markov models are Markov models with conditioning histories that are grouped into classes such that the conditional probability distribution for members of each class is constant. The model gives a better handling of the trade-off between bias associated with having too few model parameters and variance from having too many. In this paper, methodology for efficient computation of pattern distributions through Markov chains with minimal state spaces is extended to the sparse Markov framework. Keywords Auxiliary Markov chain · Pattern distribution · Sparse Markov model · Variable length Markov chain

1 Introduction Mining data for exceptional patterns can lead to important discoveries, such as sites on DNA sequences that are recognized by various agents, locations of an outbreak of an epidemic, changes in a production process, and similar segments of biological sequences. In using a pattern statistic for detection of special segments, one models the data sequence under a null hypothesis that characterizes what is typical, and uses the model to quantify exceptionality of the statistic (through the associated p value). Pattern statistics and their distributions may also be used to summarize important properties of a sequence. A method is therefore needed to obtain their distributions.

B 1

Donald E. K. Martin [email protected] Department of Statistics, North Carolina State University, 4272 SAS Hall, 2311 Stinson Drive, Raleigh, NC 27695-8203, USA

123

D. E. K. Martin

Whereas Monte Carlo methods may be used to approximate probabilities, large numbers of replicates are typically required for accurate results, especially for events whose occurrence is rare. More accurate methods that are also less computationally expensive are desired. A simple but powerful approach for computing exact distributions of pattern statistics in an mth-order discrete Markovian sequence X ≡ X 1 , . . . , X n is to set up a correspondence between events related to the statistic of interest and events related to an auxiliary Markov chain (AMC) Y ≡ Y1 , . . . , Yn . This allows using basic properties of Markov chains to compute the desired probabilities. The approach was forwarded in Brookner (1966) and became popular in more recent years with the work of Fu and Koutras (1994) and Koutras and Alexandrou (1995). In the present paper, the main focus is on extending Markov chain-base