Bayesian Matrix Factorization for Outlier Detection: An Application in Population Genetics


Nicolas Duforet-Frebourg and Michael G.B. Blum

Abstract

We present a new Bayesian hierarchical model based on matrix factorization for detecting outliers in high-dimensional data. Outliers are explicitly modeled using both a shift-in-mean and a variance-inflation approach. The Bayesian framework provides intrinsic probabilities of being an outlier for each element in the sample. Posterior replicates of the parameters are simulated using an MCMC algorithm. In population genetics, where many genetic markers are typed in different populations, we show that this model can be used to detect genes targeted by Darwinian selection.

28.1 Introduction

Matrix factorization aims at decomposing a high-dimensional n × p data matrix into a product of two matrices of lower rank K, called the factor and loading matrices [4]. Matrix factorization provides a useful framework to model outliers in the lower-dimensional space generated by the low-rank approximation [3]. Detecting outliers in high-dimensional data sets is of interest in population genetics in order to detect genes under selective pressures [1]. The proposed approach provides an intrinsic probability of being an outlier, so that we can estimate the false discovery rate (FDR) and q-values, which are two important quantities in whole-genome scans [6]. We provide an MCMC algorithm to sample replicates from the posterior distribution, and we show how the method can detect genes under selection in population genetics data.
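To illustrate how posterior outlier probabilities can lead to FDR estimates and q-values, the following minimal sketch uses one common Bayesian convention: the estimated FDR of a list is the average posterior probability of not being an outlier over that list. This convention, the function name `bayesian_qvalues`, and the input probabilities are assumptions made for illustration and are not taken from the method described here.

```python
def bayesian_qvalues(post_prob_outlier):
    """Return one q-value per item from posterior outlier probabilities.

    Assumed convention (illustrative only): the FDR of the list formed by
    the k most probable outliers is the mean of (1 - probability) over
    that list, and the q-value of an item is the FDR of the smallest such
    list that contains it.
    """
    n = len(post_prob_outlier)
    # Rank items from most to least probable outlier.
    order = sorted(range(n), key=lambda i: -post_prob_outlier[i])
    qvalues = [0.0] * n
    running_sum = 0.0
    for rank, i in enumerate(order, start=1):
        # Posterior probability that item i is a false discovery.
        running_sum += 1.0 - post_prob_outlier[i]
        # Running mean over the top-`rank` items; it is non-decreasing
        # because the summed terms are sorted in increasing order.
        qvalues[i] = running_sum / rank
    return qvalues


# Hypothetical posterior probabilities for four markers.
probs = [0.99, 0.2, 0.9, 0.6]
qvals = bayesian_qvalues(probs)
```

Under this convention, selecting all items with a q-value below a threshold α gives a list whose estimated FDR is at most α.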

N. Duforet-Frebourg • M.G.B. Blum, Laboratoire TIMC-IMAG UMR 5525, Centre National de la Recherche Scientifique, Université Joseph Fourier, Grenoble, France

E. Lanzarone and F. Ieva (eds.), The Contribution of Young Researchers to Bayesian Statistics, Springer Proceedings in Mathematics & Statistics 63, DOI 10.1007/978-3-319-02084-6_28, © Springer International Publishing Switzerland 2014


28.2 Bayesian Matrix Factorization for Outlier Detection

28.2.1 Model

The probabilistic model of matrix factorization (also known as the factor model or probabilistic PCA model) for an $n \times p$ design matrix $Y$ relies on a product between a factor matrix $F$ and a loading matrix $\Lambda$:

$$Y = F \Lambda + \epsilon, \tag{1}$$

where $F$ is an $n \times K$ matrix, $\Lambda$ is a $K \times p$ matrix, and $\epsilon$ is an $n \times p$ residual matrix in which each row $\epsilon_i \sim \mathcal{N}(0_p, \sigma^2 I_p)$. Here, we choose a Gaussian prior for $\Lambda$:

$$p(\Lambda \mid \sigma_\Lambda^2) = \prod_{j=1}^{p} \mathcal{N}(\Lambda_j;\, 0_K,\, \sigma_\Lambda^2 I_K). \tag{2}$$
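As a minimal sketch of the generative model in Eqs. (1) and (2), the following code draws a toy data matrix using only the Python standard library. The function name `simulate_factor_model`, the default values of `sigma` and `sigma_lambda`, and the standard normal draws used for the factors are assumptions made for illustration.

```python
import random


def simulate_factor_model(n, p, K, sigma=0.1, sigma_lambda=1.0, seed=0):
    """Draw Y = F Lambda + eps for a toy factor model.

    Assumptions (illustrative only): the factors F are drawn i.i.d.
    N(0, 1), and the default variance parameters are arbitrary.
    """
    rng = random.Random(seed)
    # Loadings: each column Lambda_j ~ N(0_K, sigma_lambda^2 I_K), Eq. (2).
    Lam = [[rng.gauss(0.0, sigma_lambda) for _ in range(p)] for _ in range(K)]
    # Factors: placeholder standard normal draws (assumption).
    F = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(n)]
    # Data: low-rank product plus residuals eps_i ~ N(0_p, sigma^2 I_p), Eq. (1).
    Y = [
        [
            sum(F[i][k] * Lam[k][j] for k in range(K)) + rng.gauss(0.0, sigma)
            for j in range(p)
        ]
        for i in range(n)
    ]
    return Y, F, Lam


# Toy dimensions: 5 individuals, 8 markers, 2 factors.
Y, F, Lam = simulate_factor_model(n=5, p=8, K=2)
print(len(Y), len(Y[0]))  # prints "5 8"
```

Setting `sigma=0.0` removes the residual term, so that `Y` has rank at most `K` by construction.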

To specify the prior of $F$, we explicitly model outliers using the shift-in-mean approach [5] for one of the $K$ factors of the low-rank approximation

$$p(F \mid A, Z, \Sigma_F) = \prod_{i=1}^{n} \mathcal{N}(F_i;\, 0_K + A_i^{(Z_i)},\, \Sigma_F), \tag{3}$$

where $\Sigma_F$ is a diagonal matrix with values $\sigma_{F_k}^2$. We specify improper priors for variances $p(\sigma_\Lambda^2) \propto 1/\sigma_\Lambda^2$ and $p(\sigma_{F_k}^2) \propto 1/\sigma_{F_k}^2$. Shift vectors $A_i$ are zero-valued vectors with nonzero component at index $Z_i$. For $i = 1, \dots, n$, $Z_i$ is an integer between 0 and $K$, indicating that the $i$th line is eit