Clustering non-linear interactions in factor analysis

Erick da Conceição Amorim¹ · Vinícius Diniz Mayrink¹

Received: 28 March 2020 / Accepted: 29 August 2020
© Sapienza Università di Roma 2020

Abstract

Factor analysis is a powerful tool for dimensionality reduction in multivariate studies. This study extends the factor model with non-linear interactions. The main contribution of our work is to present two approaches to cluster the non-linear interactions and thus develop new models that are not restricted to the extreme scenarios where all non-null interactions are different or all are the same. The first strategy to handle the clusters involves a finite mixture of degenerate components. The second option is specified via the Dirichlet process. A comprehensive simulation study is developed to explore the performance of the proposals. A sensitivity analysis is carried out to evaluate the advantages of estimating a smoothness parameter defined in the covariance function of the Gaussian process that establishes the non-linearity of the interactions. In terms of application, the methodology is illustrated with the analysis of gene expression levels related to four breast cancer data sets. The genes belonging to disjoint genome regions, with copy number alteration, are connected to the main factors and their non-linear interactions are estimated and clustered. The joint investigation and comparison of these four breast cancer data sets is rarely found in the literature.

Keywords Mixture · Dirichlet process · Gene expression · Breast cancer · Microarray

Vinícius Diniz Mayrink
[email protected]; [email protected]

¹ Departamento de Estatistica, ICEx, Universidade Federal de Minas Gerais, Av. Antonio Carlos, 6627, Belo Horizonte, MG 31270-901, Brazil

1 Introduction

Computational advances in the past decades and the use of Markov chain Monte Carlo (MCMC) methods [7] have stimulated the use of different factor models [13] in the Bayesian context to analyze multivariate problems where dimensionality reduction is necessary. In particular, the factor analytic approach provides interesting interpretations for the study of underlying structures in gene expression data; examples include [2,14,30]. The study in [16] investigates small regions along the genome affected by Copy Number Alteration (CNA), where an atypical amount of mRNA (much higher or lower than expected) is produced. As a result of this abnormality, microarrays tend to exhibit higher/lower light intensities than expected due to the DNA duplication/deletion, respectively. The genome regions considered in this work were detected in [15], which applies factor analysis to find CNA locations associated with lactic acidosis and hypoxia response in tumors. The search for CNA regions is also addressed in [21,24]. Identifying genome regions with CNA is important for explaining the progression of a cancer; however, the detection of such regions is not the focus of our work. Here, the analysis to understand the disease pr