Sampling on networks: estimating spectral centrality measures and their impact in evaluating other relevant network meas

  • PDF / 3,924,720 Bytes
  • 29 Pages / 595.276 x 790.866 pts Page_size
  • 108 Downloads / 180 Views

DOWNLOAD

REPORT


plied Network Science

Open Access

RESEARCH

Sampling on networks: estimating spectral centrality measures and their impact in evaluating other relevant network measures Nicolò Ruggeri*  and Caterina De Bacco *Correspondence: nicolo.ruggeri@tuebingen. mpg.de Max Planck Institute for Intelligent Systems, Max‑Planck Ring 4, 72076 Tübingen, Germany

Abstract  We perform an extensive analysis of how sampling impacts the estimate of several relevant network measures. In particular, we focus on how a sampling strategy optimized to recover a particular spectral centrality measure impacts other topological quantities. Our goal is on one hand to extend the analysis of the behavior of TCEC (Ruggeri and De Bacco, in: Cherifi, Gaito, Mendes, Moro, Rocha (eds) Complex networks and their applications VIII, Springer, Cham, pp 90–101, 2020), a theoretically-grounded sampling method for eigenvector centrality estimation. On the other hand, to demonstrate more broadly how sampling can impact the estimation of relevant network properties like centrality measures different than the one aimed at optimizing, community structure and node attribute distribution. In addition, we analyze sampling behaviors in various instances of network generative models. Finally, we adapt the theoretical framework behind TCEC for the case of PageRank centrality and propose a sampling algorithm aimed at optimizing its estimation. We show that, while the theoretical derivation can be suitably adapted to cover this case, the resulting algorithm suffers of a high computational complexity that requires further approximations compared to the eigenvector centrality case. Main contributions (a) Extensive empirical analysis of the impact of the TCEC sampling method (optimized for eigenvector centrality recovery) on different centrality measures, community structure, node attributes and statistics related to specific network generative models; (b) extending TCEC to optimize PageRank estimation. Keywords:  Sampling on network, Eigenvector centrality, PageRank, Centrality measures

Introduction When investigating real-world network datasets we often do not have access to the entire network information. This is the case of large datasets, having limited storage capacity or limited resources during the data collection phase. Nevertheless, this should not prevent practitioners from analyzing an available network sample. In fact, evaluating network properties while accessing only a smaller sample is a relevant problem in various fields, ranging from modelling dynamical processes (De Choudhury et al. 2010; Sadikov et  al. 2011), network statistics estimation (Leskovec and Faloutsos 2006), data compression (Adler and Mitzenmacher 2001) and survey design (Frank 2005). Imagining © The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, pro