An experimental study of graph-based semi-supervised classification with additional node information

Bertrand Lebichot1 · Marco Saerens1

Received: 26 May 2018 / Revised: 21 July 2020 / Accepted: 25 July 2020 / Published online: 9 October 2020 © Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract The volume of data generated by the internet and social networks is increasing every day, and there is a clear need for efficient ways of extracting useful information from them. As this information can take different forms, it is important to use all the available data representations for prediction; this is often referred to as multi-view learning. In this paper, we consider semi-supervised classification using both regular tabular data and structural information coming from a network structure (feature-rich networks). Sixteen techniques are compared and can be divided into three families: the first uses only the plain features to fit a classification model, the second uses only the network structure, and the last combines both information sources. These three settings are investigated on 10 real-world datasets. Furthermore, network embedding and well-known autocorrelation indicators from spatial statistics are also studied. Possible applications include automatic classification of web pages or other linked documents, of nodes in a social network, or of proteins in a biological complex system, to name a few. Based on our findings, we draw some general conclusions and give advice on tackling this particular classification task: it is clearly observed that some dataset labelings can be better explained by their graph structure or by their feature set.

Keywords Network data analysis · Semi-supervised classification · Link analysis · Graph mining · Multi-view learning

1 Introduction

Nowadays, with the increasing volume of data generated, for instance, by the internet and social networks, there is a need for efficient ways to infer useful information from those network-based data. Moreover, these data can take several different forms and, in that case, it would be useful to use these alternative views in the prediction model; this is exactly the purpose of multi-view learning [77,86]. In this paper, we focus our attention on supervised classification

Bertrand Lebichot, [email protected]

1 Machine Learning Group – ICTEAM & LSM, Université catholique de Louvain, Place des Doyens 1, 1348 Louvain-la-Neuve, Belgium

using both regular tabular data defined on nodes and structural information coming from graphs or networks.1 Such data are sometimes called feature-rich networks. Of course, as discussed in [26] (see, e.g., [46] for a survey), many different approaches have been developed for information fusion in machine learning, pattern recognition and applied statistics. These include [26] simple weighted averages (see, e.g., [15,40]), Bayesian fusion (see, e.g., [15,40]), majority vote (see, e.g., [13,43,47]), models coming from uncertainty reasoning [44] (see, e.g., [