

S.I. : SPIOT 2020

Two-stage adaptive integration of multi-source heterogeneous data based on an improved random subspace and prediction of default risk of microcredit

Anzhong Huang¹ · Fei Wu²

Received: 16 August 2020 / Accepted: 27 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Some scholars have shown that machine learning methods based on single-source data can successfully monitor the risks of formal financial activities, but not those of informal financial activities. This is because the data generated by formal financial activities, whether structured or unstructured, are high in both quality and quantity, while the data generated by informal financial activities are not. Multi-source data are therefore the key to monitoring the risks of informal financial activities through machine learning. Although a few studies have attempted to use multi-source data for financial risk prediction, they simply stack the multi-source data they obtain and ignore its original sources, heterogeneity, mutual redundancy and other characteristics, so the improvement in prediction performance is not obvious. This paper therefore constructs the TSAIB_RS method, based on two-stage adaptive integration of multi-source heterogeneous data, in which data with different sources and different distributions are adaptively integrated. To test the reliability of the TSAIB_RS method, the paper takes the default risk of microcredit in China as the test target and compares the prediction results of various methods. It concludes that the TSAIB_RS method can significantly improve prediction performance.

Keywords Multi-source heterogeneous data · Adaptive integration · Microcredit risk

✉ Fei Wu [email protected]

1 School of Economics and Management, Jiangsu University of Science and Technology, Zhenjiang 212003, China

2 School of Law, Shanghai University of Finance and Economics, Shanghai 200433, China

1 Introduction

Information asymmetry is the root cause of financial risks, and obtaining as much information as possible is the key to predicting them. Accordingly, some scholars raised the problem of multi-source information in financial risk monitoring early on [1, 2]: banks should use not only hard information (financial statement information) but also soft information (non-financial-statement information) to reduce credit risk. However, soft information is often unstructured data, which cannot be used by statistics and econometrics, the traditional financial risk prediction methods. This greatly limits improvements in the accuracy of financial risk prediction, because a large amount of information in the Internet era is unstructured. Machine learning is therefore an excellent supplement to the traditional methods of financial risk prediction. As for the relationship between the data used in machine learning and the prediction effect, Tsai found that the algorithm of risk