Three oversampling methods applied in a comparative landslide spatial research in Penang Island, Malaysia



Three oversampling methods applied in a comparative landslide spatial research in Penang Island, Malaysia Han Gao1 · Pei Shan Fam1 · Lea Tien Tay2 · Heng Chin Low3 Received: 9 March 2020 / Accepted: 6 August 2020 © Springer Nature Switzerland AG 2020

Abstract
Two main problems in landslide spatial prediction research are the scarcity of landslide (minority) samples for training models and the mistaken assignment of equal costs to different types of misclassification. To handle these problems properly, this research pursues two main objectives: to augment the landslide sample data efficiently and to assign appropriate unequal costs to the two types of error when training and evaluating models. Resampling techniques, including the random oversampling technique, the synthetic minority oversampling technique (SMOTE) and the self-creating oversampling technique (SCOTE), are used to augment the minority-class samples. Logistic regression (LR) and support vector machine (SVM) models are used for landslide spatial classification, and receiver operating characteristic (ROC) and cost curves are used to evaluate them. The results show that the SVM models trained on the dataset generated by SCOTE with a sample size of 10,000 have the best prediction performance. A nonparametric test, the Kruskal–Wallis test, is used to test for differences between sample-size groups; it shows that the LR models are more sensitive to changes in sample size. Two landslide susceptibility maps are produced from the best-performing models. Verification shows that both maps successfully predict more than 86% of the susceptible area, which can provide valid information on landslide mitigation and prediction to the local authorities.

Keywords  Landslide susceptibility mapping · Wilcoxon rank-sum test · Cost curve · Self-creating oversampling · Support vector machine
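The resample-then-classify workflow the abstract describes can be sketched as follows. This is a minimal illustration on hypothetical synthetic data; SCOTE is the authors' own method and is not reproduced here, so the simplest of the three techniques, random oversampling, stands in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 950 non-landslide (0) vs 50 landslide (1)
# grid cells, each described by 5 conditioning factors.
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)

def random_oversample(X, y, minority_label=1, rng=rng):
    """Duplicate minority samples (with replacement) until classes balance."""
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    deficit = majority_idx.size - minority_idx.size
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

X_bal, y_bal = random_oversample(X, y)
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 950 950
```

The balanced `(X_bal, y_bal)` would then be fed to an LR or SVM classifier; the function and variable names above are illustrative, not from the paper.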

1 Introduction
Landslides are a frequently occurring type of natural disaster that causes huge losses of life and property every year around the globe [24, 39]. The landslide susceptibility mapping (LSM) technique is a popular and powerful tool for landslide spatial assessment (LSA), playing an essential role in landslide management and mitigation. A general problem in analyzing landslide data is the high imbalance ratio between non-landslide (majority) and landslide (minority) samples, a problem intrinsic to the landslide spatial prediction domain [51]. In other words, the

landslide samples in the raw data are rare compared to the non-landslide samples. Generally, landslide data provide more useful and valuable information than non-landslide data for geological experts and data scientists conducting LSA, since the main objective of LSA is to predict landslides successfully. However, non-landslide data are useful as well, to avoid overfitting and a generalized overestimation of hazard. If researchers select samples directly and randomly from the raw dataset, a severe imbalance ratio will probably occur, which
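One way to augment the rare minority class without plain duplication is SMOTE-style interpolation between a landslide sample and one of its nearest landslide neighbours. The sketch below is a simplified version of standard SMOTE on hypothetical data, not the authors' SCOTE variant:

```python
import numpy as np

rng = np.random.default_rng(1)

def smote_like(X_min, n_new, k=5, rng=rng):
    """Simplified SMOTE: each synthetic point is interpolated between a
    randomly chosen minority sample and one of its k nearest
    minority-class neighbours."""
    n = X_min.shape[0]
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))         # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

X_min = rng.normal(size=(50, 5))      # hypothetical landslide samples
X_new = smote_like(X_min, n_new=900)  # synthesize 900 additional samples
print(X_new.shape)                    # (900, 5)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region the minority class already occupies, unlike random duplication, which only reweights existing points.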