
RESEARCH ARTICLE - ELECTRICAL ENGINEERING

A Method of Classification Performance Improvement Via a Strategy of Clustering-Based Data Elimination Integrated with k-Fold Cross-Validation

Onur Inan1 · Mustafa Serter Uzer2

Received: 14 February 2020 / Accepted: 17 September 2020
© King Fahd University of Petroleum & Minerals 2020

Abstract Non-system errors that occur during data entry or data collection create noisy data that reduce the success of classification systems. To eliminate these data, a classification system was developed with a new data reduction method, named MKMA-RAC, which consists of a modified k-means algorithm that uses relief algorithm coefficients. The main theme of this article is the elimination of noisy data and its consistent application within a classification system evaluated by k-fold cross-validation. In the developed system, the support vector machine (SVM), linear discriminant analysis (LDA) and decision tree classifiers are integrated with MKMA-RAC-based data reduction, which removes noisy samples from the training data in every fold. The data reduction process is not applied to the test data. The datasets used were the Hepatitis, Liver Disorders, SPECT Images and Statlog (Heart) datasets taken from the UCI database. Classification performance values obtained with and without the proposed method under tenfold CV are reported for these datasets. For the Hepatitis, Liver Disorders, SPECT Images and Statlog (Heart) datasets, respectively, the classification success rates of the proposed system were 96.88%, 74.56%, 87.24% and 90.00% with the SVM classifier; 94.91%, 69.05%, 82.38% and 88.52% with the LDA classifier; and 96.25%, 77.73%, 88.77% and 89.63% with the decision tree classifier. The test results show that the proposed system generally achieved higher classification performance than results reported in the literature. The performance is therefore encouraging for pattern recognition applications.

Keywords Clustering-based data elimination · Relief · Medical dataset classification
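The fold-wise protocol described in the abstract (reduce only the training data of each fold, leave the test data untouched) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MKMA-RAC: plain k-means stands in for the relief-weighted modified k-means, a nearest-centroid rule stands in for the SVM, LDA and decision tree classifiers, and the elimination criterion (drop a training sample whose label disagrees with the majority label of its cluster) is assumed for illustration. All function names and the toy data are hypothetical.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(X, n_clusters, iters=20, seed=0):
    """Plain Lloyd's k-means (assumption: stand-in for the modified k-means)."""
    centers = random.Random(seed).sample(X, n_clusters)
    assign = [0] * len(X)
    for _ in range(iters):
        assign = [min(range(n_clusters), key=lambda c: sq_dist(x, centers[c]))
                  for x in X]
        for c in range(n_clusters):
            members = [x for x, a in zip(X, assign) if a == c]
            if members:
                centers[c] = mean(members)
    return assign

def eliminate_noisy(X, y, n_clusters=2):
    """Assumed criterion: keep samples that match their cluster's majority label."""
    assign = kmeans(X, n_clusters)
    keep = []
    for c in range(n_clusters):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        labels = [y[i] for i in members]
        majority = max(set(labels), key=labels.count)
        keep += [i for i in members if y[i] == majority]
    return [X[i] for i in keep], [y[i] for i in keep]

def fit_centroids(X, y):
    """Nearest-centroid classifier (assumption: stand-in for SVM/LDA/tree)."""
    return {lab: mean([x for x, t in zip(X, y) if t == lab]) for lab in set(y)}

def predict(centroids, x):
    return min(centroids, key=lambda lab: sq_dist(x, centroids[lab]))

def cross_validate(X, y, k=10, apply_reduction=True):
    """k-fold CV; data reduction touches the training part of each fold only."""
    correct = 0
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        if apply_reduction:
            X_tr, y_tr = eliminate_noisy(X_tr, y_tr)
        model = fit_centroids(X_tr, y_tr)
        # Test samples are scored as-is, including any noisy ones.
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Toy data: two well-separated groups with one flipped label in each.
X_demo = [[0.1 * i, 0.0] for i in range(20)] + [[10 + 0.1 * i, 0.0] for i in range(20)]
y_demo = [0] * 20 + [1] * 20
y_demo[0], y_demo[20] = 1, 0
print("accuracy with reduction:", cross_validate(X_demo, y_demo, k=10))
print("accuracy without reduction:", cross_validate(X_demo, y_demo, k=10, apply_reduction=False))
```

Running the reduction inside the loop, rather than once on the full dataset, keeps information from the held-out fold out of the cleaning step, which is the consistency property the abstract emphasizes.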

Abbreviations
MKMA-RAC: Modified k-means algorithm using relief algorithm coefficients
k-fold CV: k-Fold cross-validation
SVM: Support vector machine
FS: Feature selection
SPECT: Single-photon emission computed tomography
PPV: Positive predictive value
NPV: Negative predictive value
LDA: Linear discriminant analysis

Onur Inan [email protected]
Mustafa Serter Uzer [email protected]

1 Computer Engineering, Faculty of Engineering and Architecture, Necmettin Erbakan University, Konya, Turkey
2 Electronics and Automation, Selcuk University Ilgın Vocational School, Konya, Turkey

1 Introduction

The large volumes of data obtained from medical treatment and diagnosis processes have become one of the most important application areas for pattern recognition and data mining techniques. These medical datasets are also used to test newly developed artificial intelligence techniques.