Gain ratio weighted inverted specific-class distance measure for nominal attributes
ORIGINAL ARTICLE
Fang Gong¹ · Liangxiao Jiang² · Huan Zhang² · Dianhong Wang³ · Xingfeng Guo⁴
Received: 27 May 2019 / Accepted: 25 February 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
Enhancing distance measures is key to improving the performance of many machine learning algorithms, such as instance-based learning algorithms. The inverted specific-class distance measure (ISCDM) is among the top-performing distance measures for nominal attributes in the presence of missing values and non-class attribute noise in the training set, but it still requires the attribute independence assumption. This assumption is rarely true in reality, which harms the performance of the ISCDM in applications with complex attribute dependencies. Thus, in this study we propose an improved ISCDM that uses attribute weighting to circumvent the attribute independence assumption. In our improved ISCDM, we simply define the weight of each attribute as its gain ratio, and we therefore call it the gain ratio weighted ISCDM (GRWISCDM for short). We tested the GRWISCDM experimentally on 29 University of California at Irvine (UCI) datasets, and found that it significantly outperforms the original ISCDM and some other state-of-the-art competitors in terms of the negative conditional log likelihood and root relative squared error.

Keywords Distance metric learning · Specific-class · Attribute weighting · Gain ratio
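To make the weighting idea concrete, the following minimal sketch computes the gain ratio of a single nominal attribute. It assumes the standard definition (information gain divided by split information); the exact formulation used for the GRWISCDM weights, including any handling of missing values, is not shown in this excerpt. The names `entropy` and `gain_ratio` are illustrative and do not come from the article.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of one nominal attribute with respect to the class.

    Assumption: standard definition, information gain / split information.
    Missing values are not handled in this sketch.
    """
    n = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Class entropy remaining after splitting on the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information is the entropy of the attribute itself.
    split_info = entropy(values)
    # An attribute with a single value carries no information.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: the attribute perfectly separates the two classes.
weight = gain_ratio(["a", "a", "b", "b"], ["x", "x", "y", "y"])
```

Under this assumption, an attribute that separates the classes perfectly receives the largest weight, while an attribute that is independent of the class receives a weight of zero and so contributes nothing to a weighted distance.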
* Liangxiao Jiang
1 School of Automation, China University of Geosciences, Wuhan 430074, China
2 School of Computer Science, China University of Geosciences, Wuhan 430074, China
3 School of Mechanical and Electronic Information, China University of Geosciences, Wuhan 430074, China
4 School of Electrical and Information Engineering, Wuhan Institute of Technology, Wuhan 430205, China

1 Introduction
Distance measures are utilized to measure the similarity between two instances, and are also key to achieving a strong classification performance for many machine learning algorithms, such as instance-based learning [1], self-organizing [31], radial basis functions [5], and k-means clustering [35]. Owing to their importance, distance functions have been researched by the machine learning community and successfully applied to many real-world application domains, such as information retrieval, image processing, face recognition, cognitive psychology, and bioinformatics [3, 23, 46]. Many distance measures have been proposed, and most of these perform well for numeric rather than nominal attributes. When all of the attributes are nominal, the overlapping metric (OM) [1, 9, 52] is broadly employed, owing to its simplicity and efficiency. However, the OM fails to exploit the additional information provided b