Hubness-based fuzzy measures for high-dimensional k -nearest neighbor classification

  • PDF / 994,123 Bytes
  • 14 Pages / 595.276 x 790.866 pts Page_size
  • 23 Downloads / 175 Views

DOWNLOAD

REPORT


ORIGINAL ARTICLE

Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification Nenad Tomasˇev • Milosˇ Radovanovic´ Dunja Mladenic´ • Mirjana Ivanovic´



Received: 9 March 2012 / Accepted: 19 November 2012  Springer-Verlag Berlin Heidelberg 2012

Abstract Most data of interest today in data-mining applications is complex and is usually represented by many different features. Such high-dimensional data is by its very nature often quite difficult to handle by conventional machine-learning algorithms. This is considered to be an aspect of the well known curse of dimensionality. Consequently, high-dimensional data needs to be processed with care, which is why the design of machine-learning algorithms needs to take these factors into account. Furthermore, it was observed that some of the arising highdimensional properties could in fact be exploited in improving overall algorithm design. One such phenomenon, related to nearest-neighbor learning methods, is known as hubness and refers to the emergence of very influential nodes (hubs) in k-nearest neighbor graphs. A crisp weighted voting scheme for the k-nearest neighbor classifier has recently been proposed which exploits this notion. In this paper we go a step further by embracing the This is an extended version of the paper Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification, which was presented at the MLDM 2011 conference [27]. N. Tomasˇev (&)  D. Mladenic´ Institute Jozˇef Stefan, Artificial Intelligence Laboratory, Jozˇef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia e-mail: [email protected] D. Mladenic´ e-mail: [email protected] M. Radovanovic´  M. Ivanovic´ Department of Mathematics and Informatics, University of Novi Sad, Trg D. Obradovic´a 4, 21000 Novi Sad, Serbia e-mail: [email protected] M. Ivanovic´ e-mail: [email protected]

soft approach, and propose several fuzzy measures for k-nearest neighbor classification, all based on hubness, which express fuzziness of elements appearing in k-neighborhoods of other points. Experimental evaluation on real data from the UCI repository and the image domain suggests that the fuzzy approach provides a useful measure of confidence in the predicted labels, resulting in improvement over the crisp weighted method, as well as the standard kNN classifier. Keywords Classification  k-nearest neighbor  Fuzzy  Hubs  Curse of dimensionality

1 Introduction High-dimensional data is ubiquitous in modern applications. It arises naturally when dealing with text, images, audio, data streams, medical records, etc. The impact of this high dimensionality is manyfold. It is a well known fact that many machine-learning algorithms are plagued by what is usually termed the curse of dimensionality. This comprises a set of properties that tend to become more pronounced as the dimensionality of data increases. First and foremost is the unavoidable sparsity of data. In highdimensional spaces all data is sparse, meaning that there is not enou