XNN Graph

K-nearest neighbor graph (KNN) is a widely used tool in several pattern recognition applications but it has drawbacks. Firstly, the choice of k can have significant impact on the result because it has to be fixed beforehand, and it does not adapt to the l

  • PDF / 1,174,834 Bytes
  • 11 Pages / 439.37 x 666.142 pts Page_size
  • 52 Downloads / 176 Views

DOWNLOAD

REPORT


University of Eastern Finland, Joensuu, Finland [email protected] 2 Ningbo University, Ningbo, China

Abstract. K-nearest neighbor graph (KNN) is a widely used tool in several pattern recognition applications but it has drawbacks. Firstly, the choice of k can have significant impact on the result because it has to be fixed beforehand, and it does not adapt to the local density of the neighborhood. Secondly, KNN does not guarantee connectivity of the graph. We introduce an alternative data structure called XNN, which has variable number of neighbors and guarantees connectivity. We demonstrate that the graph provides improvement over KNN in several applications including clustering, classification and data analysis. Keywords: KNN

 Neighborhood graph  Data modeling

1 Introduction Neighborhood graphs are widely used in data mining, machine learning and computer vision to model the data. Few applications examples are listed below. • • • • • • •

KNN classifier Manifold learning 3D object matching Clustering Outlier detection Traveling salesman problem Word similarity in web mining

Two popular definitions are ɛ-neighborhood and k–nearest neighbor (KNN) graph [1]. In the first one, two points are neighbors if they are within a distance ɛ of each other. KNN neighborhood of a point is defined as its k nearest other data points in the data space. In the corresponding graph, all neighboring points are connected. The ɛ–neighborhood graph is undirected whereas KNN is directed graph. The first problem, both in KNN and also ɛ-neighborhood, is how to select parameters ɛ and k. The larger the neighborhood, the better the local structures can be captured but the more complex the graph and the higher the processing time. An upper limit is when k equals to the size of the data (k = N), which leads to a complete graph. Having fixed value of k, there can be unnecessary long edges in sparse areas that do not capture anything essential about the local structure. The second problem is that these definitions do not guarantee connectivity of the graph. In Fig. 1, an isolated component is created, which leads to wrong clustering © Springer International Publishing AG 2016 A. Robles-Kelly et al. (Eds.): S+SSPR 2016, LNCS 10029, pp. 207–217, 2016. DOI: 10.1007/978-3-319-49055-7_19

208

P. Fränti et al.

Isolated component

Fig. 1. Example part of 3NN graph where connectivity is broken.

result. With this data the problem can be solved by setting k = 4 at the cost of increased complexity. However, with higher dimensional data the value must usually be set to k = 20 or even higher. This overwhelms the computation. In general, there is no good way to set k automatically, and it does not adapt to the local density. According to [2], k has to be chosen rather high, order of N rather than logN. In [3] this neighborhood was reduced by eliminating edges to nodes that were reached via alternative shorter detour via a third node. But it does not solve the connectivity and requires large k to start with. Despite these drawbacks, KNN has become popular – mainly