XNN Graph
1 University of Eastern Finland, Joensuu, Finland
[email protected]
2 Ningbo University, Ningbo, China
Abstract. The k-nearest neighbor graph (KNN) is a widely used tool in several pattern recognition applications, but it has drawbacks. Firstly, the choice of k can have a significant impact on the result because it has to be fixed beforehand and does not adapt to the local density of the neighborhood. Secondly, KNN does not guarantee connectivity of the graph. We introduce an alternative data structure called XNN, which has a variable number of neighbors and guarantees connectivity. We demonstrate that the graph provides an improvement over KNN in several applications including clustering, classification and data analysis.

Keywords: KNN · Neighborhood graph · Data modeling
1 Introduction

Neighborhood graphs are widely used in data mining, machine learning and computer vision to model the data. A few application examples are listed below:

• KNN classifier
• Manifold learning
• 3D object matching
• Clustering
• Outlier detection
• Traveling salesman problem
• Word similarity in web mining
Two popular definitions are the ɛ-neighborhood and the k-nearest neighbor (KNN) graph [1]. In the first one, two points are neighbors if they are within a distance ɛ of each other. The KNN neighborhood of a point is defined as its k nearest other data points in the data space. In the corresponding graph, all neighboring points are connected. The ɛ-neighborhood graph is undirected, whereas the KNN graph is directed.

The first problem, both in KNN and in the ɛ-neighborhood, is how to select the parameters ɛ and k. The larger the neighborhood, the better the local structures can be captured, but the more complex the graph and the higher the processing time. An upper limit is when k equals the size of the data (k = N), which leads to a complete graph. With a fixed value of k, there can be unnecessarily long edges in sparse areas that do not capture anything essential about the local structure.
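To make the two definitions concrete, the following is a minimal brute-force sketch (our own illustration, not from the paper; the function names, the toy point set and the Euclidean metric are assumptions) that builds both graphs for a small data set:

```python
import numpy as np

def eps_graph(points, eps):
    """Undirected eps-neighborhood graph: i ~ j iff dist(i, j) <= eps."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i, j] <= eps}

def knn_graph(points, k):
    """Directed KNN graph: edge i -> j iff j is among the k nearest
    other points of i."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    edges = set()
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:   # indices of the k smallest distances
            edges.add((i, int(j)))
    return edges

# Toy data: two dense groups far apart.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
print(sorted(knn_graph(pts, k=2)))      # directed edges stay within each group
print(sorted(eps_graph(pts, eps=1.5)))  # undirected edges within distance 1.5
```

Note that with k = 2 every edge stays inside its own group, which is exactly the connectivity failure discussed next.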
Fig. 1. Example part of 3NN graph where connectivity is broken.
The second problem is that these definitions do not guarantee connectivity of the graph. In Fig. 1, an isolated component is created, which leads to a wrong clustering result. With this data the problem can be solved by setting k = 4, at the cost of increased complexity. However, with higher-dimensional data the value must usually be set to k = 20 or even higher, which overwhelms the computation. In general, there is no good way to set k automatically, and a fixed k does not adapt to the local density. According to [2], k has to be chosen rather high, of the order of N rather than log N. In [3] the neighborhood was reduced by eliminating edges to nodes that could be reached via a shorter detour through a third node, but this does not solve the connectivity problem and requires a large k to start with. Despite these drawbacks, KNN has become popular – mainly
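A quick way to detect this failure in practice is to treat the KNN edges as undirected and count connected components. The sketch below (again our own illustration, not the paper's method) does this with union-find, reusing knn_graph and pts from the previous sketch:

```python
def components(n, edges):
    """Count connected components when the edges are treated as undirected."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # merge the two components
    return len({find(i) for i in range(n)})

# On the toy data, k = 2 leaves two components; raising k to 3
# restores connectivity, at the cost of longer cross-group edges.
for k in (2, 3):
    print(k, components(len(pts), knn_graph(pts, k)))
```

This also illustrates the trade-off described above: the k that repairs connectivity depends on the data, and increasing it adds edges everywhere, not only where they are needed.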