FLPI: An Optimal Algorithm for Document Indexing
1 Department of Information Engineering, Zhejiang Business Technology Institute, Ningbo, P.R. China
{tjw, yqf}@zjbti.net.cn
2 College of Information Science and Engineering, Ningbo University, Ningbo, P.R. China
[email protected]
Abstract. LPI is not efficient in time and memory, which makes it difficult to apply to very large data sets. In this paper, we propose an optimal algorithm called FLPI, which decomposes the LPI problem into a graph embedding problem plus a regularized least squares problem. This modification avoids the eigen-decomposition of dense matrices and significantly reduces both the time and memory cost of the computation. Moreover, with a specifically designed graph in the supervised setting, FLPI only needs to solve the regularized least squares problem, which is a further saving of time and memory. Experimental results on real and synthetic data show that FLPI obtains results similar to or better than those of LPI, and that it is significantly faster.

Keywords: Locality preserving indexing (LPI), Latent semantic indexing (LSI), Document indexing, Dimensionality reduction.
1 Introduction

Recently, Locality Preserving Indexing (LPI) was proposed to discover the discriminant structure of the document space, and it has been shown to have more discriminative power than LSI [1]. However, the computational cost of LPI is very high because it involves eigen-decompositions of two dense matrices, which makes it almost infeasible to apply LPI to very large data sets.

Given a set of documents $\{x_i\}_{i=1}^{m} \subset \mathbb{R}^n$, represented as a term-document matrix $X = [x_1, x_2, \ldots, x_m]$, and supposing the rank of X is r, Latent Semantic Indexing (LSI) decomposes X using the SVD as follows:
$$X = U \Sigma V^T , \qquad (1)$$
where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$ the singular values of X, $U = [u_1, \ldots, u_r]$ and the $u_i$'s are called left singular vectors, and $V = [v_1, \ldots, v_r]$ and the $v_i$'s are called right singular vectors.
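As a concrete illustration of Eqn. (1), the following is a minimal sketch of LSI via the truncated SVD. The function and variable names are ours, not the paper's, and we assume NumPy with documents stored as columns of X.

```python
import numpy as np

def lsi(X, k):
    """Reduce the n x m term-document matrix X to k dimensions (Eqn. (1)).

    Hypothetical helper for illustration; not from the paper.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
    Uk = U[:, :k]        # left singular vectors of the k largest sigma_i
    return Uk.T @ X      # k x m reduced document representations

# Example: 1000 terms, 200 documents, reduced to 50 dimensions.
X = np.abs(np.random.rand(1000, 200))
Z = lsi(X, 50)
print(Z.shape)           # (50, 200)
```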
Given a similarity matrix W, LPI can be obtained by solving the following minimization problem:

$$a^* = \arg\min_{a^T X D X^T a = 1} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( a^T x_i - a^T x_j \right)^2 W_{ij} = \arg\min_{a^T X D X^T a = 1} a^T X L X^T a , \qquad (2)$$
where D is a diagonal matrix whose entries are the column sums of W ($D_{ii} = \sum_j W_{ji}$) and $L = D - W$ is the graph Laplacian [2]. LPI constructs the similarity matrix W as:
$$W_{ij} = \begin{cases} \dfrac{x_i^T x_j}{\|x_i\|\,\|x_j\|}, & \text{if } x_i \in N_p(x_j) \text{ or } x_j \in N_p(x_i), \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (3)$$
where $N_p(x_i)$ is the set of the p nearest neighbors of $x_i$. Thus, the objective function of LPI incurs a heavy penalty if neighboring points $x_i$ and $x_j$ are mapped far apart [5].
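The graph construction of Eqn. (3) is straightforward to sketch in code. The following is a minimal, assumed implementation (NumPy; the names are ours): cosine similarity restricted to p-nearest-neighbor pairs, together with D and the Laplacian $L = D - W$ defined above.

```python
import numpy as np

def lpi_graph(X, p):
    """Build W (Eqn. (3)), D, and L = D - W for the n x m matrix X.

    Hypothetical helper for illustration; not from the paper.
    """
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)  # unit-length columns
    S = Xn.T @ Xn                                      # m x m cosine similarities
    m = S.shape[0]
    W = np.zeros((m, m))
    for i in range(m):
        idx = np.argsort(-S[i])          # neighbors in descending similarity
        idx = idx[idx != i][:p]          # p nearest neighbors, excluding self
        W[i, idx] = S[i, idx]
    W = np.maximum(W, W.T)               # keep W_ij if x_i in N_p(x_j) OR x_j in N_p(x_i)
    D = np.diag(W.sum(axis=0))           # column sums: D_ii = sum_j W_ji
    L = D - W                            # graph Laplacian
    return W, D, L
```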
The basis functions of LPI are the eigenvectors associated with the smallest eigenvalues of the generalized eigen-problem $X L X^T a = \lambda X D X^T a$. Since $L = D - W$, it is easy to check that the minimization problem in Eqn. (2) is equivalent to the following maximization problem:

$$a^* = \arg\max_{a^T X D X^T a = 1} a^T X W X^T a , \qquad (4)$$
and the optimal a's are the eigenvectors associated with the largest eigenvalues of the generalized eigen-problem $X W X^T a = \lambda X D X^T a$.
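To make the direct approach concrete, here is a minimal sketch (assumed, with our own names; not the paper's implementation) that solves the dense generalized eigen-problem of Eqn. (4). This is exactly the expensive dense eigen-decomposition that FLPI is designed to avoid.

```python
import numpy as np
from scipy.linalg import eigh

def lpi_directions(X, W, D, k, reg=1e-6):
    """Return the top-k LPI projection vectors as columns of an n x k matrix.

    Hypothetical helper for illustration; solves X W X^T a = lambda X D X^T a.
    """
    XWXt = X @ W @ X.T
    XDXt = X @ D @ X.T
    n = X.shape[0]
    # A small ridge term keeps X D X^T positive definite when X is rank-deficient.
    vals, vecs = eigh(XWXt, XDXt + reg * np.eye(n))
    order = np.argsort(-vals)            # largest eigenvalues first (Eqn. (4))
    return vecs[:, order[:k]]
```

Note that forming and decomposing the dense n x n matrices $X W X^T$ and $X D X^T$ costs $O(n^3)$ time and $O(n^2)$ memory, which illustrates why LPI does not scale to very large vocabularies.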