FLPI: An Optimal Algorithm for Document Indexing
1 Department of Information Engineering, Zhejiang Business Technology Institute, Ningbo, P.R. China
{tjw, yqf}@zjbti.net.cn
2 College of Information Science and Engineering, Ningbo University, Ningbo, P.R. China
[email protected]
Abstract. LPI is not efficient in time and memory, which makes it difficult to apply to very large data sets. In this paper, we propose an optimal algorithm called FLPI, which decomposes the LPI problem into a graph embedding problem plus a regularized least squares problem. This modification avoids the eigen-decomposition of dense matrices and significantly reduces both the time and memory cost of the computation. Moreover, with a specifically designed graph in the supervised setting, FLPI only needs to solve the regularized least squares problem, which is a further saving of time and memory. Experimental results on real and synthetic data show that FLPI obtains results similar to or better than those of LPI, and that it is significantly faster.

Keywords: Locality preserving indexing (LPI), Latent semantic indexing (LSI), Document indexing, Dimensionality reduction.
1 Introduction

Recently, Locality Preserving Indexing (LPI) was proposed to discover the discriminant structure of the document space, and it has been shown to have more discriminative power than LSI [1]. However, the computational cost of LPI is very high because it involves eigen-decompositions of two dense matrices, which makes it almost infeasible to apply LPI to very large data sets.

Given a set of documents $\{x_i\}_{i=1}^{m} \subset \mathbb{R}^n$, represented as a term-document matrix $X = [x_1, x_2, \ldots, x_m]$, and supposing the rank of X is r, Latent Semantic Indexing (LSI) decomposes X using the SVD as follows:
$$X = U \Sigma V^T , \qquad (1)$$
where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$ the singular values of X, $U = [u_1, \ldots, u_r]$ and the $u_i$'s are called left singular vectors, and $V = [v_1, \ldots, v_r]$ and the $v_i$'s are called right singular vectors.
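As a concrete illustration of Eqn. (1), the following is a minimal sketch of LSI via the truncated SVD. The function and variable names are ours, not the paper's, and we assume NumPy with documents stored as columns of X.

```python
import numpy as np

def lsi(X, k):
    """Reduce the n x m term-document matrix X to k dimensions (Eqn. (1)).

    Hypothetical helper for illustration; not from the paper.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T
    Uk = U[:, :k]        # left singular vectors of the k largest sigma_i
    return Uk.T @ X      # k x m reduced document representations

# Example: 1000 terms, 200 documents, reduced to 50 dimensions.
X = np.abs(np.random.rand(1000, 200))
Z = lsi(X, 50)
print(Z.shape)           # (50, 200)
```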
Given a similarity matrix W, LPI can be obtained by solving the following minimization problem:

$$a^* = \arg\min_{a^T X D X^T a = 1} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( a^T x_i - a^T x_j \right)^2 W_{ij} = \arg\min_{a^T X D X^T a = 1} a^T X L X^T a , \qquad (2)$$
where D is a diagonal matrix whose entries are the column sums of W ($D_{ii} = \sum_j W_{ji}$) and $L = D - W$ is the graph Laplacian [2]. LPI constructs the similarity matrix W as:
$$W_{ij} = \begin{cases} \dfrac{x_i^T x_j}{\|x_i\|\,\|x_j\|}, & \text{if } x_i \in N_p(x_j) \text{ or } x_j \in N_p(x_i), \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (3)$$
where $N_p(x_i)$ is the set of the p nearest neighbors of $x_i$. Thus, the objective function of LPI incurs a heavy penalty if neighboring points $x_i$ and $x_j$ are mapped far apart [5].
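The graph construction of Eqn. (3) is straightforward to sketch in code. The following is a minimal, assumed implementation (NumPy; the names are ours): cosine similarity restricted to p-nearest-neighbor pairs, together with D and the Laplacian $L = D - W$ defined above.

```python
import numpy as np

def lpi_graph(X, p):
    """Build W (Eqn. (3)), D, and L = D - W for the n x m matrix X.

    Hypothetical helper for illustration; not from the paper.
    """
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)  # unit-length columns
    S = Xn.T @ Xn                                      # m x m cosine similarities
    m = S.shape[0]
    W = np.zeros((m, m))
    for i in range(m):
        idx = np.argsort(-S[i])          # neighbors in descending similarity
        idx = idx[idx != i][:p]          # p nearest neighbors, excluding self
        W[i, idx] = S[i, idx]
    W = np.maximum(W, W.T)               # keep W_ij if x_i in N_p(x_j) OR x_j in N_p(x_i)
    D = np.diag(W.sum(axis=0))           # column sums: D_ii = sum_j W_ji
    L = D - W                            # graph Laplacian
    return W, D, L
```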
The basis functions of LPI are the eigenvectors associated with the smallest eigenvalues of the generalized eigen-problem $X L X^T a = \lambda X D X^T a$. Since $L = D - W$, it is easy to check that the minimization problem in Eqn. (2) is equivalent to the following maximization problem:

$$a^* = \arg\max_{a^T X D X^T a = 1} a^T X W X^T a , \qquad (4)$$
and the optimal a's are the eigenvectors associated with the largest eigenvalues of the generalized eigen-problem $X W X^T a = \lambda X D X^T a$.
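To make the direct approach concrete, here is a minimal sketch (assumed, with our own names; not the paper's implementation) that solves the dense generalized eigen-problem of Eqn. (4). This is exactly the expensive dense eigen-decomposition that FLPI is designed to avoid.

```python
import numpy as np
from scipy.linalg import eigh

def lpi_directions(X, W, D, k, reg=1e-6):
    """Return the top-k LPI projection vectors as columns of an n x k matrix.

    Hypothetical helper for illustration; solves X W X^T a = lambda X D X^T a.
    """
    XWXt = X @ W @ X.T
    XDXt = X @ D @ X.T
    n = X.shape[0]
    # A small ridge term keeps X D X^T positive definite when X is rank-deficient.
    vals, vecs = eigh(XWXt, XDXt + reg * np.eye(n))
    order = np.argsort(-vals)            # largest eigenvalues first (Eqn. (4))
    return vecs[:, order[:k]]
```

Note that forming and decomposing the dense n x n matrices $X W X^T$ and $X D X^T$ costs $O(n^3)$ time and $O(n^2)$ memory, which illustrates why LPI does not scale to very large vocabularies.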