Investigating the Optimise k-Dimensions and Threshold Values of Latent Semantic Indexing Retrieval Performance for Small Malay Language Corpus

Roslan Sadjirin, Noli Maishara Nordin, Mohd Ikhsan Md Raus and Zulazeze Sahri

R. Sadjirin (✉) · M.I. Md Raus · Z. Sahri
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Jengka, Pahang, Malaysia
e-mail: [email protected]

M.I. Md Raus
e-mail: [email protected]

Z. Sahri
e-mail: [email protected]

N.M. Nordin
Academy of Language Studies, Universiti Teknologi MARA, Jengka, Pahang, Malaysia
e-mail: [email protected]

© Springer Science+Business Media Singapore 2016
N.A. Yacob et al. (eds.), Regional Conference on Science, Technology and Social Sciences (RCSTSS 2014), DOI 10.1007/978-981-10-0534-3_31

Abstract Presenting users with relevant feedback is the core aim of information retrieval (IR). Because the simple exact term-matching technique returns poor relevance feedback, latent semantic indexing (LSI) based IR was introduced to overcome this drawback and improve retrieval effectiveness. In other words, LSI-based IR aims at satisfying users rather than merely satisfying a given query. However, developing an LSI-based information retrieval application involves parameters that must be chosen carefully in order to produce relevant feedback and optimise precision and recall in the retrieval process. This paper therefore investigates two important parameters that characterise retrieval performance: the optimal k-dimension used to represent terms and documents in the corpus, and the optimal threshold value at which documents are accepted, judged and returned as relevant for a given term query. A small Malay corpus comprising 1395 Malay-language documents and terms was used as the test collection. The analyses suggest that effective retrieval performance, which both satisfies and balances precision and recall, is obtained with k = 4 and a threshold value of ε = 0.8. The study helps software developers, particularly IR application developers, in designing and choosing the optimal values of the k-dimension and the threshold in a search engine.

Keywords Information retrieval · Latent semantic analysis · Singular value decomposition · Optimisation
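The role of the threshold ε described in the abstract can be illustrated with a minimal sketch. It assumes that documents and the query have already been projected into a reduced k-dimensional space and that relevance is scored by cosine similarity; the abstract does not name the similarity measure, so that choice is an assumption. The vectors, document identifiers and relevance judgements below are invented toy data (with k = 2 for readability), not values from the study, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, epsilon):
    """Return the ids of documents whose similarity to the query is at least epsilon."""
    return {doc_id for doc_id, vec in doc_vecs.items()
            if cosine(query_vec, vec) >= epsilon}


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a set of relevant ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (assumed): documents already reduced to k = 2 dimensions.
doc_vecs = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.3],
    "d3": [0.6, 0.8],
    "d4": [0.0, 1.0],
    "d5": [0.7, 0.7],
}
query_vec = [1.0, 0.0]
relevant = {"d1", "d2", "d3"}  # assumed human relevance judgements

for epsilon in (0.5, 0.8):
    retrieved = retrieve(query_vec, doc_vecs, epsilon)
    precision, recall = precision_recall(retrieved, relevant)
    print(f"epsilon={epsilon}: retrieved={sorted(retrieved)} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting, raising ε admits fewer documents, which tends to raise precision and lower recall. That trade-off is the balance the threshold is chosen to strike.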

1 Introduction

Information retrieval (IR) is the area of computer science concerned with finding documents, within a collection of stored documents, that are relevant to a user's need for information (Russell and Norvig 2010). The retrieved documents aim to satisfy the user's information need, which is usually expressed in natural language (Baeza-Yates 2004; Ricardo and Berthier 2011). The best-known examples of information retrieval systems are search engines on the World Wide Web: a Web user can type a natural-language query into a search engine and see a list of relevant pages (Russell and Norvig 2010). Therefore, the representation and organisation of the