Indexing of Textual Databases Based on Lexical Resources: A Case Study for Serbian


1 Faculty of Mining and Geology, University of Belgrade, Belgrade, Serbia
{ranka,ivan.obradovic,olivera.kitanovic}@rgf.bg.ac.rs
2 Faculty of Philology, University of Belgrade, Belgrade, Serbia
[email protected]

Abstract. In this paper we describe an approach to improving information retrieval results for large textual databases by pre-indexing documents using bag-of-words and named entity recognition. The approach was applied to a database of geological projects that the Republic of Serbia has financed for several decades. Each document within this database is described by a summary report consisting of metadata on the geological project, such as title, domain, keywords, abstract, and geographical location. A bag of words was produced from these metadata with the help of morphological dictionaries and transducers, while named entities were recognized using a rule-based system. Both were then used to pre-index documents for information retrieval, where the ranking of retrieved documents was based on several tf-idf based measures. Ranked retrieval results obtained with pre-indexing were compared, using precision-recall curves, to results obtained by information retrieval without pre-indexing, and showed a significant improvement in terms of the mean average precision measure.

1 Introduction

Three basic problems related to Information Retrieval (IR) are the representation of document content, the representation of information needs, and the comparison of these two representations. If the search is performed by scanning the textual documents themselves, no additional representation is required. However, in order to increase efficiency, especially for large collections, a formal surrogate representation of each document is usually formed. As a rule, this representation contains metadata about the document, such as title, abstract, author, and assigned index terms referring to the document content. A surrogate can also be assigned automatically by extracting and selecting specific terms (words) that appear in the document text. To that end, many natural language processing (NLP) methods and techniques are used: sentence boundary detection, tokenization, stemming, tagging, recognition of nominal phrases and named entities and, finally, parsing [7]. Based on these representations, an index of the document collection is formed during the preparatory phase and then used in the search phase.

© Springer International Publishing Switzerland 2015
J. Cardoso et al. (Eds.): KEYWORD 2015, LNCS 9398, pp. 167–181, 2015. DOI: 10.1007/978-3-319-27932-9_15
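The two phases described above, building an index from document surrogates and then searching it, can be illustrated with a minimal sketch. It is not the system described in this paper: the sample surrogates, the field names, and the helpers `tokenize`, `build_index`, and `search` are assumptions made only for illustration. It also uses exact Boolean matching on lowercased word forms, with no stemming, morphological dictionaries, or ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its runs of letters (a crude tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_index(surrogates):
    """Preparatory phase: map each term to the ids of surrogates containing it."""
    index = defaultdict(set)
    for doc_id, fields in surrogates.items():
        for value in fields.values():
            for term in tokenize(value):
                index[term].add(doc_id)
    return index


def search(index, query):
    """Search phase: return ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical surrogates: metadata fields standing in for summary reports.
surrogates = {
    "d1": {"title": "Copper ore exploration", "keywords": "copper, ore"},
    "d2": {"title": "Groundwater survey", "keywords": "water, aquifer"},
    "d3": {"title": "Copper deposits and groundwater", "keywords": "copper, water"},
}

index = build_index(surrogates)
print(sorted(search(index, "copper")))
print(sorted(search(index, "copper groundwater")))
```

Because matching here is exact, a query for "deposit" would not retrieve the surrogate containing "deposits". In a highly inflected language this mismatch between word forms is far more frequent, which is why terms are normally normalized before indexing.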

R. Stanković et al.

Finding and ranking relevant documents on the basis of the index is realized using a model of approximate matching, based on the frequency distribution of terms and documents. The two basic approaches are the vector space model, based on weight coefficients of terms, and the probabilistic model, based on relevance feedback [18]. Serbian belongs to a group of Less-Resourced Languages for which