Indexing of Textual Databases Based on Lexical Resources: A Case Study for Serbian


1 Faculty of Mining and Geology, University of Belgrade, Belgrade, Serbia
{ranka,ivan.obradovic,olivera.kitanovic}@rgf.bg.ac.rs
2 Faculty of Philology, University of Belgrade, Belgrade, Serbia
[email protected]

Abstract. In this paper we describe an approach to improving information retrieval results for large textual databases by pre-indexing documents using bag-of-words and named entity recognition. The approach was applied to a database of geological projects financed by the Republic of Serbia over several decades. Each document within this database is described by a summary report consisting of metadata on the geological project, such as title, domain, keywords, abstract, and geographical location. A bag of words was produced from these metadata with the help of morphological dictionaries and transducers, while named entities were recognized using a rule-based system. Both were then used to pre-index documents for information retrieval, where the ranking of retrieved documents was based on several tf-idf based measures. Ranked retrieval results obtained with pre-indexing were compared, using a precision-recall curve, to results obtained by information retrieval without pre-indexing, showing a significant improvement in terms of the mean average precision measure.
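As a point of reference for the evaluation measure named above, the following is a minimal sketch of the standard (non-interpolated) definition of mean average precision. It is not taken from the paper: the function names and the toy rankings are assumptions made for illustration, and the paper's actual evaluation setup may differ.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by the number of relevant documents penalizes relevant ones never retrieved.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical toy data: two queries with their ranked results and relevance judgments.
RUNS = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(round(mean_average_precision(RUNS), 4))
```

For the first toy query the relevant documents appear at ranks 1 and 3, giving precisions 1/1 and 2/3; for the second the single relevant document appears at rank 2, giving 1/2.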

1 Introduction

Three basic problems in Information Retrieval (IR) are the representation of document content, the representation of information needs, and the comparison of these two representations. If the search is performed by scanning textual documents directly, no additional representation is required. However, to increase efficiency, especially for large collections, a formal representation (surrogate) of each document is usually formed. As a rule, this representation contains metadata about the document, such as title, abstract, author and assigned index terms referring to document content. A surrogate can also be assigned automatically by extracting and selecting specific terms (words) that appear in the document text. To that end, many natural language processing (NLP) methods and techniques are used: sentence boundary detection, tokenization, stemming, tagging, recognition of nominal phrases and named entities and, finally, parsing [7]. Based on these representations, an index of the document collection is formed during the preparatory phase and is then used in the search phase.

© Springer International Publishing Switzerland 2015. J. Cardoso et al. (Eds.): KEYWORD 2015, LNCS 9398, pp. 167–181, 2015. DOI: 10.1007/978-3-319-27932-9_15
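The two phases described in the introduction, forming an index from document surrogates and then using it for search, can be illustrated with a small inverted index. This is a sketch under stated assumptions, not the system described in the paper: the toy surrogates, the function names, and the plain lowercase tokenizer are all hypothetical, and none of the morphological processing discussed here is applied.

```python
import re
from collections import defaultdict

# Hypothetical toy surrogates: document id -> metadata text (e.g. title and keywords).
SURROGATES = {
    "d1": "Geological survey of copper deposits",
    "d2": "Hydrogeological study of groundwater",
    "d3": "Copper and gold deposits: a survey",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (no stemming or tagging)."""
    return re.findall(r"\w+", text.lower())


def build_index(surrogates):
    """Preparatory phase: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in surrogates.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search phase: return sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


INDEX = build_index(SURROGATES)
print(search(INDEX, "copper deposits"))
```

This sketch uses exact Boolean matching, so a query term in a different inflected form than the indexed one would not match.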

R. Stanković et al.

Finding and ranking relevant documents on the basis of the index is realized using a model of approximate matching, based on the frequency distribution of terms and documents. The two basic approaches are the vector space model, based on term weight coefficients, and the probabilistic model, based on relevance feedback [18]. Serbian belongs to a group of Less-Resourced Languages for which
