Language-based document image retrieval for Trilingual System



ORIGINAL RESEARCH

Umesh D. Dixit¹ · M. S. Shirdhonkar²

Received: 3 February 2019 / Accepted: 7 November 2019 © Bharati Vidyapeeth’s Institute of Computer Applications and Management 2019

Abstract Language-based document image retrieval (LBDIR) is essential in a multi-lingual environment. It makes it easier to access, search and browse the documents written in a particular language. This paper proposes a method for LBDIR using multi-resolution Histogram of Oriented Gradients (HOG) features. These features are obtained by computing HOG on the sub-bands of the Discrete Wavelet Transform (DWT). The Canberra distance is used for matching and retrieval of the documents. The proposed scheme is evaluated on three datasets (Dataset1, Dataset2 and Dataset3) consisting of 1437 document images in the Kannada, Marathi, Telugu, Hindi and English languages. The objective of this work is to provide an efficient LBDIR for the government and non-government organizations of the Karnataka, Maharashtra and Telangana states, in the context of the tri-lingual model adopted in these states. The proposed method achieves an average precision (AP) of 96.2%, 95.4%, 94.6%, 99.4% and 99.6% for Kannada, Marathi, Telugu, Hindi and English language documents, respectively, when retrieving the top 50 documents. The proposed feature extraction scheme gives promising results compared to existing techniques.
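The matching and evaluation steps named in the abstract can be illustrated with a minimal sketch. It assumes that each document image has already been reduced to a fixed-length feature vector (in the paper, multi-resolution HOG features; here, short made-up vectors), and that precision is the fraction of the top-k retrieved documents sharing the query's language. The names `canberra_distance`, `retrieve`, `precision_at_k`, `DATABASE` and `QUERY` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy database: document id -> (language label, feature vector).
# The vectors are invented for illustration; they are NOT real HOG/DWT features.
DATABASE: Dict[str, Tuple[str, List[float]]] = {
    "kan_1": ("Kannada", [0.8, 0.1, 0.4]),
    "kan_2": ("Kannada", [0.7, 0.2, 0.4]),
    "eng_1": ("English", [0.1, 0.9, 0.2]),
    "eng_2": ("English", [0.2, 0.8, 0.1]),
}

# Hypothetical query feature vector, assumed to come from a Kannada document.
QUERY: List[float] = [0.75, 0.15, 0.4]


def canberra_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Canberra distance: sum over i of |x_i - y_i| / (|x_i| + |y_i|)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0.0:  # a 0/0 term is treated as contributing nothing
            total += abs(a - b) / denom
    return total


def retrieve(query: Sequence[float],
             database: Dict[str, Tuple[str, List[float]]],
             k: int) -> List[str]:
    """Return the ids of the k database documents closest to the query."""
    ranked = sorted(
        database,
        key=lambda doc_id: canberra_distance(query, database[doc_id][1]),
    )
    return ranked[:k]


def precision_at_k(query_label: str,
                   retrieved: Sequence[str],
                   database: Dict[str, Tuple[str, List[float]]]) -> float:
    """Fraction of retrieved documents whose language matches the query's."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if database[doc_id][0] == query_label)
    return hits / len(retrieved)


top = retrieve(QUERY, DATABASE, k=2)
print(top, precision_at_k("Kannada", top, DATABASE))
```

Averaging `precision_at_k` over many queries of one language would give a per-language figure comparable in spirit to the AP values reported above, under the stated assumption about how precision is defined.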

Umesh D. Dixit [email protected]
M. S. Shirdhonkar [email protected]

¹ Department of Electronics and Communication Engineering, B.L.D.E.A’s V. P. Dr. P. G. Halakatti College of Engineering and Technology, Vijayapur 586103, India

² Department of Computer Science and Engineering, B.L.D.E.A’s V. P. Dr. P. G. Halakatti College of Engineering and Technology, Vijayapur 586103, India

Keywords Document image retrieval · HOG · DWT · Similarity metric · Canberra distance

Abbreviations
HOG Histogram of Oriented Gradients
DWT Discrete Wavelet Transform
DCT Discrete Cosine Transform
LBDIR Language-Based Document Image Retrieval
LBP Local Binary Pattern
RI-LBP Rotation Invariant Local Binary Pattern
PCA Principal Component Analysis
P Precision
AP Average Precision
SVM Support Vector Machine
KNN K-Nearest Neighbor

1 Introduction

The rapid growth of technology has led to the digitization of documents in almost every part of the world. Many techniques have been developed for the retrieval of documents, such as logo-based, signature-based, layout-based and face-based retrieval. However, these techniques are independent of the language content of the documents. When a repository includes documents in different languages, there is a need for an LBDIR system. India is a multi-lingual country and has 18 regional languages. The officially accepted languages of India are Assamese, Bangla, English, Gujarati, Hindi, Konkani, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Rajasthani, Sanskrit, Tamil, Telugu and Urdu [1]. Almost every state of India has adopted a three-language policy: A regional lang