Script Identification Based on HSV Features


School of Information Science and Engineering, Xinjiang University, Urumqi, 830046, Xinjiang, China [email protected]
Network and Information Center, Xinjiang University, Urumqi, 830046, Xinjiang, China

Abstract. Many similarly shaped scripts are in use around the world today. Identifying scripts whose characters have similar shapes is one of the difficulties in the script identification field, and it needs to be resolved. However, there are few reports on identifying the scripts of Central Asian countries and Chinese minorities, which involves distinguishing similar scripts. In this paper, a multi-script database was established, comprising 2200 plain document images at different resolutions in 11 scripts: English, Chinese, Arabic, Russian, Uyghur, Mongolian, Tibetan, Turkish, Kyrgyz, Uzbek and Tajik. HSV features were then extracted from each whole-page image and classified with a BP neural network classifier. In experiments on this dataset, our system achieved an average identification rate of 88.14 % and a highest identification rate of 99.0 %. The experimental results indicate that HSV features are effective for identifying these scripts.

Keywords: Script identification · HSV features · BP neural network

1 Introduction

Script identification, that is, determining which script (and hence which candidate languages) a text is written in, is a text category identification task [1–3]. It is needed because regions or ethnic groups that use the same script may speak different languages. In recent years, automatic script identification has become increasingly popular as a front-end stage of OCR. With the development of computer technology, information processing for minority languages is gradually becoming necessary work. In this study, text documents in different scripts were converted into images, and the images were then processed with digital image processing techniques. Since our aim is to classify text document images by script, script identification can be treated as a typical pattern recognition problem. Consequently, any script identification system has the same structure as a pattern recognition system. Script identification generally consists of several stages: document image acquisition, image pre-processing, feature extraction and classification. Among these, the content and method of feature extraction are particularly important.

© Springer Nature Singapore Pte Ltd. 2016
T. Tan et al. (Eds.): CCPR 2016, Part II, CCIS 663, pp. 588–597, 2016. DOI: 10.1007/978-981-10-3005-5_48
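To make the feature extraction stage described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "HSV features" refers to the hue-saturation-value colour space and that a whole page is summarized by a normalized joint histogram over quantized H, S and V values. The function name `hsv_histogram` and the bin counts (8, 3, 3) are hypothetical choices made only for illustration.

```python
import colorsys


def hsv_histogram(pixels, h_bins=8, s_bins=3, v_bins=3):
    """Return a normalized joint H-S-V histogram (illustrative assumption).

    pixels: iterable of (r, g, b) tuples with components in 0..255.
    The bin counts are hypothetical, not taken from the paper.
    """
    hist = [0.0] * (h_bins * s_bins * v_bins)
    count = 0
    for r, g, b in pixels:
        # colorsys expects and returns components in the range [0, 1].
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Quantize each component; clamp the value 1.0 into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1.0
        count += 1
    if count == 0:
        return hist  # empty page: all-zero feature vector
    return [x / count for x in hist]


# Toy "page": three white background pixels and one black ink pixel.
page = [(255, 255, 255)] * 3 + [(0, 0, 0)]
features = hsv_histogram(page)
print(len(features))  # 72 values with the default bin counts
```

Under these assumptions, each page image yields one fixed-length vector, which could then be passed to the classification stage (a BP neural network in this paper).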

The earliest script identification work dealt with English and Latin [4], and research then gradually turned to distinguishing East Asian scripts from Latin scripts. Spitz [5] developed an approach for classifying Han-based scripts (Chinese, Japanese and Korean) and Latin-based scripts. In this method, Han-based scripts are identified by analyzing the distribution of optical density in the text images, and Latin-based languages used a tech‐