Handwritten Word Image Categorization with Convolutional Neural Networks and Spatial Pyramid Pooling



1 Scytl Secure Electronic Voting, Barcelona, Spain
{JuanIgnacio.Toledo,Jordi.Cucurull}@scytl.com
2 Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
{afornes,josep}@cvc.uab.es
3 Department of Computer Science, TU Dortmund University, Dortmund, Germany
{sebastian.sudholt,gernot.fink}@tu-dortmund.de

Abstract. The extraction of relevant information from historical document collections is a key step in making these documents available for access and search. The usual approach combines transcription and grammars in order to extract semantically meaningful entities. In this paper, we describe a new method to obtain word categories directly from non-preprocessed handwritten word images. The method can be used to extract information directly, offering an alternative to transcription, and can thus serve as a first step in any kind of syntactic analysis. The approach is based on Convolutional Neural Networks with a Spatial Pyramid Pooling layer to deal with the different shapes of the input images. We performed the experiments on a historical marriage record dataset, obtaining promising results.

Keywords: Document image analysis · Word image categorization · Convolutional neural networks · Named entity detection
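The role of the Spatial Pyramid Pooling layer mentioned in the abstract is to map convolutional feature maps of varying width (word images differ in shape) to a fixed-length vector that a fully connected classifier can consume. The following is a minimal NumPy sketch of that pooling operation; the pyramid levels chosen here are illustrative, not necessarily the configuration used in the paper.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map into a fixed-length vector.

    For each pyramid level n, the H x W plane is divided into an
    n x n grid and each cell is max-pooled over its spatial extent,
    so the output length is C * sum(n * n for n in levels)
    regardless of the input height and width.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Cell boundaries, rounded so the n bins exactly cover H and W.
        ys = np.linspace(0, h, n + 1, dtype=int)
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))  # per-channel max
    return np.concatenate(pooled)

# Word images of different widths yield feature maps of different
# widths, yet the pooled descriptor has identical length:
wide = spatial_pyramid_pool(np.random.rand(64, 8, 40))
narrow = spatial_pyramid_pool(np.random.rand(64, 8, 12))
assert wide.shape == narrow.shape == (64 * (1 + 4 + 16),)
```

This fixed-length property is what lets the network accept non-preprocessed word images without resizing them to a canonical shape.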

1 Introduction

Document Image Analysis and Recognition (DIAR) is the pattern recognition research field devoted to the analysis, recognition, and understanding of document images. Within this field, one of the most challenging tasks is handwriting recognition [3,6], defined as the task of converting the text contained in a document image into a machine-readable format. Indeed, after decades of research, this task is still considered an open problem, especially when dealing with historical manuscripts. The main difficulties are paper degradation, differences in handwriting style across centuries, and old vocabulary and syntax. Generally speaking, handwriting recognition relies on the combination of two models: the optical model and the linguistic model. The former recognizes the visual shape of characters or graphemes, and the latter interprets them in their context based on some structural rules. The linguistic model can range from simple n-grams (probabilities of character or word sequences) to sophisticated syntactic formalisms enriched with semantic information. In this paper we focus on this last concept. Our hypothesis is that, under certain conditions where the text can be roughly described by a grammatical structure, the identification of named entities can boost recognition in a parsing process. Named entity recognition is an information extraction problem consisting in detecting and classifying text terms into pre-defined categories such as names of people, streets, organizations, and dates. It can also be seen as the semantic annotation of text elements.

© Springer International Publishing AG 2016. J.I. Toledo et al., in: A. Robles-Kelly et al. (Eds.), S+SSPR 2016, LNCS 10029, pp. 543–552, 2016. DOI: 10.1007/978-3-319-49055-7_48
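The simplest linguistic model mentioned above, a character n-gram, can be illustrated with a toy bigram estimator. The three-word corpus below is hypothetical and unsmoothed; real recognizers estimate these probabilities from large corpora and apply smoothing.

```python
from collections import Counter

# Toy character-bigram model: P(c2 | c1) estimated by relative frequency.
corpus = ["maria", "marta", "marc"]  # hypothetical training words
bigrams = Counter()
unigrams = Counter()
for word in corpus:
    padded = "^" + word  # '^' marks the start of a word
    for a, b in zip(padded, padded[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def prob(prev, cur):
    """Conditional probability P(cur | prev) from bigram counts."""
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

# The optical model's character hypotheses can be rescored with these
# probabilities before decoding:
assert prob("m", "a") == 1.0      # every 'm' in the corpus precedes 'a'
assert prob("r", "t") == 1 / 3    # 'r' is followed by 'i', 't', or 'c'
```

A full recognizer would combine such probabilities with the optical model's scores, for example in a Viterbi-style search over character sequences.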