Recognition of E-Born PDF Including Mathematical Formulas


1 Institute of Mathematics for Industry, Kyushu University, 744, Motooka, Nishi-ku, Fukuoka 819-0395, Japan
[email protected]
2 Junior College Funabashi Campus, Nihon University, 7-24-1 Narashinodai, Funabashi, Chiba 274-8501, Japan
[email protected]

Abstract. We have developed a new method to recognize STEM content in “e-born PDF,” that is, PDF produced originally from an electronic file such as a Microsoft Word document or a LaTeX source. Character information (the character code, the font type and the coordinates on a page) extracted directly from the document is combined with the analysis technologies of math OCR. Compared with ordinary image-based OCR approaches, this remarkably improves the recognition rate for STEM content in e-born PDF. The method is implemented in our math OCR system, InftyReader.

Keywords: STEM · OCR · E-born PDF · Accessibility

1 Introduction

We believe that one of the most serious problems with digitized STEM (science, technology, engineering and mathematics) content, which is provided as PDF in most cases, is its poor accessibility. PDF is commonly used not only on the web but also for the exchange of STEM content among researchers and in various educational fields. To guarantee accessibility, publishers increasingly provide print-disabled customers with a book in PDF as an alternative medium to the printed one. In many cases, print-disabled people use OCR (optical character recognition) software to read such PDF; however, that is not always successful. There have been many studies on computerized recognition of scanned document images or PDF including mathematical formulas, aimed at information retrieval, improved accessibility and so on [1]. Our research group has also been developing an OCR system for mathematical documents since the late 1990s and has already put the software “InftyReader” to practical use [2–4]. It should be pointed out, however, that all of these have so far been so-called “image-based” approaches; that is, the PDF is first converted into image files, and the OCR process is then applied to them. As long as processing is entirely image-based, a certain percentage of recognition errors is unavoidable.

© Springer International Publishing Switzerland 2016. K. Miesenberger et al. (Eds.): ICCHP 2016, Part I, LNCS 9758, pp. 35–42, 2016. DOI: 10.1007/978-3-319-41264-1_5

M. Suzuki and K. Yamaguchi

To make the situation clearer, we classify PDF into two types. In this paper, we call a PDF “e-born PDF” if it is produced originally from an electronic file such as a document in Microsoft Word, LaTeX, Adobe InDesign, etc. We refer to the other type as “image PDF,” which is usually made by scanning and contains only images. When the two are viewed at regular size, there may be no significant difference between them. However, while the quality of character images in the latter commonly degrades when zooming in, it remains fine in the former no matter how much the characters are magnified