An automatic histogram detection and information extraction from document images



P. H. Anagha¹ · A. Baskar¹
Received: 12 December 2019 / Accepted: 25 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
The histogram is an important data chart that commonly appears in scientific documents. This paper proposes an automatic histogram detection and information extraction methodology based on the Hough line detector and morphological operators. The proposed system comprises three steps: pre-processing, axis detection, and chart pattern extraction. In the pre-processing step, the RGB image of a histogram is converted into a binary image. In the axis detection step, the horizontal axis, the vertical axis, and the title of the histogram are extracted. Here the Hough line detector is applied to detect the horizontal and vertical lines in the image. From the set of detected vertical lines, the line whose two endpoints share the minimum x-coordinate is taken as the vertical axis. Similarly, from the set of detected horizontal lines, the line whose two endpoints share the maximum y-coordinate is taken as the horizontal axis. Relative to the dimensions of the horizontal and vertical axes, the rectangular regions containing the horizontal-axis values and label, the vertical-axis values and label, and the title are extracted. In the final chart pattern extraction step, morphological operations are used to identify the frequency of the data present in the histogram. Verification and validation tests of the proposed system yielded promising results, indicative of an efficient approach for extracting histogram information.

Keywords  Histogram · Hough line detector · Morphological operator · Information · Extraction
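The axis-selection rule described in the abstract can be illustrated with a minimal sketch. It assumes the line detector has already returned segments as `(x1, y1, x2, y2)` tuples in image coordinates with y increasing downward (the layout produced by, for example, a probabilistic Hough transform). The function name `select_axes` and the pixel tolerance `tol` are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# (x1, y1, x2, y2) in image coordinates; y grows downward (assumed convention).
Segment = Tuple[int, int, int, int]


def select_axes(
    segments: List[Segment], tol: int = 2
) -> Tuple[Optional[Segment], Optional[Segment]]:
    """Pick the histogram axes from detected line segments.

    Vertical axis: the vertical segment with the smallest x-coordinate.
    Horizontal axis: the horizontal segment with the largest y-coordinate.
    `tol` (an assumption, in pixels) is how far the two endpoints may
    differ in x (or y) for a segment to still count as vertical
    (or horizontal).
    """
    vertical = [s for s in segments if abs(s[0] - s[2]) <= tol]
    horizontal = [s for s in segments if abs(s[1] - s[3]) <= tol]

    # Leftmost vertical line: minimum x over its endpoints.
    v_axis = min(vertical, key=lambda s: min(s[0], s[2])) if vertical else None
    # Lowest horizontal line in the image: maximum y over its endpoints.
    h_axis = max(horizontal, key=lambda s: max(s[1], s[3])) if horizontal else None
    return v_axis, h_axis


# Illustrative input: two bar edges, the two axes, a grid line, one diagonal.
example_segments = [
    (50, 20, 50, 300),    # candidate vertical axis
    (120, 100, 120, 300), # bar edge
    (50, 300, 400, 300),  # candidate horizontal axis
    (50, 150, 400, 150),  # grid line
    (60, 40, 200, 180),   # diagonal, ignored
]
v_axis, h_axis = select_axes(example_segments)
```

Taking the maximum y for the horizontal axis follows from the image convention that the origin sits at the top-left corner, so the bottom-most line has the largest y value.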

1 Introduction
Histograms represent information in a compact form and are widely used in the scientific assessment of numerical data acquired in any field of research. A typical histogram consists of a title, a horizontal axis, a horizontal axis label, a vertical axis, a vertical axis label, and bars between the horizontal and vertical axes. Robotics is one of the most challenging and interesting fields of research. If a robot is capable of parsing data charts, then it can act as a data scientist. This paper introduces a novel method for the automatic detection of histograms and the extraction of the information they contain.

* A. Baskar
P. H. Anagha
¹ Dept of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India

Al-Zaidy and Giles (2015), Sindhuja and Baskar (2017) and Elzer et al. (2006, 2011) proposed methods for the automatic extraction of data from bar charts. Data values were extracted from the chart using dilation and connected component analysis, and OCR was used to recognise the extracted information. These systems fail to detect charts if the image quality is low. Al-Zaidy et al. (2016