An adaptive text-line extraction algorithm for printed Arabic documents with diacritics


Khader Mohammad¹ · Aziz Qaroush¹ · Mahdi Washha¹ · Sos Agaian² · Iyad Tumar¹
Received: 4 March 2020 / Revised: 23 July 2020 / Accepted: 26 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
The performance of document text recognition depends on text-line segmentation algorithms, which in turn rely heavily on the language, the author's writing style, the pen type, and the document quality. In this paper, we present a novel unsupervised text-line segmentation algorithm for printed Arabic documents with and without diacritics. The approach applies a projection profile together with connected components iteratively to detect text lines. Its primary benefits are that (i) it is not threshold dependent, (ii) it does not require a training phase for threshold selection, and (iii) it is robust to page rotation and to variation in font type, size, and style, for documents both with and without diacritics. Extensive computational simulations on a manually collected dataset demonstrate the efficiency of the proposed scheme compared with several baseline and state-of-the-art methods, including the Voronoi, X-Y Cut, Docstrum, Smearing, and Seam-carving methods. A computational time analysis is also presented.

Keywords Arabic character recognition · Line segmentation · Baseline · Diacritics
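To make the term "projection profile" concrete, the following is a minimal sketch of the classical horizontal projection profile on a binarized page. It is not the authors' algorithm: it omits the connected-component analysis and the iterative refinement described in the abstract, and the names `horizontal_projection` and `split_lines`, as well as the toy page, are hypothetical and introduced only for illustration.

```python
def horizontal_projection(binary):
    """Count ink pixels (value 1) in each row of a binary image.

    `binary` is assumed to be a list of rows, each a list of 0/1 values.
    """
    return [sum(row) for row in binary]


def split_lines(binary):
    """Return (start_row, end_row) pairs, end exclusive, one per run of
    consecutive rows that contain at least one ink pixel."""
    profile = horizontal_projection(binary)
    lines = []
    start = None
    for row, count in enumerate(profile):
        if count > 0 and start is None:
            start = row                      # a new band of ink begins
        elif count == 0 and start is not None:
            lines.append((start, row))       # the band ended on the previous row
            start = None
    if start is not None:                    # band touches the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy 10x8 page with two bands of ink (illustrative data, not from the paper).
page = [[0] * 8 for _ in range(10)]
for r in (1, 2):
    for c in range(2, 6):
        page[r][c] = 1
for r in (5, 6, 7):
    for c in range(1, 7):
        page[r][c] = 1

print(split_lines(page))  # [(1, 3), (5, 8)]
```

A plain profile like this can place diacritics that sit above or below the main text in their own thin bands, and it assumes the page is not skewed, which is why it serves only as a baseline illustration here.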

¹ Department of Electrical and Computer Engineering, Birzeit University, Birzeit, Palestine

² College of Staten Island, The City University of New York, New York, NY, USA

Multimedia Tools and Applications

1 Introduction
Optical Character Recognition (OCR) is an automated process by which text present in a digital image is extracted and converted to editable text [37]. In the literature, OCR systems operate in one of two modes, on-line or off-line [8]. On-line OCR systems extract a set of pre-defined features (e.g. drawing speed and curvature tracking) while the user is writing. These systems are widely adopted in smartphones and tablets. Off-line systems, on the other hand, recognize page segments, lines, words, and characters in stored scanned document images. Figure 1 shows the stages of an off-line OCR system [14]. Image acquisition is the first stage, which acquires document images using either scanners or digital cameras. Then, a set of preprocessing methods is applied to the scanned document images to handle common problems that appear after scanning, such as noise, line skew, and text slant. The preprocessed document images are then passed to the segmentation stage, in which individual segments such as paragraphs, lines, words, and characters or sub-characters are extracted. The feature extraction stage analyzes each segmented component by extracting a set of discriminative featu
