ARDIS: a Swedish historical handwritten digit dataset


IAPR-MEDPRAI

Huseyin Kusetogullari1 • Amir Yavariabdi2 • Abbas Cheddad1 • Håkan Grahn1 • Johan Hall3

Received: 22 October 2018 / Accepted: 19 March 2019 © The Author(s) 2019

Abstract This paper introduces a new image-based handwritten historical digit dataset named Arkiv Digital Sweden (ARDIS). The images in the ARDIS dataset are extracted from 15,000 Swedish church records, which were written by different priests with various handwriting styles in the nineteenth and twentieth centuries. The constructed dataset consists of three single-digit datasets and one digit string dataset. The digit string dataset includes 10,000 samples in red–green–blue color space, whereas the other datasets contain 7600 single-digit images in different color spaces. An extensive analysis of machine learning methods on several digit datasets is carried out. Additionally, the correlation between ARDIS and the existing digit datasets Modified National Institute of Standards and Technology (MNIST) and US Postal Service (USPS) is investigated. Experimental results show that machine learning algorithms, including deep learning methods, achieve low recognition accuracy when trained on existing datasets and tested on the ARDIS dataset. In particular, a convolutional neural network trained on MNIST and USPS and tested on ARDIS provides the highest accuracies, 58.80% and 35.44%, respectively. Consequently, the results reveal that machine learning methods trained on existing datasets have difficulty recognizing digits in our dataset, which indicates that the ARDIS dataset has unique characteristics. This dataset is publicly available for the research community to further advance handwritten digit recognition algorithms.

Keywords Handwritten digit recognition • ARDIS dataset • Machine learning methods • Benchmark

Corresponding author: Abbas Cheddad, [email protected]
Huseyin Kusetogullari, [email protected]; [email protected]
Amir Yavariabdi, [email protected]
Håkan Grahn, [email protected]
Johan Hall, [email protected]

1 Department of Computer Science and Engineering, Blekinge Institute of Technology, 37141 Karlskrona, Sweden
2 Department of Mechatronics Engineering, KTO Karatay University, Konya, Turkey
3 Arkiv Digital, Växjö, Sweden

1 Introduction

Recently, the digitization of handwritten documents has become increasingly important for protecting and storing data more efficiently. The growing volume of digitized handwritten documents raises new challenges and problems, which has led to the development of many automated, computerized analysis systems. Generally, the developed frameworks have been used to solve various problems such as character recognition, identity prediction, digit segmentation and recognition, document binarization, automatic analysis of birth, marriage and death records, and many others [1–5]. Among them, this paper focuses on the handwritten digit recognition problem. In the last three