Hand pose aware multimodal isolated sign language recognition


Razieh Rastgoo 1 & Kourosh Kiani 1 & Sergio Escalera 2

Received: 21 March 2020 / Revised: 9 July 2020 / Accepted: 21 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Isolated hand sign language recognition from video is a challenging research area in computer vision. Key challenges include hand occlusion, fast hand movement, illumination changes, and background complexity. Although most state-of-the-art results in the field have been achieved with deep learning-based models, these challenges are not completely solved. In this paper, we propose a hand pose aware model for isolated hand sign language recognition that applies deep learning to two input modalities, RGB and depth videos. Four spatial feature types (pixel-level, flow, deep hand, and hand pose features), fused from both visual modalities, are input to an LSTM for temporal sign recognition. We use Optical Flow (OF) for flow information in RGB video inputs and Scene Flow (SF) for depth video inputs. By including hand pose features, we show a consistent performance improvement of the sign language recognition model. To the best of our knowledge, this is the first time that these discriminative spatiotemporal features, benefiting from hand pose estimation features and multimodal inputs, are fused for isolated hand sign language recognition. We perform a step-by-step analysis of the impact on recognition performance of the hand pose features, different combinations of the spatial features, and different recurrent models, especially LSTM and GRU. Results on four public datasets confirm that the proposed model outperforms the current state-of-the-art models on the Montalbano II, MSR Daily Activity 3D, and CAD-60 datasets, with relative accuracy improvements of 1.64%, 6.5%, and 7.6%, respectively. Furthermore, our model obtains competitive results on the isoGD dataset, only 0.22% below the current state-of-the-art model.

Keywords Sign language · Deep learning · Multimodal · Hand pose estimation · Scene flow
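As an illustration of the fusion step summarized in the abstract, the following minimal sketch builds a per-frame feature sequence from the four spatial feature types of both modalities. The abstract does not specify the fusion operator, so plain concatenation is an assumption here; the names `FEATURE_TYPES`, `fuse_frame`, and `fuse_video` are hypothetical and do not come from the paper.

```python
from typing import Dict, List

# One frame: feature-type name -> feature vector (hypothetical representation).
Frame = Dict[str, List[float]]

# Hypothetical keys for the four spatial feature types named in the abstract.
FEATURE_TYPES = ("pixel", "flow", "deep_hand", "hand_pose")


def fuse_frame(rgb: Frame, depth: Frame) -> List[float]:
    """Concatenate the four feature vectors of both modalities for one frame.

    Concatenation is an assumed fusion operator, not the paper's stated method.
    """
    fused: List[float] = []
    for modality in (rgb, depth):
        for name in FEATURE_TYPES:
            fused.extend(modality[name])
    return fused


def fuse_video(rgb_frames: List[Frame], depth_frames: List[Frame]) -> List[List[float]]:
    """Build the frame-by-frame sequence a recurrent model (LSTM or GRU) would consume."""
    if len(rgb_frames) != len(depth_frames):
        raise ValueError("RGB and depth videos must have the same number of frames")
    return [fuse_frame(r, d) for r, d in zip(rgb_frames, depth_frames)]
```

Under these assumptions, each element of the returned list is one time step, so the sequence length equals the number of frames and the vector length equals the summed dimensions of all eight feature vectors.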

* Kourosh Kiani [email protected]
Extended author information available on the last page of the article

Multimedia Tools and Applications

1 Introduction

While most people communicate using natural language forms such as body language, hand gestures, facial expressions, speech, writing, and lip motion, sign language is a distinct language that enables communication between people with hearing or speech disabilities and the rest of society. An efficient model that facilitates this communication can have a significant impact on the social life of people with hearing or speech disabilities [7]. Sign language recognition, especially hand sign recognition, is not a new computer vision challenge. Different models have been suggested to improve both the accuracy and the speed of recognition in this area [43]