Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition


REGULAR PAPER

Pratishtha Verma¹ · Animesh Sah¹ · Rajeev Srivastava¹
Received: 23 January 2020 / Accepted: 13 July 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Deep learning techniques have achieved great success in human activity recognition (HAR). In this paper, we propose a HAR technique that uses RGB and skeleton information with a convolutional neural network (convnet) and long short-term memory (LSTM), a type of recurrent neural network (RNN). The proposed method has two parts. First, motion representation images, namely the motion history image (MHI) and the motion energy image (MEI), are created from the RGB videos, and a convnet is trained on these images with feature-level fusion. Second, the skeleton data are processed by a proposed algorithm that builds skeleton intensity images for three views (top, front and side). Each view is first analyzed by a convnet that generates a set of feature maps, which are fused for further analysis. On top of the convnet sub-networks, an LSTM is used to exploit temporal dependencies. The softmax scores from these two independent parts are then combined at the decision level. In addition to this approach to HAR, the paper presents a strategy that uses a cyclic learning rate to develop a multi-modal neural network by training the model only once, which makes the system more efficient. The suggested approach makes full use of the RGB and skeleton data available from an RGB-D sensor. It has been tested on three well-known and challenging multimodal datasets: UTD-MHAD, CAD-60 and NTU RGB+D 120. The results show that the method performs satisfactorily in comparison with other state-of-the-art systems.

Keywords  Human activity recognition (HAR) · Recurrent neural network (RNN) · Convolutional neural network (convnet) · Motion energy image (MEI) · Motion history image (MHI) · Weighted product model (WPM)
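The motion history and motion energy images named in the abstract can be illustrated with a minimal sketch. It follows the standard textbook definition (a moving pixel is reset to a duration tau, a still pixel decays by one per frame, and the MEI marks every pixel with a non-zero history). It is not the authors' implementation: the function name `motion_images`, the frame-differencing motion detector and the `diff_threshold` parameter are assumptions made for illustration only.

```python
def motion_images(frames, tau, diff_threshold=30):
    """Return (mhi, mei) for a list of equally sized grayscale frames.

    Each frame is a list of rows of integer intensities. Motion is detected
    by thresholded frame differencing, which is an assumed choice here.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to measure motion")
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                moving = abs(curr[y][x] - prev[y][x]) > diff_threshold
                # Moving pixels are reset to tau; still pixels decay by one.
                mhi[y][x] = tau if moving else max(mhi[y][x] - 1, 0)
    # The MEI marks every pixel that moved within the last tau steps.
    mei = [[1 if value > 0 else 0 for value in row] for row in mhi]
    return mhi, mei


# A bright dot moving left to right across a 1 x 4 image.
frames = [
    [[255, 0, 0, 0]],
    [[0, 255, 0, 0]],
    [[0, 0, 255, 0]],
    [[0, 0, 0, 255]],
]
mhi, mei = motion_images(frames, tau=2)
print(mhi)  # [[0, 1, 2, 2]]
print(mei)  # [[0, 1, 1, 1]]
```

In this toy example the MHI values grow toward the most recent position of the dot, so the image encodes where and how recently motion occurred, while the MEI only records where it occurred.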

Communicated by Y. Kong.
* Pratishtha Verma
¹ Indian Institute of Technology (BHU), Varanasi, India

1 Introduction

Human activity recognition (HAR) is an active research area in computer vision. Human activities can be viewed as a series of basic movements. For instance, activities such as hand waving and brushing hair can be represented as a series of continuous lowering and raising of the hand. A HAR system recognizes such activities performed in videos and images. The system has wide applications in the fields of robotics, security, industrial automation, and human–computer interaction. The key challenge of HAR is to identify the class of an action robustly, regardless of variations in external conditions and in the clothing of the people performing the actions. Man