Multi-view region-adaptive multi-temporal DMM and RGB action recognition

THEORETICAL ADVANCES

Mahmoud Al-Faris¹ · John P. Chiverton¹ · Yanyan Yang² · David Ndzi³
Received: 11 April 2019 / Accepted: 3 April 2020
© The Author(s) 2020

Abstract
Human action recognition remains an important yet challenging task. This work proposes a new action recognition system based on a novel multi-view region-adaptive multi-resolution-in-time depth motion map (MV-RAMDMM) formulation combined with appearance information. Multi-stream 3D convolutional neural networks (CNNs) are trained on the different views and time resolutions of the region-adaptive depth motion maps. Multiple views are synthesised to enhance view invariance. The region-adaptive weights, based on localised motion, accentuate and differentiate the parts of actions that possess faster motion. Dedicated 3D CNN streams for multi-time-resolution appearance information are also included; these help to identify and differentiate between interactions with small objects. A pre-trained 3D CNN is fine-tuned for each stream and used along with multi-class support vector machines (SVMs). Average score fusion is applied to the outputs. The developed approach is capable of recognising both human actions and human–object interactions. Three public-domain data-sets, namely MSR 3D Action, Northwestern UCLA multi-view actions and MSR 3D daily activity, are used to evaluate the proposed solution. The experimental results demonstrate the robustness of this approach compared with state-of-the-art algorithms.

Keywords: Action recognition · DMM · 3D CNN · Region adaptive
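A conventional depth motion map [10] accumulates the absolute depth change between consecutive frames of a projected view, roughly DMM = Σ_t |D_{t+1} − D_t|. The Python sketch below illustrates that accumulation for the front view only, together with average score fusion (the mean of per-stream class scores followed by an argmax). It is a minimal illustration under stated assumptions, not the authors' MV-RAMDMM: the function names are hypothetical, the `step` parameter (keeping every `step`-th frame) is only an assumed way of obtaining a coarser time resolution, and the region-adaptive weights, synthesised views, side/top projections and the 3D CNN/SVM stages are not shown.

```python
from typing import Dict, List, Sequence, Tuple

# A depth frame is represented as rows of depth values (hypothetical layout).
Frame = List[List[float]]


def front_view_dmm(frames: Sequence[Frame], step: int = 1) -> Frame:
    """Sum |D_next - D_prev| over consecutive sampled frames (front view only).

    ASSUMPTION: step > 1 keeps every step-th frame to emulate a coarser
    time resolution; this is not taken from the paper.
    """
    if not frames:
        raise ValueError("need at least one depth frame")
    if step < 1:
        raise ValueError("step must be >= 1")
    sampled = list(frames[::step])
    rows, cols = len(sampled[0]), len(sampled[0][0])
    dmm = [[0.0] * cols for _ in range(rows)]
    # Accumulate the absolute depth change at every pixel.
    for prev, curr in zip(sampled, sampled[1:]):
        for r in range(rows):
            for c in range(cols):
                dmm[r][c] += abs(curr[r][c] - prev[r][c])
    return dmm


def multi_temporal_dmms(frames: Sequence[Frame],
                        steps: Sequence[int] = (1, 2, 4)) -> Dict[int, Frame]:
    """One front-view DMM per (assumed) temporal step."""
    return {s: front_view_dmm(frames, s) for s in steps}


def average_score_fusion(
        stream_scores: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Average class scores across streams; return (fused scores, best class)."""
    if not stream_scores:
        raise ValueError("need scores from at least one stream")
    n_streams = len(stream_scores)
    n_classes = len(stream_scores[0])
    fused = [sum(scores[k] for scores in stream_scores) / n_streams
             for k in range(n_classes)]
    best = max(range(n_classes), key=lambda k: fused[k])
    return fused, best


if __name__ == "__main__":
    # Three tiny 2x2 depth frames, purely illustrative.
    frames = [[[0, 0], [0, 0]], [[1, 0], [0, 2]], [[1, 3], [0, 0]]]
    print(front_view_dmm(frames))  # [[1.0, 3.0], [0.0, 4.0]]
    _, label = average_score_fusion([[0.1, 0.7, 0.2], [0.3, 0.5, 0.2]])
    print(label)  # 1
```

Averaging treats every stream equally; a weighted average would be an obvious variant, but the abstract only states that average score fusion is used.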

* John P. Chiverton
[email protected]

1 School of Energy and Electronic Engineering, University of Portsmouth, Portsmouth PO1 3DJ, UK
2 School of Computing, University of Portsmouth, Portsmouth PO1 3HE, UK
3 School of Computing, Engineering and Physical Sciences, University of the West of Scotland, Paisley PA1 2BE, UK

1 Introduction

Action recognition is a key step in many application areas, and the potential areas of interest are wide. They include automated security monitoring [1], social applications [2], intelligent transportation [3], smart hospitals [4] and smart homes [5]. Action recognition methods can be based on a number of different sources of features, such as space-time interest points [6], and improved trajectory features with Fisher vectors [7, 8]. These techniques model motion in video data, which is an important source of information for recognising actions. Instead of points of motion, less localised sources of motion can also be considered: the motion of the body as a whole can be modelled with motion history images (MHIs) [9], and motion at boundaries with motion boundary histograms (MBHs) [7]. Depth can also be incorporated with techniques such as depth motion maps (DMMs) [10]. These sources of what might be considered handcrafted features are rich in information, but they are not always able to capture all the relevant aspects of motion that might be needed to help