Fast and robust key frame extraction method for gesture video based on high-level feature representation



ORIGINAL PAPER

Huimin Yang1 · Qiuhong Tian1 · Qiaoli Zhuang1 · Linye Li1 · Qinglong Liang1

Received: 15 March 2020 / Revised: 12 August 2020 / Accepted: 11 September 2020 © Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

In gesture video, inter-frame differences are too subtle to be captured by low-level features, and the gesture frames that carry semantic information make up only a small fraction of the whole video. This paper introduces a fast and robust key frame extraction method for gesture video, founded upon high-level feature representation, which extracts gesture key frames precisely without losing semantic information. First, a gesture video segmentation model is designed by employing SSD, which classifies gesture video into semantic scenes and static scenes. Then, a 2D-DWT-based perceptual hash algorithm is studied to extract candidate static key frames. Afterward, the multi-channel histogram of gradient magnitude frequency (HGMF-MC), based on an improved VGG16, is developed as a new image descriptor. Finally, a key frame extraction mechanism based on HGMF-MC is proposed to generate gesture video summaries for the two scene types, respectively. Experiments consistently show the superiority of the proposed method on the Chinese sign language, Cambridge, ChaLearn and CVRR-Hands gesture datasets. The results demonstrate that the proposed method is effective: it improves the video compression ratio and outperforms state-of-the-art methods.

Keywords Gesture video classification · Improved VGG16 · Histogram of gradient magnitude frequency · 2D-DWT-based perceptual hash
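The paper does not reproduce the hashing step here, but the idea of a DWT-based perceptual hash can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: it uses a hand-rolled one-level Haar averaging in place of a full 2D-DWT, and thresholds the low-frequency subband at its median to produce hash bits, so near-duplicate frames yield hashes with a small Hamming distance.

```python
import numpy as np

def haar_ll(img):
    """One Haar-style low-pass level: average adjacent rows, then columns.
    This keeps only the LL (approximation) subband of a 2D-DWT."""
    rows = (img[0::2, :] + img[1::2, :]) / 2.0
    return (rows[:, 0::2] + rows[:, 1::2]) / 2.0

def dwt_phash(img, levels=2):
    """Perceptual hash: bit is 1 where the low-frequency subband exceeds
    its own median, 0 elsewhere (robust to global brightness shifts)."""
    ll = img.astype(np.float64)
    for _ in range(levels):
        ll = haar_ll(ll)
    return (ll > np.median(ll)).flatten()

def hamming(h1, h2):
    """Number of differing hash bits between two frames."""
    return int(np.count_nonzero(h1 != h2))
```

Frames whose Hamming distance falls below a tuned threshold would be treated as duplicates of the same candidate static key frame.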

1 Introduction

Gesture recognition (GR) plays a dominant role in human–computer interaction, and dynamic gesture recognition better meets its real-time needs. A dynamic gesture video is often converted into hundreds of video frames, and analyzing such an amount of data is a complex, time-consuming task. Key frame extraction decreases the amount of data to process and can thus improve the real-time performance of a gesture recognition algorithm; it is therefore an effective way to generate a video summary. The extracted key frames should represent the sequence information of the whole gesture video without missing important content, and at the same time they should not be similar to one another.
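The two requirements just stated (full coverage, low redundancy) can be illustrated with a minimal greedy selection sketch. This is a hypothetical baseline, not the method proposed in the paper: a frame is kept as a key frame only when its mean absolute difference from the last kept frame exceeds a threshold.

```python
import numpy as np

def extract_key_frames(frames, threshold=0.1):
    """Greedy key frame selection over a list of equally-shaped float
    arrays in [0, 1]. Keeps frame i when its mean absolute difference
    from the most recently kept key frame exceeds `threshold`."""
    if len(frames) == 0:
        return []
    keys = [0]  # always keep the first frame for coverage
    for i in range(1, len(frames)):
        diff = float(np.mean(np.abs(frames[i] - frames[keys[-1]])))
        if diff > threshold:  # sufficiently dissimilar -> new key frame
            keys.append(i)
    return keys
```

Such low-level pixel differencing is exactly what fails on gesture video, where inter-frame changes are subtle, which motivates the high-level feature representation developed in this paper.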

Corresponding author: Qiuhong Tian ([email protected])
Huimin Yang ([email protected])

1 Zhejiang Sci-Tech University, Hangzhou, China

At present, key frame extraction algorithms for video can be categorized into four classes: clustering based, motion information based, video segmentation based and deep learning based. Clustering is a widely used key frame extraction approach [1]: video frames are clustered according to similarity, and the frame closest to each cluster center is selected as the key