Enhancing feature fusion for human pose estimation
ORIGINAL PAPER
Rui Wang1 · Jiangwei Tong1 · Xiangyang Wang1

Received: 6 January 2020 / Revised: 12 May 2020 / Accepted: 17 July 2020 / Published online: 24 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
Current human pose estimation methods mainly rely on designing efficient Convolutional Neural Network (CNN) frameworks. These CNN architectures typically consist of high-to-low-resolution sub-networks that learn semantic information, followed by low-to-high-resolution sub-networks that restore the resolution to locate the keypoints. Low-level features have high resolution but little semantic information, while high-level features have rich semantic information but lack high-resolution detail, so fusing features from different levels is important for the final performance. However, most existing models implement feature fusion by simply concatenating low-level and high-level features, without considering the gap between their spatial resolutions and semantic levels. In this paper, we propose a new feature fusion method for human pose estimation. We introduce high-level semantic information into low-level features to enhance feature fusion. Further, to retain both the high-level semantic information and the high-resolution location details, we use Global Convolutional Network (GCN) blocks to bridge the gap between low-level and high-level features. Experiments on the MPII and LSP human pose estimation datasets demonstrate that efficient feature fusion can significantly improve performance. The code is available at: https://github.com/tongjiangwei/FeatureFusion.

Keywords Human pose estimation · Convolutional neural networks · Feature fusion · Global convolutional network (GCN)
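As background on the GCN blocks named in the abstract: a Global Convolutional Network block approximates a large k × k convolution with two parallel separable branches (1 × k followed by k × 1, and k × 1 followed by 1 × k), preserving a large effective receptive field at a fraction of the parameter cost. A minimal parameter-count sketch follows; the channel sizes and kernel size are illustrative assumptions, not values taken from this paper:

```python
def conv_params(c_in, c_out, kh, kw):
    # Weight count of a single convolution layer (bias ignored).
    return c_in * c_out * kh * kw

def gcn_params(c_in, c_out, k):
    # GCN block: two parallel branches, (1xk -> kx1) and (kx1 -> 1xk),
    # whose outputs are summed.
    branch = conv_params(c_in, c_out, 1, k) + conv_params(c_out, c_out, k, 1)
    return 2 * branch

dense = conv_params(256, 256, 15, 15)  # a plain 15x15 convolution
gcn = gcn_params(256, 256, 15)         # separable GCN approximation
print(dense, gcn)                      # 14745600 3932160
```

The separable design covers the same 15 × 15 receptive field with under a third of the parameters, which is what makes large-kernel fusion blocks practical.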
1 Introduction

2D human pose estimation (HPE) is a challenging problem in computer vision. It aims to recognize and locate human anatomical keypoints in images, which is fundamental for other applications such as human action recognition, human-computer interaction and animation. Recently, most human pose estimation methods have achieved state-of-the-art performance by using Convolutional Neural Networks (CNNs). For instance, as shown in Fig. 1a, Hourglass [1] proposes an exemplary encoder-decoder structure: the encoder consists of high-to-low-resolution networks, and the decoder recovers the full resolution through a low-to-high process. PyraNet [2] introduces a Pyramid Residual Module to learn image features in multi-scale convolutional
Corresponding author: Xiangyang Wang [email protected]
Rui Wang [email protected] · Jiangwei Tong [email protected]
1 School of Communication and Information Engineering, Shanghai University, Shanghai, China
networks. Furthermore, based on a ResNet [3] backbone, SimpleBaseline [4] adopts transposed convolutions to restore high-resolution representations. Despite this great progress, some challenges remain, such as occluded keypoints and cluttered backgrounds. The main reason is that these networks do not properly handle the
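The transposed-convolution upsampling used in decoders such as SimpleBaseline's can be illustrated with a minimal single-channel sketch; this is a naive NumPy implementation, and the feature and kernel values are illustrative assumptions:

```python
import numpy as np

def transposed_conv2d(x, w, stride=2):
    """Naive 2-D transposed convolution: each input pixel scatters a
    scaled copy of the kernel into the (larger) output grid."""
    h_in, w_in = x.shape
    kh, kw = w.shape
    out = np.zeros((stride * (h_in - 1) + kh, stride * (w_in - 1) + kw))
    for i in range(h_in):
        for j in range(w_in):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * w
    return out

feat = np.arange(4.0).reshape(2, 2)   # a 2x2 low-resolution feature map
kernel = np.ones((2, 2))
up = transposed_conv2d(feat, kernel)  # upsampled to 4x4
print(up.shape)                       # (4, 4)
```

With stride equal to the kernel size there is no overlap, so each input value simply tiles a 2 × 2 block of the output; smaller strides produce overlapping, summed contributions, which is how learned transposed convolutions smooth the upsampled map.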