RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild


ORIGINAL ARTICLE

RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild

Rafael Berral-Soler1 · Francisco J. Madrid-Cuevas1 · Rafael Muñoz-Salinas1 · Manuel J. Marín-Jiménez1

Received: 11 May 2020 / Accepted: 4 November 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract
Human head pose estimation in images has applications in many fields, such as human–computer interaction or video surveillance tasks. In this work, we address this problem, defined here as the estimation of both the vertical (tilt/pitch) and horizontal (pan/yaw) angles, through the use of a single Convolutional Neural Network (ConvNet) model, trying to balance precision and inference speed in order to maximize its usability in real-world applications. Our model is trained on the combination of two datasets: 'Pointing'04' (aiming at covering a wide range of poses) and 'Annotated Facial Landmarks in the Wild' (in order to improve the robustness of our model on real-world images). Three different partitions of the combined dataset are defined and used for training, validation and testing purposes. As a result of this work, we have obtained a trained ConvNet model, coined RealHePoNet, which, given a low-resolution grayscale input image and without the need for facial landmarks, is able to estimate both tilt and pan angles with low error (4.4° average error on the test partition). Also, given its low inference time (6 ms per head), we consider our model usable even when paired with medium-spec hardware (i.e. a GTX 1060 GPU).

Code available at: https://github.com/rafabs97/headpose_final
Demo video at: https://www.youtube.com/watch?v=2UeuXh5DjAE

Keywords: Human head pose estimation · ConvNets · Human–computer interaction · Deep Learning

Abbreviations
AFLW     Annotated Facial Landmarks in the Wild
CNN      Convolutional Neural Network
Conv     Convolution
ConvNet  Convolutional Neural Network
CT       Confidence Threshold
FC       Fully connected
flops    Floating point operations per second
FPS      Frames per second
HPE      Head pose estimation
IoU      Intersection over Union
MAE      Mean Absolute Error
MSE      Mean Squared Error
SSD      Single Shot Detector
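To make the intended usage concrete, the following is a minimal inference sketch for a single-stage estimator of this kind. It assumes a Keras model file exported from the repository linked above; the file name ("head_pose_model.h5"), the 64x64 grayscale input size, the [0, 1] normalization and the (tilt, pan) output order are illustrative assumptions, not the repository's documented interface.

```python
# Minimal inference sketch (hypothetical interface; see lead-in for assumptions).
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("head_pose_model.h5")  # hypothetical file name

def estimate_pose(head_crop_bgr):
    """Return (tilt, pan) in degrees for a cropped head image."""
    gray = cv2.cvtColor(head_crop_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (64, 64))                      # assumed input size
    x = gray.astype(np.float32)[np.newaxis, :, :, np.newaxis] / 255.0
    tilt, pan = model.predict(x, verbose=0)[0]             # assumed output order
    return float(tilt), float(pan)
```

Because the model works directly on low-resolution grayscale crops and needs no facial landmarks, a call like this can be placed straight after any head detector in a per-frame loop, which is what keeps the reported per-head inference time low.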

Correspondence to: Manuel J. Marín-Jiménez, [email protected]

1 Department of Computing and Numerical Analysis, University of Cordoba, Cordoba, Spain

1 Introduction

Given a human head detected in a picture, we can define the task of head pose estimation (HPE) as the estimation, relative to the camera, of both the vertical (tilt/pitch) and horizontal (pan/yaw) angles (see Fig. 1); a third angle (roll) can also be estimated, but it falls outside the scope of this work. Human head pose estimation is useful in many situations: for instance, in vehicles (detecting whether the driver is paying attention to the road [31]), human–computer interaction (detecting where the user's attention is being drawn [44]), social interaction understanding (detecting whether people are looking at each other [28]), video surveillance systems [18, 36], or to aid various aerial cinematography tasks [3