Supervised Transformer Network for Efficient Face Detection



Abstract. Large pose variations remain a challenge that confronts real-world face detection. We propose a new cascaded Convolutional Neural Network, dubbed the Supervised Transformer Network, to address this challenge. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously predicts candidate face regions along with associated facial landmarks. The candidate regions are then warped by mapping the detected facial landmarks to their canonical positions to better normalize the face patterns. The second stage, which is an RCNN, then verifies whether the warped candidate regions are valid faces. We conduct end-to-end learning of the cascaded network, including optimizing the canonical positions of the facial landmarks. This supervised learning of the transformations automatically selects the best scale to differentiate face/non-face patterns. By combining feature maps from both stages of the network, we achieve state-of-the-art detection accuracy on several public benchmarks. For real-time performance, we run the cascaded network only on regions of interest produced by a boosting cascade face detector. Our detector runs at 30 FPS on a single CPU core for a VGA-resolution image.
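The warp described above maps each candidate's detected landmarks onto canonical positions. In the paper those canonical positions are learned end-to-end; as a rough sketch of the underlying geometric step only, the snippet below computes a closed-form least-squares similarity transform (Umeyama's method) from detected landmarks to a fixed set of canonical points. The five-point layout and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping src landmarks (N x 2) onto dst landmarks (N x 2).
    Returns a 2x3 affine matrix, following Umeyama's method."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    # Cross-covariance between centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Reflection guard: force a proper rotation (det R = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])

# Hypothetical five-point canonical layout (eyes, nose tip, mouth corners)
# in a normalized 100x100 patch -- illustrative values only.
canonical = np.array([[30.0, 50.0], [70.0, 50.0], [50.0, 70.0],
                      [35.0, 90.0], [65.0, 90.0]])
```

The resulting 2x3 matrix could then be applied to the candidate region with an image-warping routine such as OpenCV's `warpAffine` to produce the normalized patch fed to the second-stage RCNN.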

1 Introduction

Among the various factors that confront real-world face detection, large pose variations remain a big challenge. For example, the seminal Viola-Jones [1] detector works well for near-frontal faces, but becomes much less effective for faces in poses far from frontal views, due to the weakness of the Haar features on non-frontal faces. Abundant works have attempted to tackle large pose variations under the boosting-cascade regime advocated by Viola and Jones [1]. Most of them adopt a divide-and-conquer strategy to build a multi-view face detector. Some works [2–4] proposed to train a detector cascade for each view and combine the results of all detectors at test time. Some other works [5–7] proposed to first estimate the face pose and then run the cascade of the corresponding face pose to verify the detection. The complexity of the former approach increases with the number of pose categories, while the accuracy of the latter is prone to mistakes in pose estimation. Part-based models offer an alternative solution [8–10]. These detectors are flexible and robust to both pose variation and partial occlusion, since they can reliably detect faces based on a few confident part detections. However, these methods typically require the target face to be large and clear, which is essential to reliably model the parts. Other works approach this issue by using more sophisticated

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46454-1_8) contains supplementary material, which is available to authorized users.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part V, LNCS 9909, pp. 122–138, 2016. DOI: 10.1007/978-3-319-46454-1_8