Face Detection with End-to-End Integration of a ConvNet and a 3D Model


1 Nat’l Engineering Laboratory for Video Technology, Key Laboratory of Machine Perception (MoE), Cooperative Medianet Innovation Center, Shanghai, Sch’l of EECS, Peking University, Beijing 100871, China
{leo.liyunzhu,sunbenyuan,Yizhou.Wang}@pku.edu.cn
2 Department of ECE and the Visual Narrative Cluster, North Carolina State University, Raleigh, USA
tianfu [email protected]

Abstract. This paper presents a method for face detection in the wild, which integrates a ConvNet and a 3D mean face model in an end-to-end multi-task discriminative learning framework. The 3D mean face model is predefined and fixed (e.g., we used the one provided in the AFLW dataset). The ConvNet consists of two components: (i) The face proposal component computes face bounding box proposals by estimating facial key-points and the 3D transformation (rotation and translation) parameters for each predicted key-point w.r.t. the 3D mean face model. (ii) The face verification component computes detection results by pruning and refining proposals using a configuration pooling layer based on the facial key-points. The proposed method addresses two issues in adapting state-of-the-art generic object detection ConvNets (e.g., Faster R-CNN) for face detection: (i) One is to eliminate the heuristic design of predefined anchor boxes in the region proposal network (RPN) by exploiting a 3D mean face model. (ii) The other is to replace the generic RoI (Region-of-Interest) pooling layer with a configuration pooling layer that respects the underlying object structure. The multi-task loss consists of three terms: the classification Softmax loss and the location smooth-L1 losses of both the facial key-points and the face bounding boxes. In experiments, our ConvNet is trained on the AFLW dataset only and tested on the FDDB benchmark with fine-tuning and on the AFW benchmark without fine-tuning. The proposed method obtains very competitive state-of-the-art performance on the two benchmarks.

Keywords: Face detection · Face 3D model · ConvNet · Deep learning · Multi-task learning
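To make the two ideas summarized in the abstract concrete, the following is a minimal sketch and not the authors' code: all function names, the weak-perspective projection, and the toy key-point coordinates are illustrative assumptions. It shows (i) how a fixed 3D mean face, under estimated rotation and translation parameters, projects to 2D key-points whose tight bounding box serves as a face proposal, and (ii) the general form of a multi-task loss combining a classification Softmax term with smooth-L1 terms for key-point and box locations.

```python
# Illustrative sketch (not the paper's implementation) of:
# (i) projecting a fixed 3D mean face with estimated rotation/translation to
#     obtain 2D key-points and a face bounding-box proposal, and
# (ii) a multi-task loss: Softmax classification + smooth-L1 regression terms.
import numpy as np

def project_mean_face(mean_face_3d, rotation, translation, scale=1.0):
    """Project N x 3 mean-face key-points to 2D with a weak-perspective model.

    rotation:    3x3 rotation matrix (estimated per proposal)
    translation: length-2 image-plane offset in pixels
    """
    rotated = mean_face_3d @ rotation.T               # N x 3
    return scale * rotated[:, :2] + translation       # drop depth, shift

def proposal_box_from_keypoints(keypoints_2d):
    """Face proposal = tight bounding box of the projected key-points."""
    x_min, y_min = keypoints_2d.min(axis=0)
    x_max, y_max = keypoints_2d.max(axis=0)
    return np.array([x_min, y_min, x_max, y_max])

def smooth_l1(pred, target):
    """Smooth-L1 (Huber) loss used for box / key-point regression."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()

def softmax_cross_entropy(logits, label):
    """Softmax classification loss for the face / non-face decision."""
    z = logits - logits.max()
    log_prob = z - np.log(np.exp(z).sum())
    return -log_prob[label]

# Toy example: 5 key-points of a hypothetical mean face (eyes, nose, mouth corners).
mean_face = np.array([[-1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.5],
                      [-0.8, -1.0, 0.0], [0.8, -1.0, 0.0]])
R = np.eye(3)                       # identity rotation, for illustration only
t = np.array([50.0, 60.0])          # image-plane translation (pixels)
kps = project_mean_face(mean_face, R, t, scale=20.0)
box = proposal_box_from_keypoints(kps)

# Multi-task loss = classification + key-point regression + box regression.
loss = (softmax_cross_entropy(np.array([2.0, -1.0]), label=0)
        + smooth_l1(kps, kps + 0.1)          # key-point location term
        + smooth_l1(box, box + 0.5))         # bounding-box location term
print(box, loss)
```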

Y. Li and B. Sun contributed equally to this work and are joint first authors.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part III, LNCS 9907, pp. 420–436, 2016. DOI: 10.1007/978-3-319-46487-9_26


1 Introduction

1.1 Motivation and Objective

Face detection has been used as a core module in a wide spectrum of applications such as surveillance, mobile communication, and human-computer interaction. It is arguably one of the most successful applications of computer vision. Face detection in the wild continues to play an important role in the era of visual big data (e.g., images and videos on the web and in social media). However, it remains a challenging problem in computer vision due to the large appearance variations caused by nuisance factors including viewpoint, occlusion, facial expression, resolution, illumination, and cosmetics. Computer vision researchers have long studied how to learn a better representa