People Counting in Videos by Fusing Temporal Cues from Spatial Context-Aware Convolutional Neural Networks
1 School of Computer Science and Mathematics, Kingston University, Kingston, UK
2 Department of Computer Science, Universidad Carlos III de Madrid, Getafe, Spain
3 Departmento de Informática, Universidad de Santiago de Chile, Santiago, Chile
4 Faculty of Engineering and Applied Sciences, Universidad de los Andes, Santiago, Chile
Abstract. We present an efficient method for people counting in video sequences from fixed cameras by utilising the responses of spatially context-aware convolutional neural networks (CNNs) in the temporal domain. For stationary cameras, the background remains fairly static, while foreground characteristics, such as size and orientation, may depend on image location. Training a CNN on whole frames therefore improves the differentiation between background and foreground pixels. The foreground density, which represents the presence of people in the environment, can then be associated with people counts. Moreover, fusing the count estimates in the temporal domain can further enhance the accuracy of the final count. Our methodology was tested on the publicly available Mall dataset and achieved a mean deviation error of 0.091.
Keywords: People counting · Convolutional neural networks · Video analysis
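As an illustration of the two quantities named in the abstract, the sketch below smooths per-frame count estimates over time and scores them with a mean deviation error. It is only a sketch under stated assumptions: this excerpt does not specify the fusion rule, so a centred sliding median is used as a stand-in, and the error is assumed to follow the definition commonly used with the Mall dataset, the mean of |estimate − truth| / truth over frames. The function names `fuse_counts` and `mean_deviation_error` and the `window` parameter are hypothetical.

```python
from statistics import median


def fuse_counts(frame_counts, window=5):
    """Smooth per-frame count estimates with a centred sliding median.

    Assumption: the median is a stand-in for the paper's temporal fusion,
    which this excerpt does not describe.
    """
    half = window // 2
    fused = []
    for i in range(len(frame_counts)):
        lo = max(0, i - half)                      # clip window at sequence start
        hi = min(len(frame_counts), i + half + 1)  # clip window at sequence end
        fused.append(median(frame_counts[lo:hi]))
    return fused


def mean_deviation_error(estimates, truths):
    """Mean of |estimate - truth| / truth over frames with a non-zero truth.

    Assumption: this is the relative-error definition commonly reported
    for the Mall dataset.
    """
    pairs = [(e, t) for e, t in zip(estimates, truths) if t > 0]
    return sum(abs(e - t) / t for e, t in pairs) / len(pairs)


if __name__ == "__main__":
    raw = [10, 11, 30, 11, 10]  # hypothetical per-frame estimates with one outlier
    print(fuse_counts(raw, window=3))
    print(mean_deviation_error([11, 9], [10, 10]))
```

A median is chosen here because it suppresses isolated outlier frames without lagging behind gradual changes in the count; other temporal filters would fit the same interface.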
P. Sourtzinos et al. © Springer International Publishing Switzerland 2016. G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part II, LNCS 9914, pp. 655–667, 2016. DOI: 10.1007/978-3-319-48881-3_46
1 Introduction
Counting people can provide useful information for monitoring public areas, assist urban planners in designing more efficient environments, provide cues for situations that might endanger the safety of civilians, and help shopping mall and retail store managers evaluate their business practices. In principle, such knowledge can be obtained by analysing image and video footage from location-specific cameras with the goal of measuring the number of people in them. For this reason, in this work we present an efficient method for counting people in images and video sequences from fixed cameras, which incorporates the fusion of context-aware cues from CNNs in the temporal domain.
People counting is a very challenging problem, and although commercial solutions exist, these focus mainly on top-view cameras, where occlusions between people are minimal. An effective approach is to detect the heads of the pedestrians present in an image, since heads are less likely to be hidden by occlusions, and then sum the head detections to obtain the total count. Such an approach seems consistent with how humans would approach the problem, as implied by expressions such as ‘headcount’. Furthermore, since our interest is in counting people using stationary cameras, where the background is assumed fairly static, a local context-aware detector that is spatially tuned to distinguish foreground objects (e.g. hea