SD-Net: Understanding overcrowded scenes in real-time via an efficient dilated convolutional neural network



SPECIAL ISSUE PAPER

Noman Khan1 · Amin Ullah1 · Ijaz Ul Haq1 · Varun G. Menon2 · Sung Wook Baik1

Received: 10 May 2020 / Accepted: 13 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Advances in computer vision technologies have attracted many researchers to surveillance applications, particularly the automated analysis of crowded scenes, such as crowd counting in very congested scenes. In crowd counting, the main goal is to count or estimate the number of people in a particular scene. Understanding overcrowded scenes in real-time is important for instant responsive actions. However, it is a very difficult task due to key challenges including cluttered backgrounds, occlusion, variations in human pose and scale, and limited surveillance training data, which are inadequately covered in the existing literature. To tackle these challenges, we introduce “SD-Net”, an end-to-end CNN architecture that produces high-quality density maps in real-time and effectively counts people in extremely overcrowded scenes. The proposed architecture consists of depthwise separable, standard, and dilated 2D convolutional layers. Depthwise separable and standard 2D convolutional layers are used to extract 2D features. Instead of pooling layers, dilated 2D convolutional layers are employed, which results in large receptive fields and reduces the number of parameters. Our CNN architecture is evaluated on four publicly available crowd analysis datasets, demonstrating superiority over the state-of-the-art in terms of accuracy and model size.

Keywords  Crowd counting · Crowded scenes · Deep learning · Dilated convolutional neural network · Real-time · Surveillance
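The two efficiency arguments in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the channel counts, kernel sizes, and layer stacks are hypothetical and are not SD-Net's actual configuration. It counts the parameters of a standard versus a depthwise separable convolution, and computes the receptive field of a stack of convolutions with and without dilation.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameter count of a standard k x k 2D convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def separable_params(c_in, c_out, k, bias=True):
    """Parameter count of a depthwise separable convolution:
    a k x k depthwise conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv that mixes channels."""
    depthwise = c_in * k * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a stack of
    layers, each given as a (kernel, dilation, stride) tuple."""
    rf, jump = 1, 1
    for k, d, s in layers:
        k_eff = d * (k - 1) + 1      # effective kernel size under dilation
        rf += (k_eff - 1) * jump     # growth contributed by this layer
        jump *= s                    # stride enlarges the step of later layers
    return rf


# Hypothetical layer sizes, chosen only for illustration.
standard = conv_params(64, 128, 3)
separable = separable_params(64, 128, 3)
plain_stack = receptive_field([(3, 1, 1)] * 3)    # three ordinary 3x3 convs
dilated_stack = receptive_field([(3, 2, 1)] * 3)  # same convs with dilation 2

print(standard, separable)         # parameters: standard vs separable
print(plain_stack, dilated_stack)  # receptive field: plain vs dilated
```

Dilation leaves the number of weights per layer unchanged, so in this sketch the dilated stack sees a wider context at the same parameter cost and without the loss of resolution that pooling introduces.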

* Sung Wook Baik [email protected]

1 Intelligent Media Laboratory, Digital Contents Research Institute, Sejong University, Seoul, Republic of Korea

2 Department of Computer Science and Engineering, SCMS School of Engineering and Technology, Ernakulam 683576, India

1 Introduction
Recently, numerous network models have been built to provide encouraging solutions for crowd flow monitoring, assembly control, and other security services [1, 2]. Such methods can be divided into two key groups according to their output, i.e., counting people or generating density maps [3]. People counting methods take a video frame or an image as input and output a single numeric value indicating the number of persons present in the input image, while density map based methods try to show how the crowd is actually distributed across the image [4]. However, different images containing similar numbers of individuals can have different distributions, as shown in Fig. 1, so people counting methods alone cannot fully analyze a crowd in real-world applications. Therefore, density distribution maps based methods are more suitable for risky environments such