Deep Neural Network Compression via Knowledge Distillation for Embedded Vision Applications

Deep learning has recently pushed the boundaries of computational understanding of complex problems that were earlier thought to be unsolvable or not even conceived. The challenge arises when implementing such deep neural networks on the limited hardware resources of embedded platforms.

1 Introduction

Over the last half decade, deep learning has pushed the boundaries of machines' ability to imitate human-like decision-making. Graphics Processing Units (GPUs) combined with neural architectures are the major contributors to this development; with their parallel processing of data chunks, GPUs have overshadowed other computing platforms in this field. Knowledge distillation is one such idea, initially proposed by Bucila et al. [1] and revisited with a new perspective in 2015 by Hinton et al. [2]. In this approach, a lightweight model, called the student, learns from a high-capacity model, called the teacher. The teacher–student approach imitates the transfer of knowledge from one entity to another. The teacher is a heavyweight model in terms of memory size, runtime memory requirement, computation cost, computation time, and so on. Training it is feasible only with a high-end GPU, which can complete the computation within a limited period; the main processor alone cannot finish these computations in a comparable time.

In this work, a teacher–student model is proposed for applications with limited computation capability. The student is trained along with the teacher, and the resulting network can be deployed in memory-limited settings. The proposed structural model distillation for memory reduction uses a student model that is a simplified version of the teacher: no redesign is needed, and the same hyperparameters can be reused. With this approach, substantial memory savings are possible with very little loss of accuracy, and knowledge distillation enables the student model to perform better than the same student model trained directly on the data.
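As a hedged illustration of the teacher–student training described above, the sketch below shows a soft-target distillation loss in the style of Hinton et al. [2], assuming PyTorch. The temperature T, the mixing weight alpha, and the `teacher`/`student` models are illustrative placeholders, not details taken from this paper.

```python
# Minimal sketch of soft-target knowledge distillation in the style of [2].
# T (temperature) and alpha (soft/hard mixing weight) are illustrative values;
# `teacher` and `student` stand for any pretrained heavyweight model and its
# lightweight counterpart.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: the student matches the teacher's softened class probabilities.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft- and hard-target gradients are comparable
    # Hard targets: the usual cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


def train_step(student, teacher, x, labels, optimizer):
    # The teacher is frozen; only the student's weights are updated.
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```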


2 State of the Art

Figure 1 demonstrates the architecture of a typical neural network, which receives data from the external world at the input layer and produces classification results as class probabilities at the output layer. Between the input and output layers there are multiple interconnected hidden layers, which may have feed-forward or feedback connections. All layers consist of a basic unit called the neuron (shown as circles in Fig. 1). The computational output of one neuron is passed on to the neurons in the next layer. The neurons are triggered by a nonlinear activation function, like biological neurons in our brain. The present work considers feed-forward connections, in which information is transferred from the input side to the output side. Neural network training involves weight initialisation and an update after every iteration. The network wei…
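The sketch below is a minimal, assumed PyTorch rendering of the feed-forward network just described: an input layer, nonlinearly activated hidden layers, an output layer producing class scores, and one weight-update iteration. The layer sizes and optimiser settings are illustrative, not values specified by the paper.

```python
# Minimal feed-forward network sketch (input -> hidden layers -> output),
# assuming PyTorch; sizes and hyperparameters are illustrative only.
import torch
import torch.nn as nn


class FeedForwardNet(nn.Module):
    def __init__(self, in_features=784, hidden=128, num_classes=10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden),  # input layer -> first hidden layer
            nn.ReLU(),                       # nonlinear activation ("trigger")
            nn.Linear(hidden, hidden),       # hidden -> hidden
            nn.ReLU(),
            nn.Linear(hidden, num_classes),  # hidden -> output layer (class scores)
        )

    def forward(self, x):
        # Returns logits; class probabilities follow from a softmax,
        # applied here inside the cross-entropy loss.
        return self.layers(x)


# One training iteration: weights are initialised by nn.Linear, then updated
# from the gradient of the loss, as described in the text.
net = FeedForwardNet()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(net(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```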