Deep Networks with Stochastic Depth
Gao Huang¹, Yu Sun¹, Zhuang Liu², Daniel Sedra¹, Kilian Q. Weinberger¹
¹ Cornell University, Ithaca, USA {gh349,ys646,dms422,kqw4}@cornell.edu
² Tsinghua University, Beijing, China
Abstract. Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91 % on CIFAR-10).
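To make the training procedure concrete, below is a minimal PyTorch-style sketch, not code from the paper; the names StochasticDepthBlock, residual_fn, and survival_prob are illustrative assumptions. During training, each mini-batch keeps a block's residual branch with probability survival_prob and otherwise bypasses it with the identity; at test time the branch is always applied, scaled by its survival probability so that expected outputs match training.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Illustrative residual block that is randomly bypassed during training.

    Sketch only (not the paper's reference implementation). `residual_fn`
    stands in for the block's conv-BN-ReLU branch; `survival_prob` is the
    probability that the branch is kept for a given mini-batch.
    """
    def __init__(self, residual_fn, survival_prob=0.8):
        super().__init__()
        self.residual_fn = residual_fn
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            # Training: with probability (1 - survival_prob) skip the residual
            # branch entirely and pass the input through unchanged.
            if torch.rand(1).item() < self.survival_prob:
                return x + self.residual_fn(x)
            return x
        # Test time: always apply the branch, scaled by its survival
        # probability, so expected activations match those seen in training.
        return x + self.survival_prob * self.residual_fn(x)
```

In the paper, the survival probabilities decay linearly with depth, so blocks closer to the output are dropped more often than blocks near the input.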
1 Introduction
Convolutional Neural Networks (CNNs) were arguably popularized within the vision community in 2012 through AlexNet [1] and its celebrated victory at the ImageNet competition [2]. Since then there has been a notable shift towards CNNs in many areas of computer vision [3–8]. As this shift unfolds, a second trend emerges: deeper and deeper CNN architectures are being developed and trained. Whereas AlexNet had 5 convolutional layers [1], the VGG network and GoogLeNet in 2014 had 19 and 22 layers respectively [5,7], and most recently the ResNet architecture featured 152 layers [8]. Network depth is a major determinant of model expressiveness, both in theory [9,10] and in practice [5,7,8]. However, very deep models also introduce new challenges: vanishing gradients in backward propagation, diminishing feature reuse in forward propagation, and long training time.
G. Huang and Y. Sun contributed equally.
Vanishing Gradients is a well-known nuisance in neural networks with many layers [11]. As the gradient information is back-propagated, repeated multiplication or convolution with small weights renders the gradient information ineffectively small in earlier layers. Several approaches exist to reduce this effect in practice, for example through careful initialization [12], hidden layer supervision [13], or, recently, Batch Normalization [14].

Diminishing feature reuse during forward propagation (also known as loss in information flow [15]) refers to the analogous problem to vanishing gradients in the forward direction. The features of the input instance, or those computed by earlier layers, are gradually "washed out" through repeated multiplication or convolution with (randomly initialized) weight matrices.
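As a rough illustration (not from the paper), the toy NumPy snippet below shows the forward analogue of this effect: repeatedly multiplying a signal by small, randomly initialized weight matrices drives its norm toward zero, and back-propagated gradients shrink the same way because they pass through the transposes of the same matrices. The depth, width, and weight scale are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64

x = rng.normal(size=width)                        # input features
norms = [np.linalg.norm(x)]
for _ in range(depth):
    W = 0.05 * rng.normal(size=(width, width))    # small random weights
    x = W @ x                                     # one linear "layer", no nonlinearity
    norms.append(np.linalg.norm(x))

# The signal norm collapses by many orders of magnitude with depth,
# mirroring how gradient magnitudes vanish in the backward pass.
print(f"norm at layer 0:  {norms[0]:.3e}")
print(f"norm at layer {depth}: {norms[-1]:.3e}")
```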