Why over-parameterization of deep neural networks does not overfit?

PERSPECTIVE

January 2021, Vol. 64 116101:1–116101:3 https://doi.org/10.1007/s11432-020-2885-6

Zhi-Hua ZHOU
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Received 11 April 2020 / Accepted 15 April 2020 / Published online 14 September 2020

Citation Zhou Z-H. Why over-parameterization of deep neural networks does not overfit? Sci China Inf Sci, 2021, 64(1): 116101, https://doi.org/10.1007/s11432-020-2885-6

Deep neural networks often come with a huge number of parameters, sometimes even larger than the number of training examples, yet these over-parameterized models do not seem to suffer from overfitting. This is quite strange, and the question of why over-parameterization does not lead to overfitting concerns one of the fundamental mysteries behind the success of deep neural networks. In conventional machine learning theory, let H denote the hypothesis space and m the size of a training set of i.i.d. samples; then the gap between the generalization error and the empirical error is often bounded by $O(\sqrt{|H|/m})$, where |H| measures the complexity of the hypothesis space. If the whole hypothesis space representable by a deep neural network is considered, the numerator grows with the parameter count (roughly depth × width), which can be even larger than the denominator, leading to vacuous bounds. Thus, many studies resorted to considering a relevant subset of the hypothesis space, e.g., by introducing the implicit bias of specific algorithms, such as the norms controlled by stochastic gradient descent (SGD) [1, 2]. The results, however, were not that satisfactory, and recently there have even been claims that conventional learning theory cannot explain the generalization of deep neural networks, even if the implicit bias of specific algorithms is taken into account to the fullest extent possible [3].

Although many of these arguments may have their own groundings, we feel that an important fact should be noticed: conventional learning theory is mostly concerned with training a learner, or more specifically a classifier in classification tasks, from a given feature space, and says little about the construction of the feature space itself. Therefore, conventional learning theory can be exploited to understand the behavior of generalization, but one must be careful when applying it to representation learning.

It is well known that deep neural networks accomplish end-to-end learning by integrating feature learning with classifier training. As illustrated in Figure 1(a), a deep neural network can be decomposed into two parts: the first part is devoted to feature space transformation, i.e., converting the original feature space represented by the input layer into the final feature space represented by the final representation layer, in which a classifier is constructed.
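To make this decomposition concrete, here is a minimal PyTorch-style sketch. The layer widths, the training-set size m = 60,000, and the names feature_transform and classifier are illustrative assumptions rather than details from this article; the point is only that the parameter count of the whole network easily exceeds m, which makes a bound of order $O(\sqrt{|H|/m})$ over the full hypothesis space vacuous, whereas the classifier-construction (CC) part alone is governed by the width of the final representation layer.

```python
import torch.nn as nn

def num_params(module: nn.Module) -> int:
    """Count the parameters of a module."""
    return sum(p.numel() for p in module.parameters())

# Part 1: feature space transformation, i.e., mapping the original feature
# space (input layer) to the final feature space (final representation layer).
feature_transform = nn.Sequential(
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 128), nn.ReLU(),   # final representation layer: 128 units
)

# Part 2: classifier construction (CC) on top of the learned representation.
classifier = nn.Linear(128, 10)

m = 60_000  # an assumed training-set size, e.g., comparable to MNIST

print("total parameters     :", num_params(feature_transform) + num_params(classifier))
print("classifier parameters:", num_params(classifier))
print("training examples m  :", m)
# The whole network has roughly 2 million parameters (>> m), so a bound that
# grows with the full parameter count is vacuous here, while the CC part alone
# has only 128*10 + 10 = 1290 parameters (<< m).
```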

First, let's focus on the classifier construction (CC) part in Figure 1(a), where the number of parameters depends on the number of units in the final representation layer