Hyperparameter Optimization with Factorized Multilayer Perceptrons

In machine learning, hyperparameter optimization is a challenging task that is usually approached by experienced practitioners or in a computationally expensive brute-force manner such as grid search. Therefore, recent research proposes to use observed hyperparameter performance on other data sets to speed up this search.

Introduction

Unfortunately, machine learning models are very rarely parameter-free: they usually contain a set of hyperparameters which have to be chosen appropriately on validation data. As a simple example, the number of latent variables in a matrix factorization cannot be determined using gradient descent: firstly, it is not explicitly given in the objective function, and secondly, it is a discrete rather than a continuous parameter. Additionally, the choice of kernel function for an SVM can also be understood as a hyperparameter for which gradient descent approaches fail. Besides being parameters of the learned model, hyperparameters can also be part of the objective function, such as regularization constants. Moreover, they can also be part of
the learning algorithm that is used to optimize the model for the objective function, for example the step length of a gradient-based technique or the threshold of a stopping criterion. Finally, even the choice of preprocessing can be viewed as a hyperparameter. Some of these hyperparameters are continuous, some are categorical, but what they all have in common is that there is no efficient learning algorithm for them. Therefore, many researchers rely on searching them on a grid, which is computationally very expensive, as with growing data and growing model complexity the optimization usually requires a lot of time.

The performance of a model on test data trained with specific hyperparameters depends on the data set on which the machine learning model is learned, and therefore hyperparameter optimization is usually started from scratch for each new data set. Thus, possibly valuable information about past hyperparameter performance on other data sets is ignored. Recent work proposes to use this information to perform hyperparameter optimization more efficiently and faster than before [2]. To accomplish this, the sequential model-based optimization (SMBO) framework is applied: first, a surrogate model is learned to predict hyperparameter performances; then an acquisition function is queried to choose the next hyperparameter configuration to test while maintaining a reasonable trade-off between exploration and exploitation. As the surrogate model's prediction can be computed in constant time, hyperparameters can be optimized in a controlled way, resulting in fewer runs of the actual learning algorithm until a promising configuration is found.

This paper targets the problem of hyperparameter learning and, more generally, model selection across different data sets. We propose to use a multilayer perceptron as the surrogate model and show how it can be learned to also include hyperparameter performances observed on data sets in the past. Additionally, we propose a factorized multilayer perceptron that contains a factorization part in the first layer of the network to directly model interactions between data set and hyperparameter features.
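To make the SMBO loop described above concrete, the following Python sketch runs it over a finite grid of candidate configurations. The surrogate interface (fit/predict returning a mean and a standard deviation) and the expected-improvement acquisition function are illustrative assumptions for this sketch, not necessarily the exact choices made in this paper.

# Minimal sketch of the SMBO loop (illustrative only; surrogate interface
# and acquisition function are assumptions, not the paper's exact choices).
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far):
    # EI for minimization: prefers candidates with low predicted error
    # (exploitation) or high predictive uncertainty (exploration).
    sigma = np.maximum(sigma, 1e-9)
    z = (best_so_far - mu) / sigma
    return (best_so_far - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def smbo(objective, candidates, surrogate, n_init=3, n_iter=20, seed=0):
    # `objective(x)` trains the model with hyperparameters x and returns a
    # validation error; `surrogate` is assumed to expose fit(X, y) and
    # predict(X) -> (mean, std), e.g. an MLP ensemble or a Gaussian process.
    rng = np.random.default_rng(seed)
    candidates = np.asarray(candidates, dtype=float)
    idx = rng.choice(len(candidates), size=n_init, replace=False)
    X = [candidates[i] for i in idx]              # evaluated configurations
    y = [objective(x) for x in X]                 # observed validation errors
    for _ in range(n_iter):
        surrogate.fit(np.asarray(X), np.asarray(y))
        mu, sigma = surrogate.predict(candidates)
        ei = expected_improvement(mu, sigma, min(y))
        x_next = candidates[int(np.argmax(ei))]   # most promising candidate
        X.append(x_next)
        y.append(objective(x_next))               # the only expensive step
    best = int(np.argmin(y))
    return X[best], y[best]

Because the loop only reruns the expensive objective on configurations the acquisition function deems promising, it typically needs far fewer evaluations than exhaustive grid search.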
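The factorized first layer can be sketched as follows. This is one plausible reading, assuming each hidden unit augments its linear pre-activation with a factorization-machine-style pairwise interaction term over the joint input of data set meta-features and hyperparameters; the exact parameterization used in the paper may differ.

# Sketch of a factorized first layer: each hidden unit adds a
# factorization-machine-style pairwise-interaction term over the joint
# (data set meta-features, hyperparameters) input. One plausible reading;
# the paper's exact parameterization may differ.
import numpy as np

def fm_pairwise(x, V):
    # sum_{i<j} <v_i, v_j> x_i x_j, computed in O(n * k) via the standard
    # factorization-machine identity instead of the naive O(n^2 * k) loop.
    xv = x @ V                                     # shape (k,)
    return 0.5 * (np.sum(xv ** 2) - np.sum((x[:, None] * V) ** 2))

def factorized_first_layer(x, W, b, V_list):
    # Linear pre-activation plus a per-unit factorized interaction term,
    # followed by a tanh nonlinearity; deeper layers stay standard.
    linear = W @ x + b                             # shape (n_hidden,)
    interactions = np.array([fm_pairwise(x, V) for V in V_list])
    return np.tanh(linear + interactions)

# Toy usage with made-up sizes: 8 input features (meta-features plus
# hyperparameters), 4 hidden units, 3 latent factor dimensions.
rng = np.random.default_rng(0)
n_in, n_hidden, k = 8, 4, 3
x = rng.normal(size=n_in)
W = rng.normal(scale=0.1, size=(n_hidden, n_in))
b = np.zeros(n_hidden)
V_list = [rng.normal(scale=0.1, size=(n_in, k)) for _ in range(n_hidden)]
print(factorized_first_layer(x, W, b, V_list))    # 4 hidden activations

The factorized term lets the network model multiplicative interactions between data set and hyperparameter features directly in the first layer, rather than forcing deeper layers to approximate them.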