Cyclical Learning Rates for Training Neural Networks
18 Mar 2018

Introduction

Conventional wisdom says that when training neural networks, the learning rate should monotonically decrease. This insight forms the basis of the different types of adaptive learning rate methods.

Counter to this expected behaviour, the paper demonstrates that using a cyclical learning rate (CLR), which varies between a minimum and a maximum value, helps to train the neural network faster without requiring fine-tuning of the learning rate.
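To make this concrete, here is a minimal sketch of the triangular CLR policy described in the paper, where the learning rate ramps linearly from the lower bound to the upper bound and back over each cycle. The names `base_lr`, `max_lr`, and `step_size` follow the paper; the example values at the bottom are made up for illustration.

```python
import math

def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate."""
    # Which cycle we are in; each cycle spans 2 * step_size iterations.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x goes 1 -> 0 -> 1 over one cycle, so the rate ramps up then back down.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# Illustrative values: with step_size=500, the rate peaks at iteration 500.
print(triangular_clr(0, 0.001, 0.006, 500))    # 0.001 (lower bound)
print(triangular_clr(500, 0.001, 0.006, 500))  # 0.006 (upper bound)
print(triangular_clr(1000, 0.001, 0.006, 500)) # 0.001 (back to lower bound)
```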

The paper also provides a simple approach to estimate the lower and upper bound for CLR.
Intuition

Difficulty in minimizing the loss arises from saddle points and not from local minima. [Ref]

Increasing the learning rate allows for rapid traversal of saddle points.

Alternatively, since the optimal learning rate is expected to lie between the bounds of the CLR, the learning rate used during training would always stay close to the optimal value.
Parameter Estimation

Cycle Length = Number of iterations until the learning rate returns to its initial value = 2 * step_size

step_size should be set to 2-10 times the number of iterations in an epoch.
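For example, assuming a hypothetical dataset of 50,000 training samples and a batch size of 100 (both made-up numbers), the arithmetic would look like:

```python
# Hypothetical numbers, purely for illustration.
num_train_samples = 50_000
batch_size = 100

iterations_per_epoch = num_train_samples // batch_size  # 500
step_size = 4 * iterations_per_epoch                    # 2,000 (within the 2-10x range)
cycle_length = 2 * step_size                            # 4,000 iterations per full cycle
```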

Estimating the CLR boundary values:

Run the model for several epochs while linearly increasing the learning rate between the allowed low and high values (a code sketch follows this list).

Plot accuracy vs learning rate and note the learning rate values at which the accuracy starts to fall.

This gives good candidate values for the upper and lower bounds. Alternatively, the lower bound could be set to 1/3 or 1/4 of the upper bound. But it is difficult to judge whether the model has run for a sufficient number of epochs in the first place.
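A minimal sketch of this range test, assuming PyTorch; the linear model and random batches are placeholders for a real training setup, and the values of `low`, `high`, and `num_steps` are made up:

```python
import torch
import torch.nn as nn

# Toy model and synthetic batches stand in for a real training setup.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

low, high, num_steps = 1e-5, 1.0, 100
history = []
for i in range(num_steps):
    # Linearly increase the learning rate from `low` to `high`.
    lr = low + (high - low) * i / (num_steps - 1)
    for group in optimizer.param_groups:
        group["lr"] = lr

    # One training step on a synthetic batch.
    x = torch.randn(32, 10)
    y = torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # Record (learning rate, accuracy) pairs for the plot.
    with torch.no_grad():
        accuracy = (model(x).argmax(dim=1) == y).float().mean().item()
    history.append((lr, accuracy))
```

Plotting `history` then shows where accuracy starts to fall, giving the candidate bounds described above.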

Notes
 The idea in itself is very simple and straightforward to add to any existing model, which makes it very appealing.
 The author has experimented with various architectures and datasets (from the vision domain) and has reported faster training results.