Some interesting tricks not mentioned in the Deep Learning.ai course (based on recent papers), and apparently at time of recording at least only implemented in the fastAI library that sits on top of PyTorch.
Utilise SGD with Restarts
Referencing Cyclical Learning Rates for Training Neural Networks suggesting restarts for annealing learning rates, which should help get out of local saddle points during learning, and thus more quickly converge on the global minima.
Cycle multiplier
Adds a way to have slower learning rate decay for later cycles. This will mean we allow more oscillating in later epochs, as we hopefully already have approached a global minimum with a smooth area, that shouldn’t be very sensitive to specific minibatches.
Find the optimal learning rate
The above mentioned paper also suggests an approach to find the optimal learning rate in a systematic way.
Differential Learning rates
When training for example an CNN that starts from a pretrained model, we might want to unfreeze and finetune the earlier layers as well at some point to ensure they are adjusted to our specific problem.
Since these already have been pretrained on a (probably) much bigger dataset, we don’t want to allow them to be changed to quickly, compared to more high level features later on in the network.
Thus, the fast.ai library supports differential learning rates so you can pass in an array [lr1, lr2, lr3]
where the number of items determines the split for the network, so that the layers falling into 100/n
will use lr_n
.
Variable input size
For CNN, in order to speed up training a suggestion is to train on low resolution versions of the images initially, and then scale up to larger versions. This also helps avoid overfitting and can be seen as a type of data augmentation.
Checklist of easy steps to train a world-class image classifier
- precompute=True, i.e. don’t recalculate all the values in the frozen layer upon each iteration but use the precomputed outputs while training…
- Use lr_find() to find highest learning rate where loss is still clearly improving
- Train last layer(s) from precomputed activations for 1-2 epochs
- Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
- Unfreeze all layers
- Set earlier layers to 3x-10x lower learning rate than next higher layer
- Use lr_find() again
- Train full network with cycle_mult=2 until over-fitting
Comments