FastAI - Notes - Deep Learning week 1-2

Some interesting tricks not mentioned in the deeplearning.ai course (based on recent papers) which, at least at the time of recording, were apparently only implemented in the fastai library that sits on top of PyTorch.

Utilise SGD with Restarts

References the paper Cyclical Learning Rates for Training Neural Networks, which suggests periodically restarting the learning rate while annealing it. The restarts should help the optimiser get out of saddle points and poor local minima during training, and thus converge more quickly towards the global minimum.
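
A minimal sketch of this kind of schedule using PyTorch's built-in CosineAnnealingWarmRestarts scheduler (the model, data and hyperparameters here are placeholders, not the course's exact setup):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(10, 2)                       # placeholder model
loss_fn = nn.CrossEntropyLoss()
opt = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Cosine-anneal the LR from 1e-2 towards 0 over each cycle, then "restart"
# by jumping back up to 1e-2. T_0 is the cycle length in epochs.
sched = CosineAnnealingWarmRestarts(opt, T_0=1)

epochs, batches_per_epoch = 3, 100
for epoch in range(epochs):
    for i in range(batches_per_epoch):
        x = torch.randn(32, 10)                # placeholder mini-batch
        y = torch.randint(0, 2, (32,))
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Step per mini-batch with a fractional epoch index so the
        # cosine curve (and the restart) is smooth within an epoch.
        sched.step(epoch + i / batches_per_epoch)
```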

Cycle multiplier

Adds a way to slow down the learning-rate annealing in later cycles by making each cycle longer than the previous one (e.g. with a multiplier of 2: cycles of 1, 2, 4, ... epochs). This means we still allow the learning rate to oscillate in later epochs, which should be fine because by then we have hopefully already approached a global minimum in a smooth region that isn't very sensitive to specific minibatches.
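
As I recall from the course notebooks this is the cycle_mult argument of the fastai (0.7-era) fit call; the same effect in the plain-PyTorch sketch above is the scheduler's T_mult argument:

```python
# fastai 0.7-era API, from memory of the course notebooks:
# 3 cycles of 1, 2 and 4 epochs (7 epochs in total).
# learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)

# Equivalent idea with the PyTorch scheduler from the previous sketch:
# each restart cycle lasts twice as long as the one before it.
sched = CosineAnnealingWarmRestarts(opt, T_0=1, T_mult=2)
```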

Find the optimal learning rate

The above-mentioned paper also suggests a systematic way to find a good learning rate: start with a very small rate, increase it a little after every mini-batch while recording the loss, and pick a rate from the region where the loss is still decreasing steeply, just before it starts to blow up. In fastai this is what lr_find() does.
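
A rough sketch of that range test in plain PyTorch (model and data are placeholders; this is the general idea rather than fastai's exact implementation):

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)                        # placeholder model
loss_fn = nn.CrossEntropyLoss()
opt = optim.SGD(model.parameters(), lr=1e-7)    # start with a tiny LR

start_lr, end_lr, num_steps = 1e-7, 10.0, 100
gamma = (end_lr / start_lr) ** (1 / num_steps)  # multiplicative LR increase

lrs, losses, best = [], [], float("inf")
for step in range(num_steps):
    x = torch.randn(32, 10)                     # placeholder mini-batch
    y = torch.randint(0, 2, (32,))
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

    lrs.append(opt.param_groups[0]["lr"])       # LR used for this step
    losses.append(loss.item())
    best = min(best, loss.item())
    if loss.item() > 4 * best:                  # stop once the loss explodes
        break
    for g in opt.param_groups:                  # exponentially increase the LR
        g["lr"] *= gamma

# Plot losses against lrs on a log-x axis and pick a learning rate where the
# loss is still dropping steeply, a bit below the point where it rises again.
```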

Differential learning rates

When training, for example, a CNN that starts from a pretrained model, we might at some point want to unfreeze and fine-tune the earlier layers as well, to make sure they are adjusted to our specific problem.

Since these layers have already been pretrained on a (probably) much bigger dataset, we don’t want to allow them to change too quickly compared to the higher-level features later in the network.

Thus, the fast.ai library supports differential learning rates: you can pass in an array [lr1, lr2, lr3], and the network is split into as many layer groups as there are entries (earliest layers first), with the n-th group trained using lr_n.
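
In plain PyTorch the same idea can be expressed with per-parameter-group learning rates; the three-way split below is a made-up example for a torchvision ResNet-34 backbone, not fastai's exact layer grouping:

```python
import torch
from torch import nn, optim
from torchvision import models

model = models.resnet34(pretrained=True)          # pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)     # new head for our problem

# Split the network into three groups, earliest layers first,
# roughly mirroring fastai's [lr1, lr2, lr3] array.
early = nn.Sequential(model.conv1, model.bn1, model.layer1, model.layer2)
middle = nn.Sequential(model.layer3, model.layer4)
head = model.fc

opt = optim.SGD([
    {"params": early.parameters(),  "lr": 1e-4},  # lr1: barely touch these
    {"params": middle.parameters(), "lr": 1e-3},  # lr2
    {"params": head.parameters(),   "lr": 1e-2},  # lr3: train the head fastest
], momentum=0.9)
```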

Variable input size

For CNNs, one suggestion to speed up training is to initially train on low-resolution versions of the images and then scale up to larger versions later on. This also helps avoid overfitting and can be seen as a type of data augmentation.
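
A minimal sketch of this progressive-resizing idea with torchvision (the dataset path, image sizes and epoch counts are placeholders, not values from the course):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

def make_loader(image_size, batch_size=64):
    # Rebuild the dataloader at a given image resolution.
    tfms = transforms.Compose([
        transforms.Resize((image_size, image_size)),
        transforms.ToTensor(),
    ])
    ds = datasets.ImageFolder("data/train", transform=tfms)  # placeholder path
    return DataLoader(ds, batch_size=batch_size, shuffle=True)

model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
opt = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def train(loader, epochs):
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Start cheap and small, then keep training the same model at full size.
train(make_loader(image_size=64),  epochs=3)
train(make_loader(image_size=224), epochs=3)
```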

Checklist of easy steps to train a world-class image classifier

  1. Start with precompute=True, i.e. don’t recalculate the activations of the frozen layers on every iteration but reuse their precomputed outputs while training the new final layers
  2. Use lr_find() to find highest learning rate where loss is still clearly improving
  3. Train last layer(s) from precomputed activations for 1-2 epochs
  4. Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
  5. Unfreeze all layers
  6. Set earlier layers to a 3x-10x lower learning rate than the next-higher layer group
  7. Use lr_find() again
  8. Train full network with cycle_mult=2 until over-fitting
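
Putting the checklist together, roughly, in the fastai 0.7-era API: the calls below are from memory of the course notebooks rather than a current API, and PATH (a folder with train/ and valid/ subfolders), sz and bs are placeholders.

```python
import numpy as np
from fastai.conv_learner import *  # fastai 0.7-era imports

arch = resnet34
sz, bs = 224, 64
data = ImageClassifierData.from_paths(
    PATH, tfms=tfms_from_model(arch, sz, aug_tfms=transforms_side_on), bs=bs)

# 1. precompute=True: cache the frozen layers' activations
learn = ConvLearner.pretrained(arch, data, precompute=True)

# 2. find a good learning rate, 3. train the new head for a couple of epochs
learn.lr_find()
learn.fit(1e-2, 2)

# 4. switch on data augmentation and train with restarts (cycle_len=1)
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1)

# 5./6. unfreeze everything and use differential learning rates
learn.unfreeze()
lrs = np.array([1e-4, 1e-3, 1e-2])

# 7./8. re-run lr_find, then train the full network with cycle_mult=2
learn.lr_find(lrs / 1000)
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
```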
