CNN - Week 2

ResNets, 1x1 convolutions, Inception Networks and various practical advice.

ResNets

A residual net lets information from early in the network feed forward to a later layer via a shortcut (skip connection), so that layer's activation becomes a^[l+2] = g(z^[l+2] + a^[l]).

Helps counteract vanishing gradients in deep networks. And since regularisation can shrink the extra layers' weights towards zero, in which case the block just passes a^[l] through unchanged, adding a residual block will either help or be neutral compared to a simpler network.

If the input and output have different dimensions, the shortcut becomes a^[l+2] = g(z^[l+2] + W_s a^[l]), where W_s is a matrix that ensures the shortcut term matches the expected dimensions. Usually, though, 'same' convolutions are used in ResNets to preserve dimensions so this can be avoided, though that of course does not apply to pooling layers.
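As a concrete illustration, here is a minimal sketch of a residual block, assuming Keras (the framework choice and filter sizes are my own, not from the course). When the dimensions change, a 1x1 convolution on the shortcut plays the role of W_s.

```python
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Main path of two convolutions plus a shortcut: a[l+2] = g(z[l+2] + a[l])."""
    shortcut = x

    # Main path: conv -> BN -> ReLU -> conv -> BN.
    out = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    out = layers.BatchNormalization()(out)
    out = layers.Activation("relu")(out)
    out = layers.Conv2D(filters, 3, strides=1, padding="same")(out)
    out = layers.BatchNormalization()(out)

    # If the spatial size or channel count changed, project the shortcut
    # with a 1x1 convolution (the W_s matrix mentioned above).
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
        shortcut = layers.BatchNormalization()(shortcut)

    # Add the skip connection before the final non-linearity.
    out = layers.Add()([out, shortcut])
    return layers.Activation("relu")(out)
```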

1x1 Convolutions

Can be used to shrink the number of channels. Instead of spanning a spatial region, a 1x1 filter spans all the channels at a single position, so each filter has shape 1 x 1 x n_c, where n_c is the number of channels in the input.

Can also be used to just add non-linearity across the channels, by outputting the same number of channels as it took as input.
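A quick sketch of both uses in Keras (the layer sizes are made up for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Example input: a 28x28 feature map with 192 channels.
inputs = tf.keras.Input(shape=(28, 28, 192))

# Shrink the channel dimension from 192 to 32: each of the 32 filters
# has shape 1 x 1 x 192 and is applied at every spatial position.
shrunk = layers.Conv2D(32, kernel_size=1, activation="relu")(inputs)      # -> (28, 28, 32)

# Keep 192 channels and just add a non-linearity across the channels.
same_size = layers.Conv2D(192, kernel_size=1, activation="relu")(inputs)  # -> (28, 28, 192)
```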

Inception Networks

Refers to running filters of multiple sizes in parallel, and even pooling (using padding to preserve output dimensions), then stacking the results along the channel dimension.

One challenge with inception networks is the computational cost. Adding an intermediate 1x1 convolution (a bottleneck layer) to scale down the number of channels beforehand reduces this problem.
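A hedged sketch of one inception module, again assuming Keras; the per-branch filter counts are parameters you would pick per module (the names f1, f3_reduce, etc. are my own):

```python
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Parallel 1x1, 3x3, 5x5 and pooling branches, concatenated on channels."""
    # Plain 1x1 branch.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)

    # 1x1 bottleneck to cut cost, then the 3x3 convolution.
    b3 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)

    # 1x1 bottleneck, then the 5x5 convolution.
    b5 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)

    # Max pooling with 'same' padding and stride 1 so spatial dimensions are
    # preserved, followed by a 1x1 convolution to control the channel count.
    bp = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    bp = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(bp)

    # Stack all branches along the channel axis.
    return layers.Concatenate()([b1, b3, b5, bp])
```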

Practical Advice

Using Open Source

Many models described in papers contain a lot of small details and hyperparameter choices that are hard to replicate from scratch. It is therefore often best to start from an open-source implementation of the architecture in question.

Transfer Learning

When downloading an open source model, you can also download pre-trained weights on a relevant dataset for that model. Then replace the softmax output layer with a new one that you train on your own (limited) dataset.

Deep learning frameworks usually let you freeze layers so that training only updates the output layer, and possibly the last n layers you have chosen not to freeze. Similarly, since the frozen intermediary layers are fixed for your training set, a framework may let you skip recomputing them on every pass and instead train directly on the activations of the last frozen layer, computed once.
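For example, a minimal transfer-learning sketch in Keras, assuming an ImageNet-pretrained VGG16 base and a small custom dataset (both are just illustrative choices):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Pre-trained convolutional base (ImageNet weights), without its softmax head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze all pre-trained layers

num_classes = 5  # example value for your own (limited) dataset

# New head: pooling plus a softmax over your own classes.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```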

Alternatively, store the output activations of the last frozen layer to disk, train a simple softmax model using those activations as training data, and then merge it with the full model.
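A sketch of that variant, reusing base and num_classes from the example above; train_images and train_labels are placeholders for your own data:

```python
import numpy as np
import tensorflow as tf

# Run the frozen base once over the training images and cache the activations.
features = base.predict(train_images)      # e.g. shape (m, 7, 7, 512) for VGG16
np.save("frozen_features.npy", features)   # persist so they never need recomputing

# Small model trained directly on the cached activations.
head = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling2D(input_shape=features.shape[1:]),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
head.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
head.fit(np.load("frozen_features.npy"), train_labels, epochs=10)
```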

Data Augmentation

For computer vision, more data will almost always improve results. One way to do this dynamically during training is to let one CPU thread handle distorting the loaded images into minibatches, which then get sent to the GPU (or another CPU thread) for training. This way the two processes can run in parallel.

The algorithm used to distort the input data has its own set of hyperparameters that can be tuned as well. As with models, it is often good to start from an existing open-source implementation of a data augmenter.
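A sketch of such a pipeline using tf.data, where the distortions run on CPU threads in parallel with GPU training; the specific distortions and their ranges are the hyperparameters mentioned above, and train_images / train_labels are placeholders:

```python
import tensorflow as tf

def distort(image, label):
    """Random distortions; the ranges here are hyperparameters to tune."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image, label

# CPU threads distort images in parallel and prefetch the next minibatch
# while the GPU trains on the current one.
train_ds = (tf.data.Dataset.from_tensor_slices((train_images, train_labels))
            .shuffle(1024)
            .map(distort, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))
```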

State of Computer Vision

More data generally leads to simpler algorithms producing good results. For problems where less data is available, more hand engineering of features is still needed.

The common lack of data for computer vision tasks is one reason transfer learning is so popular in this field.

Tips for competitions/benchmarks

Some things that might push benchmark performance can still be detrimental to a system meant to go into production (due to runtime cost, etc.). These include:

  • Ensembling multiple models (even of the same network architecture but with different random weight initialisations)

  • Multi-crop at test time (e.g. 10-crop): run the classifier on several crops and mirrors of each test image and average the predictions, essentially applying data augmentation to the test set (see the sketch below)
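A sketch of a simple 10-crop evaluation (four corners plus the centre, each with its horizontal mirror); the crop size and averaging scheme are assumptions:

```python
import tensorflow as tf

def ten_crop_predict(model, image, crop_size=224):
    """Average a model's predictions over 5 crops of an image and their mirrors."""
    h, w = image.shape[0], image.shape[1]
    ch, cw = crop_size, crop_size
    offsets = [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
               ((h - ch) // 2, (w - cw) // 2)]  # four corners + centre

    crops = []
    for top, left in offsets:
        crop = image[top:top + ch, left:left + cw, :]
        crops.append(crop)
        crops.append(tf.image.flip_left_right(crop))  # mirrored crop

    # One forward pass over the 10 crops, then average the softmax outputs.
    batch = tf.stack(crops)
    return tf.reduce_mean(model(batch, training=False), axis=0)
```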
