CNN - Week 4

Week 4 covered face verification and recognition, the challenges involved, and how to do neural style transfer.

Face verification and recognition

  • Verification: is this person X?
  • Recognition: who is this person?

Even if verification accuracy is high, recognition compares a query against every person in the database, so the chance of a false positive grows with the number of possible people in the set. For example, a verifier that is 99% accurate per comparison will often produce at least one false match when checked against 100 people.

One shot learning

Often with face recognition you only have one image per person to learn from (e.g. a single employee photo), which is far too little data for conventional training.

A way to get around this is to train the network to learn a degree-of-difference function $d(\text{img}_1, \text{img}_2)$ between images, so that if $d(\text{img}_1, \text{img}_2) \le \tau$ then they are a match (the same person).

For a new image, you perform pairwise comparisons against all known persons in the database.

Another benefit of this approach, as opposed to a softmax multi-class classifier, is that adding new persons doesn't impact the difference ratings for the others, so their encodings remain unaffected and no retraining is needed.
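As a minimal sketch of this lookup, with `encode` as a hypothetical stand-in for the trained encoder and `tau` an illustrative threshold:

```python
import numpy as np

def encode(image):
    """Placeholder for a trained encoder f(x); returns a unit-length vector."""
    v = image.flatten()[:128].astype(np.float64)
    return v / (np.linalg.norm(v) + 1e-8)

def recognize(image, database, tau=0.7):
    """Return the closest known person if the distance is below tau, else None."""
    query = encode(image)
    best_name, best_dist = None, float("inf")
    for name, enc in database.items():
        dist = np.sum((query - enc) ** 2)  # squared L2 distance d(img1, img2)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < tau else None
```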

Siamese network

A network is put in place that outputs an encoding vector $f(x)$ for an input image. This output vector is then compared to the encoding produced by an identical network with the same weights (i.e. a Siamese twin…) to calculate the distance $\|f(x^{(1)}) - f(x^{(2)})\|^2$ between the encodings.
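A minimal sketch in PyTorch; the encoder architecture below is illustrative (not the one from the course), and the key point is that one set of weights produces both encodings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embedding_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)            # (batch, 64)
        return F.normalize(self.fc(h), dim=1)  # unit-length encoding f(x)

encoder = Encoder()                            # ONE set of weights...
img1, img2 = torch.randn(1, 3, 96, 96), torch.randn(1, 3, 96, 96)
d = ((encoder(img1) - encoder(img2)) ** 2).sum(dim=1)  # ...used for BOTH images
```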

Triplet loss

In order to learn the vector encoding of images, you can use the triplet loss function, which trains on triplets of images: an anchor image (A) of the person, a positive (P), which is another image of the same person, and a negative (N), which is an image of a different person.

The goal during training is to ensure:

$\|f(A) - f(P)\|^2 + \alpha \le \|f(A) - f(N)\|^2$

In order to avoid the network learning the trivial solution of always outputting zero, the margin parameter $\alpha$ is added to ensure a gap between the two distance function outputs. Rearranged, we get the loss function:

$\mathcal{L}(A, P, N) = \max\left( \|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha,\ 0 \right)$

The $\max$ means that as long as the inner value is at or below zero (the margin is satisfied), the loss is zero. Summed over the training set, the cost function is:

$J = \sum_{i=1}^{m} \mathcal{L}(A^{(i)}, P^{(i)}, N^{(i)})$
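A sketch of this in PyTorch, assuming `f_a`, `f_p`, `f_n` are batches of embeddings produced by the shared encoder:

```python
import torch

def triplet_loss(f_a, f_p, f_n, alpha=0.2):  # alpha: margin (illustrative value)
    d_ap = ((f_a - f_p) ** 2).sum(dim=1)     # d(A, P) per triplet
    d_an = ((f_a - f_n) ** 2).sum(dim=1)     # d(A, N) per triplet
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```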

Choosing triplets

When preparing the training set, it's good to choose triplets where the negative is hard to train on, i.e. where $d(A, N)$ is not much larger than $d(A, P)$.

Choosing triplets randomly means most triplets are very easy, which won't force the algorithm to learn features that are useful in the wild.
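A sketch of hard-negative selection, assuming precomputed `embeddings` and identity `labels` for a batch that contains more than one identity:

```python
import numpy as np

def hardest_negatives(embeddings, labels):
    # Pairwise squared distances between all embeddings in the batch
    dists = np.sum((embeddings[:, None] - embeddings[None, :]) ** 2, axis=2)
    negatives = []
    for i in range(len(labels)):
        candidates = np.where(labels != labels[i])[0]  # other identities only
        negatives.append(candidates[np.argmin(dists[i, candidates])])
    return np.array(negatives)  # index of the hardest negative per anchor
```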

Face verification and binary classification

An alternative to triplet loss is to generate encodings for two images that are then fed into a logistic output unit for binary classification: are the input images of the same person or not?
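A sketch of such a head, using element-wise absolute differences between the two encodings as the features for the logistic unit (one common choice, an assumption here):

```python
import torch
import torch.nn as nn

class SameFaceClassifier(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.logit = nn.Linear(embedding_dim, 1)

    def forward(self, f1, f2):
        features = torch.abs(f1 - f2)               # |f(x1)_k - f(x2)_k|
        return torch.sigmoid(self.logit(features))  # P(same person)
```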

Neural Style Transfer

Based on a content image (C) and a style image (S), a third image (G) is generated that applies the style of S to the content of C. In broad strokes:

  1. Initiate the generated image G randomly
  2. Use gradient descent to minimize $J(G) = \alpha\, J_{content}(C, G) + \beta\, J_{style}(S, G)$, updating the pixel values of G (a sketch of this loop follows below)
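As a minimal sketch of this loop in PyTorch, assuming hypothetical `content_cost(G)` and `style_cost(G)` helpers closed over the fixed C and S (possible implementations are sketched in the sections below), with illustrative values for $\alpha$, $\beta$ and the learning rate:

```python
import torch

def transfer(C, content_cost, style_cost, alpha=1.0, beta=100.0, steps=200):
    G = torch.randn_like(C, requires_grad=True)  # 1. random init of G
    optimizer = torch.optim.Adam([G], lr=0.02)   # gradient descent on pixels
    for _ in range(steps):
        optimizer.zero_grad()
        J = alpha * content_cost(G) + beta * style_cost(G)  # 2. minimize J(G)
        J.backward()
        optimizer.step()
    return G.detach()
```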

Cost functions

The content cost function is the squared difference in activations between the content image and the generated image at a chosen layer $l$ of a pretrained ConvNet:

$J_{content}(C, G) = \frac{1}{2} \left\| a^{[l](C)} - a^{[l](G)} \right\|^2$
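A minimal sketch, assuming `a_C` and `a_G` are the layer-$l$ activation tensors of the content and generated images:

```python
import torch

def content_cost(a_C, a_G):
    # Squared difference between layer-l activations of C and G
    return 0.5 * ((a_C - a_G) ** 2).sum()
```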

For the style cost function, “style” is defined as the correlation between activations across channels. Higher-level features correspond to things like patterns, textures and shapes; if certain features tend to occur together in the style image, they should also occur together in the generated image.

$a^{[l]}_{i,j,k}$ is the activation at position $(i, j)$ in channel $k$ of layer $l$. $G^{[l]}$ is an $n_C^{[l]} \times n_C^{[l]}$ matrix of channel correlations; for the style image:

$G^{[l](S)}_{kk'} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a^{[l](S)}_{ijk} \, a^{[l](S)}_{ijk'}$

…and the same, $G^{[l](G)}$, for the generated image.

These style matrices are also called Gram matrices, which is why the letter G is used for them (not to be confused with the generated image G).

…restating and adding a normalisation term:

$J^{[l]}_{style}(S, G) = \frac{1}{\left(2\, n_H^{[l]} n_W^{[l]} n_C^{[l]}\right)^2} \sum_{k} \sum_{k'} \left( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \right)^2$
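A sketch of the Gram matrix and the per-layer style cost, assuming activations of shape `(n_C, n_H, n_W)`:

```python
import torch

def gram(a):
    # a: activations of shape (n_C, n_H, n_W) for one image at layer l
    n_C, n_H, n_W = a.shape
    flat = a.reshape(n_C, n_H * n_W)  # one row per channel
    return flat @ flat.T              # G[k, k'] = sum_ij a_ijk * a_ijk'

def layer_style_cost(a_S, a_G):
    n_C, n_H, n_W = a_S.shape
    norm = (2.0 * n_H * n_W * n_C) ** 2
    return ((gram(a_S) - gram(a_G)) ** 2).sum() / norm
```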

Finally, in practice you usually get a more aesthetically pleasing result by combining multiple layers:

$J_{style}(S, G) = \sum_{l} \lambda^{[l]} J^{[l]}_{style}(S, G)$
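Continuing the sketch above (it reuses `layer_style_cost`), assuming dicts of activations keyed by layer name and hypothetical per-layer weights `lambdas`:

```python
def style_cost(acts_S, acts_G, lambdas):
    # acts_*: dict layer-name -> activation tensor; lambdas: layer-name -> weight
    return sum(lam * layer_style_cost(acts_S[name], acts_G[name])
               for name, lam in lambdas.items())
```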
