Neural Style Transfer in a Nutshell

Learn what makes this fascinating technology work.

Introduction

Neural Style Transfer (NST) is a technique that applies the style of one image to another while preserving the original image's content. The result is a combined image: the content of the content image painted or drawn in the style of the style image. Probably the most famous example used to demonstrate style transfer is van Gogh's Starry Night. Look at the pictures below and see how the style of Starry Night has been applied to the content image:

Enthusiasts around the world have applied the style of van Gogh’s masterpiece to many of their own pictures. Let’s face it, not everyone is a born artist, and getting some help from the old masters is much appreciated. The resulting images can be used for decorations, as a profile picture, or in marketing and advertisement. It is even possible to paint a whole movie in a certain style by stylizing it frame by frame.

Neural style transfer (NST)

The technique of neural style transfer was first published in 2015 in the paper A Neural Algorithm of Artistic Style by Leon A. Gatys et al. Their version of neural style transfer uses a feature extraction network and three images: a content image, a style image, and an input image. Think of the input image as a blank canvas that will become the stylized image during the process. At the very beginning, the input image is initialized with white noise. That is, every pixel in the image has a random color. The feature extraction network is a pre-trained image classification network that contains several convolutional layers.

Image with White Noise
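
As a minimal sketch, the three images could be prepared like this with PyTorch and torchvision; the file names and the 224×224 size are placeholders, not part of the original article:

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),   # scales pixel values to [0, 1]
])

content_img = preprocess(Image.open("content.jpg")).unsqueeze(0)  # shape (1, 3, 224, 224)
style_img = preprocess(Image.open("style.jpg")).unsqueeze(0)

# The input image starts as white noise: every pixel gets a random value.
# requires_grad=True because the pixels themselves will be optimized.
input_img = torch.rand_like(content_img).requires_grad_(True)
```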

How Neural Style Transfer Works

  • First, the style image runs through the feature extraction network, and its style values are measured and saved.
  • Then the content image runs through the feature extraction network, and its content values are measured and saved.
  • Now an iterative process starts:
    • The input image runs through the feature extraction network.
    • Its content and style values are measured and compared to the saved content and style values from before.
    • From the measured values and the saved values, the content loss and style loss are calculated.
    • The content loss and style loss are combined into a total loss.
    • An optimization algorithm such as stochastic gradient descent (SGD) is used to optimize the input image pixel by pixel.
    • The iterative process continues until the total loss value saturates.

The result is a single stylized image. The diagram below shows the stylizing process of an image after the content and style baselines have been determined:

Original Neural Style Transfer
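
Put into code, this iterative process might look roughly like the following sketch. It assumes PyTorch, the tensors from the sketch above, and hypothetical extract_features, content_loss and style_loss helpers that are sketched further down in this article:

```python
import torch

content_targets = extract_features(content_img)  # measured once, before the loop
style_targets = extract_features(style_img)      # measured once, before the loop

# Adam is used here for faster convergence; plain SGD as mentioned above works too.
optimizer = torch.optim.Adam([input_img], lr=0.02)

for step in range(500):                     # iterate until the total loss saturates
    optimizer.zero_grad()
    features = extract_features(input_img)  # run the input image through the network
    total_loss = content_loss(features, content_targets) + style_loss(features, style_targets)
    total_loss.backward()                   # gradients with respect to the input pixels
    optimizer.step()                        # adjust the input image pixel by pixel
    input_img.data.clamp_(0, 1)             # keep pixel values in a valid range
```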

Feature Extraction Network

What makes style transfer so interesting (besides the awesome pictures it can create) is how it demonstrates the capabilities and internal representations of neural networks. As we have seen in the example above, the feature extraction network plays an important role in determining the content and style values of an image. The feature extraction network is a pre-trained image classifier whose weights are frozen during the styling process; the network itself is not trained at all. A classifier that is often used in style transfer is VGG19, trained on ImageNet with its 1000 classes. The straightforward architecture of the VGG networks makes them a good candidate for style transfer applications: they stack several CNN layers on top of each other with no shortcuts in between.

VGG-19 Image Classifier: block_1 (2 CNN layers, 224×224×64), block_2 (2 CNN layers, 112×112×128), block_3 (4 CNN layers, 56×56×256), block_4 (4 CNN layers, 28×28×512) and block_5 (4 CNN layers, 14×14×512), each followed by maxpool, then FC1 (4096), FC2 (4096) and softmax (1000)

The image classifier contains many convolutional layers that are great for detecting patterns and objects in images. While the first layers tend to learn low-level features such as edges or corners, the deeper layers learn more and more complex structures. The last few layers in VGG19 (FC1, FC2 and softmax) are only needed for classification and have no use in style transfer, so it is safe to leave them out.

When an image runs through the network, the activations in the convolutional layers can be used to measure the content and style of that image. The content and style losses are then calculated by comparing these measurements with the content and style values determined earlier for the content and style images.
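
As a rough sketch, such a frozen VGG19 feature extractor could be set up with PyTorch and torchvision as shown below. The helper name extract_features is an assumption used throughout the sketches in this article; the layer indices follow torchvision's ordering of the VGG19 convolutional layers:

```python
import torch
from torchvision import models

# Load VGG19 pre-trained on ImageNet and keep only the convolutional part;
# the FC1, FC2 and softmax layers are not needed for style transfer.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# Freeze the classifier's weights: the network is never trained here.
for param in vgg.parameters():
    param.requires_grad_(False)

# Indices into vgg's layer list for the activations used later:
# conv1_2=2, conv2_2=7, conv3_3=14, conv4_1=19, conv5_1=28.
LAYERS = {2: "conv1_2", 7: "conv2_2", 14: "conv3_3", 19: "conv4_1", 28: "conv5_1"}

def extract_features(image):
    """Run an image through VGG19 and collect the activations of selected layers."""
    features = {}
    x = image
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in LAYERS:
            features[LAYERS[idx]] = x
    return features
```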

Content Loss

When an image runs through the network, it activates filters in the convolutional layers. The content values are measured as the activations of each filter in a convolutional layer. The content loss is then calculated as the Euclidean distance between the input image's measured content values and the content image's saved values. If two images activate the same features in the network, their content must be similar.

The content loss can be calculated as follows:

L_{content} = \sum_{l}\sum_{i,j}(\alpha C_{i,j}^{stylized, l} - \alpha C_{i,j}^{content, l})^2

But what values are actually used in this equation, and where do they come from? The alpha is just a hyperparameter that controls how much weight the content loss gets in the total loss. The l is the CNN layer in the VGG network. The i is the feature map channel: each feature map in a CNN layer forms one channel. The j is the position within the channel.

Deriving the Content Value from the VGG-19 Image Classifier: the 128 feature maps (112×112) of CNN layer l=4 (conv2_2) are flattened, with channel i running from 1 to 128 and position j from 1 to 12,544 (112×112), giving the content value C^l

Experiments have shown that taking the activations from block 2, layer 2 (or conv2_2 for short) generates good results.
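
A possible sketch of the content loss, using the hypothetical extract_features helper from the previous section and taking only the conv2_2 activations; alpha is the content weight from the formula above:

```python
def content_loss(features, content_targets, alpha=1.0, layer="conv2_2"):
    """Squared Euclidean distance between the input image's activations and the
    content image's saved activations, measured at a single layer (conv2_2)."""
    diff = alpha * features[layer] - alpha * content_targets[layer]
    return (diff ** 2).sum()
```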

Style Loss

Deriving the style loss is a bit more complex, but the principle is the same. A style value is derived from the feature activations in multiple convolutional layers in the form of a Gram matrix. The style loss is then calculated from these Gram matrices and the ones saved for the style image.

The Gram matrix is defined as follows:

G_{i,j}^l = \sum_{k} F_{i,k}^l F_{j,k}^l

Below is an example of how the Gram matrix is derived for layer l=2. The same is done for a few more layers in the VGG network to calculate the total style loss.

Deriving the Gram Matrix from the VGG-19 Image Classifier for layer l=2: the 64 feature maps (224×224) of CNN layer l=2 (conv1_2) are flattened into a matrix M, with channel i running from 1 to 64 and position j from 1 to 50,176 (224×224); multiplying M by its transpose Mᵀ yields the 64×64 Gram matrix G for layer l=2

Once the Gram matrix is calculated, the style loss can be computed from the Gram matrix of the current input image and the Gram matrix that was determined from the style image at the beginning:

L_{style} = \sum_{l}\sum_{i,j}(\beta G_{i,j}^{stylized,l} - \beta G_{i,j}^{style,l})^2 

Experiments have shown that taking the activations from conv1_2, conv2_2, conv3_3, conv4_1 and conv5_1 generates good results.
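
As a sketch, the Gram matrix and the style loss could be computed like this, again reusing the hypothetical extract_features helper from above; beta is the style weight from the formula:

```python
import torch

# The layers mentioned above for measuring style.
STYLE_LAYERS = ["conv1_2", "conv2_2", "conv3_3", "conv4_1", "conv5_1"]

def gram_matrix(feature_map):
    """Flatten each channel into a row of matrix M and multiply M with its
    transpose: G = M · Mᵀ, with shape (channels, channels)."""
    _, channels, height, width = feature_map.shape   # assumes batch size 1
    flattened = feature_map.reshape(channels, height * width)
    return flattened @ flattened.t()

def style_loss(features, style_targets, beta=1e-4):
    """Squared difference between the Gram matrices of the stylized image and
    the style image, summed over the chosen layers."""
    loss = 0.0
    for layer in STYLE_LAYERS:
        g_stylized = gram_matrix(features[layer])
        g_style = gram_matrix(style_targets[layer])
        loss = loss + ((beta * g_stylized - beta * g_style) ** 2).sum()
    return loss
```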

Overall Loss

The overall loss is simply the sum of the content loss and the style loss.
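
In the sketches above, alpha and beta are applied inside the two loss terms, matching the formulas in this article, so the total loss is a plain sum:

```python
total_loss = content_loss(features, content_targets) + style_loss(features, style_targets)
```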

Fast Neural Style Transfer

As you can imagine, it takes a long time to refine a single image this way: each of the input image's pixels needs to be adjusted until the image produces a minimal overall loss. Fortunately, in 2016 a paper by Johnson et al. introduced the idea that led to Fast Neural Style Transfer.

Transformation Network

Fast Neural Style Transfer adds a second network, a so-called transformation network. It is an image-to-image network with an encoder-decoder architecture: it takes an input image and creates an output image of the same size.

The transformation network can be trained like any other network, using the loss values produced by the feature extraction network. The result is a network that transforms an input image into a stylized image in a single feed-forward pass. The weights of the feature extraction network stay frozen, just as in the original neural style transfer process.

Fast Neural Style Transfer

The transformation network is a simple CNN with residual blocks and strided convolutions; the strided convolutions handle the down- and up-sampling within the network.
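
The following is a very rough sketch of such a transformation network and its training loop, assuming PyTorch. The channel counts, the number of residual blocks and the training details are assumptions loosely following Johnson et al., not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)

class TransformationNet(nn.Module):
    """Encoder-decoder: strided convolutions downsample, residual blocks
    transform, transposed convolutions upsample back to the input size."""
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=9, stride=1, padding=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # downsample
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # downsample
            nn.ReLU(inplace=True),
            *[ResidualBlock(128) for _ in range(5)],
            nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1),  # upsample
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=1, output_padding=1),   # upsample
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=9, stride=1, padding=4),
            nn.Sigmoid(),   # keep output pixels in [0, 1]
        )

    def forward(self, x):
        return self.model(x)

# Training sketch: the frozen VGG19 only provides the loss values, as before.
transform_net = TransformationNet()
optimizer = torch.optim.Adam(transform_net.parameters(), lr=1e-3)

# content_loader is a hypothetical loader yielding single content images
# (batch size 1, to match the gram_matrix sketch above).
for content_batch in content_loader:
    optimizer.zero_grad()
    stylized = transform_net(content_batch)
    features = extract_features(stylized)
    loss = (content_loss(features, extract_features(content_batch))
            + style_loss(features, style_targets))
    loss.backward()
    optimizer.step()

# After training, stylizing a new image is a single forward pass:
# stylized = transform_net(new_image)
```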

Conclusion

The transformation network is a lightweight CNN that, once trained, can stylize an image in a single feed-forward pass. The much deeper and more complex VGG19 network is no longer needed after training and does not have to be deployed on the edge device. This makes style transfer a good candidate for running directly on smartphones, especially those with specialized neural hardware like current iPhones and Android devices.

duup

One really nice app that comes with a clear UI design and very nice styles is duup. The app is available for iOS on the App Store for free and gives a good feeling for what style transfer can do with your own pictures. I suggest you give it a try and see for yourself.
