The network we use has the same architecture as VGG-19, whose convolutional stack consists of 16 convolutional layers interleaved with 5 max-pooling layers for downsampling.
We do not use the neural network in the usual sense: we are not training it to do anything specific. Instead, we exploit backpropagation to minimize two defined loss values, and the tensor we backpropagate into is the stylized image itself. We input the picture we want to transfer the style onto, the “content image”, as well as the “style image” from which we want to extract the patterns, and we initialize the stylized image to random noise. The stylized image, along with the content and style images, is then passed through several layers of a network that is pretrained on image classification. We use the outputs of various intermediate layers to compute a style loss and a content loss: how close the image is to the style image in style, and how close it is to the content image in content. Those losses are then minimized by directly changing the image.
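The whole procedure can be sketched in a few lines of PyTorch. To keep the sketch self-contained and runnable, a tiny random convolutional layer stands in for the pretrained VGG-19 feature extractor, random tensors stand in for the content and style photographs, and the loss weight is an arbitrary placeholder:

```python
import torch

torch.manual_seed(0)

# Stand-in for slices of pretrained VGG-19; only here so the sketch runs.
extractor = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.ReLU(),
)
for p in extractor.parameters():
    p.requires_grad_(False)  # the network itself is never trained

content = torch.rand(1, 3, 64, 64)  # stand-in "content image"
style = torch.rand(1, 3, 64, 64)    # stand-in "style image"

# The stylized image starts as random noise; it is the tensor we
# backpropagate into and directly update.
image = torch.rand(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

def gram(f):
    # Gram matrix of the feature maps, used by the style loss.
    _, c, h, w = f.shape
    f = f.view(c, h * w)
    return f @ f.t() / (c * h * w)

with torch.no_grad():
    target_content = extractor(content)
    target_gram = gram(extractor(style))

losses = []
for _ in range(50):
    optimizer.zero_grad()
    feat = extractor(image)
    content_loss = torch.nn.functional.mse_loss(feat, target_content)
    style_loss = torch.nn.functional.mse_loss(gram(feat), target_gram)
    loss = content_loss + 1e3 * style_loss  # the weighting is a free choice
    loss.backward()
    optimizer.step()  # updates the image pixels, not the network
    losses.append(loss.item())
```

In the real setup, `extractor` would be replaced by slices of pretrained VGG-19 evaluated at several depths, and L-BFGS is often used in place of Adam.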
We use the outputs of intermediate layers of a pretrained image classification network to compute our style and content losses. To classify an image, a network has to understand it, and it is that process of turning the input image into guesses, rather than the guesses themselves, that is useful to us: as the image moves through the layers, the network transforms raw pixel values into an increasingly abstract internal representation of the image's “content”.
When we pass both the stylized image and the content image through the same layers of the image classification network, the content loss is the Euclidean distance between their intermediate representations. This can be evaluated by the following equations.
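In code, that distance is a one-liner. A minimal sketch, using the common ½-scaled squared Euclidean distance and NumPy arrays standing in for the two feature maps:

```python
import numpy as np

def content_loss(feat_stylized, feat_content):
    # Half the squared Euclidean distance between the two
    # intermediate representations.
    return 0.5 * np.sum((feat_stylized - feat_content) ** 2)

f_content = np.zeros((64, 32, 32))   # stand-in feature map of content image
f_stylized = np.ones((64, 32, 32))   # stand-in feature map of stylized image
print(content_loss(f_stylized, f_content))  # 0.5 * 64*32*32 = 32768.0
```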