End to end, we want our captioning model to take in an image as input and output a text description of that image. The input image is processed by a CNN, and the output of the CNN is connected to the input of an RNN, which allows us to generate descriptive text.

To see how this works, let's look at a particular example. Say we have a training image with the associated caption "a man holding a slice of pizza." Our goal is to train a network to generate this caption given the image as input.

First, we feed this image into a CNN. We'll be using a pre-trained network like VGG16 or ResNet. At the end of these networks is a softmax classifier that outputs a vector of class scores. But we don't want to classify the image; instead, we want a set of features that represents the spatial content of the image. To get that kind of spatial information, we remove the final fully connected layer that classifies the image and look at an earlier layer that distills the spatial information in the image.

Now we're using the CNN as a feature extractor that compresses the huge amount of information contained in the original image into a smaller representation. This CNN is often called the encoder because it encodes the content of the image into a smaller feature vector. We can then process this feature vector and use it as the initial input to the following RNN.