There are a couple of ways to connect the output of the CNN to the RNN that follows. In all cases, the feature vector extracted from the CNN goes through some processing steps before it's used as input to the first cell in the RNN.

It can prove useful to pass the CNN output through an additional fully-connected or linear layer before using it as input to the RNN. This is similar to what you've seen in other transfer learning examples: the CNN we're using is pretrained, and adding an untrained linear layer at the end of it allows us to tweak only this final layer as we train the entire model to generate captions.

After we extract a feature vector from the CNN and process it, we can use it as the initial input to our RNN. The job of the RNN is to decode the processed feature vector and turn it into natural language. This portion of the network is often called the decoder, and we'll learn more about caption generation and training this decoder next.