This captioning network can also be applied to video, not just single images. For video captioning, the only part of the architecture that has to change is the feature extraction step between the CNN and the RNN. The input to the pre-trained CNN is a short video clip, which can be thought of as a sequence of image frames. The CNN produces one feature vector per frame, but the RNN expects a single feature vector as input, so we have to merge the per-frame vectors into one vector that represents all of the frames. One simple way to do this is to average the feature vectors across the frames, producing a single mean feature vector that summarizes the entire clip. From there we proceed as usual: the averaged vector serves as the RNN's initial input, and the RNN is trained on a set of tokenized video captions.
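The averaging step can be sketched in a few lines of NumPy. This is a minimal illustration, not the full pipeline: the frame count, feature dimension, and random vectors below are stand-ins for the outputs a real pretrained CNN would produce for each frame.

```python
import numpy as np

# Hypothetical setup: a clip of 16 frames, each already passed through a
# pretrained CNN that yields a 2048-dimensional feature vector per frame.
# Random vectors stand in for the real CNN outputs here.
rng = np.random.default_rng(0)
num_frames, feature_dim = 16, 2048
frame_features = rng.standard_normal((num_frames, feature_dim))

# Mean-pool across the time axis to collapse the per-frame vectors
# into a single clip-level feature vector.
clip_feature = frame_features.mean(axis=0)

print(clip_feature.shape)  # (2048,)
```

The resulting `clip_feature` has the same shape as a single-image feature vector, so it can be fed to the RNN exactly as in the image-captioning case.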