The RNN component of the captioning network is trained on the captions in the COCO dataset. We're aiming to train the RNN to predict the next word of a sentence based on the previous words. But how exactly can it train on string data? Neural networks do not do well with strings. They need well-defined numerical inputs to effectively perform backpropagation and learn to produce the desired output. So, we have to transform the caption associated with each image into a list of tokenized words. This tokenization turns any string into a list of integers.

So how does this tokenization work? First, we iterate through all of the training captions and create a dictionary that maps all unique words to a numerical index. So, every word we come across will have a corresponding integer value that we can find in this dictionary. The words in this dictionary are referred to as our vocabulary. The vocabulary typically also includes a few special tokens. In this example, we'll add two special tokens to our dictionary: a start token and an end token that mark the beginning and the end of a sentence. Now, the size of the entire vocabulary is the number of unique words in our training dataset plus two, for the start and end tokens.

Let's take a look at this sample caption: "A person doing a trick on a rail while riding a skateboard." This caption is first converted into a list of tokens, with the special start and end tokens marking the beginning and end of the sentence. This list of tokens is then turned into a list of integers using our dictionary, which maps each distinct word in the vocabulary to an integer value.

There's one more step before these words get sent as input to an RNN, and that's the embedding layer, which transforms each word in a caption into a vector of a desired, consistent shape. After this embedding step, we're finally ready to train an RNN that can predict the most likely next word in a sentence.
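To make the tokenization step concrete, here's a minimal sketch in Python. The "<start>" and "<end>" token names, the helper functions, and the simplified punctuation handling are illustrative assumptions, not the exact code used in the captioning pipeline (real pipelines typically use a tokenizer like nltk.word_tokenize).

```python
def tokenize(caption):
    """Lowercase a caption and split it into word tokens.
    Punctuation handling is simplified for this sketch."""
    return caption.lower().replace(".", "").split()

def build_vocab(captions):
    """Map every unique word in the training captions to an integer index,
    reserving two special tokens for the start and end of a sentence."""
    vocab = {"<start>": 0, "<end>": 1}
    for caption in captions:
        for word in tokenize(caption):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def caption_to_ids(caption, vocab):
    """Wrap a caption in start/end tokens and convert it to integer IDs."""
    tokens = ["<start>"] + tokenize(caption) + ["<end>"]
    return [vocab[token] for token in tokens]

caption = "A person doing a trick on a rail while riding a skateboard."
vocab = build_vocab([caption])
print(caption_to_ids(caption, vocab))
# [0, 2, 3, 4, 2, 5, 6, 2, 7, 8, 9, 2, 10, 1]
# Starts with the <start> index, ends with the <end> index, and the
# repeated word "a" maps to the same integer (2) each time it appears.
```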
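The embedding layer can be sketched in the same spirit, here assuming PyTorch's nn.Embedding. The vocabulary size comes from the toy dictionary above, and the embedding dimension of 256 is just an assumed, commonly used value, not a prescribed setting.

```python
import torch
import torch.nn as nn

vocab_size = 11   # unique words + 2 special tokens in the toy vocabulary above
embed_size = 256  # assumed embedding dimension; the "consistent shape" per word

# Maps each integer word ID to a dense vector of length embed_size.
embedding = nn.Embedding(vocab_size, embed_size)

# A batch containing the one tokenized sample caption (14 token IDs).
caption_ids = torch.tensor([[0, 2, 3, 4, 2, 5, 6, 2, 7, 8, 9, 2, 10, 1]])
embedded = embedding(caption_ids)

print(embedded.shape)  # torch.Size([1, 14, 256]) -- one vector per token
```

Each word in the caption is now a 256-dimensional vector, which is the numerical form the RNN actually consumes during training.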