9 – 08 Video Captioning V1

This captioning network can also be applied to video, not just single images. In the case of video captioning, the only thing that has to change about this network architecture is the feature extraction step that occurs between the CNN and the RNN. The input to the pre-trained CNN will be a short video clip …
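One common way to handle that feature extraction step is to run the CNN on each frame of the clip and average the per-frame feature vectors into a single clip-level vector. The sketch below assumes this averaging approach; `extract_frame_features` is a hypothetical stand-in for a real pre-trained CNN forward pass.

```python
import numpy as np

def extract_frame_features(frame, feature_dim=2048):
    # Placeholder for a pre-trained CNN applied to one frame;
    # here it just returns a deterministic pseudo-random vector.
    rng = np.random.default_rng(int(frame.sum()) % (2**32))
    return rng.standard_normal(feature_dim)

def video_feature(frames, feature_dim=2048):
    """Average per-frame CNN features into one clip-level vector,
    which the RNN decoder can then consume exactly like a
    single-image feature."""
    feats = np.stack([extract_frame_features(f, feature_dim) for f in frames])
    return feats.mean(axis=0)

clip = [np.ones((224, 224, 3)) * i for i in range(8)]  # 8 dummy frames
feat = video_feature(clip)
print(feat.shape)  # (2048,)
```

The rest of the captioning network stays unchanged: the decoder never needs to know whether its input vector came from one image or from many averaged frames.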

8 – 07 RNN Training V4

Let’s take a closer look at how the decoder trains on a given caption. The decoder will be made of LSTM cells, which are good at remembering lengthy sequences of words. Each LSTM cell expects an input vector of the same shape at each time step. The very first cell is connected to …
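The fixed-input-shape requirement can be illustrated by assembling the decoder's input sequence: at the first time step the LSTM sees the (projected) image feature, and at every later step it sees a word embedding of the same dimensionality. This is a minimal sketch with assumed dimensions, not the course's exact code.

```python
import numpy as np

vocab_size, embed_dim = 10, 8
embedding = np.random.randn(vocab_size, embed_dim)   # word embedding table
image_feature = np.random.randn(embed_dim)           # CNN feature projected to embed_dim

caption = [1, 4, 2, 3]                               # token indices for one caption
# Time step 0 feeds in the image feature; later steps feed in word embeddings.
inputs = [image_feature] + [embedding[w] for w in caption]

# Every time-step input has the same shape, as the LSTM cell requires.
assert all(v.shape == (embed_dim,) for v in inputs)
print(len(inputs))  # 5 time steps
```

Because the image feature is projected into the same space as the word embeddings, the LSTM can treat it as just another step in the sequence.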

7 – Tokenization

A token is a fancy term for a symbol, usually one that holds some meaning and is not typically split up any further. In the case of natural language processing, our tokens are usually individual words. So tokenization is simply splitting each sentence into a sequence of words. The simplest way to do this is using the …
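A minimal sketch of word tokenization: lowercase the sentence and pull out word-like runs of characters. A simple regular expression stands in here for a full tokenizer such as NLTK's `word_tokenize`.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

print(tokenize("A dog plays in the park."))
# ['a', 'dog', 'plays', 'in', 'the', 'park']
```

Punctuation is dropped and case is normalized, so "A" and "a" map to the same token, which keeps the vocabulary small.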

6 – 06 Tokenizing Captions V3

The RNN component of the captioning network is trained on the captions in the COCO dataset. We’re aiming to train the RNN to predict the next word of a sentence based on previous words. But, how exactly can it train on string data? Neural networks do not do well with strings. They need a well-defined …
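The usual answer is to map each word to an integer index via a vocabulary built from the training captions. The sketch below assumes the common `<start>`/`<end>`/`<unk>` special tokens; the exact token names are an assumption, not the course's code.

```python
def build_vocab(captions):
    """Assign an integer index to every word seen in the captions."""
    vocab = {"<start>": 0, "<end>": 1, "<unk>": 2}
    for cap in captions:
        for word in cap.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(caption, vocab):
    """Wrap a caption in start/end tokens and convert words to indices."""
    tokens = ["<start>"] + caption.lower().split() + ["<end>"]
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

vocab = build_vocab(["a dog runs", "a cat sleeps"])
print(encode("a dog sleeps", vocab))
# [0, 3, 4, 7, 1]
```

These integer sequences are what actually get embedded and fed to the RNN; unseen words fall back to the `<unk>` index.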

3 – 03 Captions And The COCO Dataset V3

The first thing to know about an image captioning model is how it will train. Your model will learn from a dataset composed of images paired with captions that describe the content of those images. Say you’re asked to write a caption that describes this image; how would you approach this task? First, you might look …

2 – 02 Leveraging Neural Networks V3

Hi, I’m Calvin Lin and I work for the Deep Learning Institute at NVIDIA, where we’re tasked with enabling everyone in the industry to create AI companies. For that, we need very capable AI, and image captioning is where we go from mere perception modules to ones with generative capabilities. A captioning model relies on two …

10 – 09 On To The Project V2

Now that you’ve learned about the structure of an automatic captioning system, your next task will be to apply what you’ve learned to build and train your own captioning network. You’ll be provided with some data pre-processing steps, and your main tasks will be about deciding how to train the RNN portion of the model. …

1 – 01 L Introduction V3

We looked at Convolutional Neural Networks that are used for image classification and object localization. And we looked at Recurrent Neural Networks mostly in the context of text generation. You’ve seen how networks like LSTMs can learn from sequential data, like a series of words or characters. These networks use hidden layers that over time …