We looked at Convolutional Neural Networks, which are used for image classification and object localization, and we looked at Recurrent Neural Networks, mostly in the context of text generation. You’ve seen how networks like LSTMs can learn from sequential data, such as a series of words or characters. These networks use hidden layers whose output at one time step is fed back in as input at the next time step. This creates a kind of memory loop that allows the network to learn from previous information.

In this section, we’ll see how we can put these two types of networks together to create an automatic image captioning model that takes in an image as input and outputs a sequence of text that describes the image.

Image captions are used in a variety of applications. For example, they can describe images to people who are blind or have low vision and who rely on sound and text to understand a scene. In web development, it’s good practice to provide a description for any image that appears on a page, so that the image can be read or heard as opposed to just seen. This makes web content accessible. Similarly, captions can be used to describe video in real time. You can imagine a lot of use cases for which automatically generated captions would be really useful.

This lesson and the following project will be all about creating a model that can produce descriptive captions for an image. To help us learn about the architecture of an image captioning model, next you’ll be introduced to Kelvin Lwin, an industry expert who works for the Deep Learning Institute at Nvidia.
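The memory loop described above can be sketched as a single recurrent update: the hidden state computed at one time step is mixed with the next input to produce the next hidden state. Here is a minimal NumPy sketch of that idea; the layer sizes, weight names, and random initialization are arbitrary illustrative assumptions, not from the lesson, and a real captioning model would use an LSTM cell in a deep learning framework rather than this plain RNN step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, arbitrary sizes (assumptions for this sketch).
input_size, hidden_size = 8, 16

# One weight matrix for the current input, one for the recurrent
# (hidden-to-hidden) connection that carries memory forward in time.
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: combine the current input x_t with the previous
    hidden state h_prev to produce the new hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll over a short sequence; h accumulates information from
# every earlier step, which is the "memory" of the network.
sequence = [rng.standard_normal(input_size) for _ in range(5)]
h = np.zeros(hidden_size)
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)
```

In the captioning model built later in the lesson, a CNN first encodes the image into a feature vector, and a recurrent decoder like this one consumes that vector and emits the caption one word at a time.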