In this concept, we’ll go over some of the computer vision applications and tasks that attention empowers. In the text below the video, we’ll link to a number of papers in case you want to go deeper into any specific application or task. In this video, we’ll focus on image captioning and one of the key papers from 2015 titled, “Show, Attend and Tell.” This paper presented a model that achieved state-of-the-art performance in caption generation on a number of datasets. For example, when presented with an image like this, the generated caption was, “A woman is throwing a frisbee in a park.” When presented with an image like this, the generated caption was, “A giraffe standing in a forest with trees in the background.” Models like these are trained on a dataset like MS COCO, which has a set of about 200,000 images, each with five captions written by people, sourced through something like Amazon’s Mechanical Turk service. For example, this image from the dataset has these five captions as its labels, and this is the kind of dataset used to train a model like this. If we look under the hood, the model is very similar to the sequence-to-sequence models we’ve looked at earlier in the lesson. In this case, the model takes the image as an input to its encoder; the encoder generates a context and passes it to the decoder. The decoder then proceeds to output a caption. The model generates the caption sequentially and uses attention to focus on the appropriate place in the image as it generates each word of the caption. For example, when presented with this image, in the first step, the trained model focuses on this region. So, this is the thumbnail of this image. The white areas are where the model is paying the most attention right now. So, we can see that it’s mainly focused on the wings. 
It then outputs the first element in the output sequence, or the caption, which is the word “a.” At the next step, the decoder focuses on this region, mainly the body of the bird as you see, and the output would be “bird.” Then it expands its focus area to the region around the bird to try to figure out what to describe next, and the output at this step is “flying.” This goes on; we can see how its attention now starts to radiate out from the bird and focus on things behind it or around it. So, it’s generating “a bird flying over a body of,” and then the focus here completely ignores the bird and looks everywhere else in the image: “water.” The first time I looked at something like this, image captioning specifically, it was mind-blowing to me, but now we have an idea of how it works. The model here is made of an encoder and a decoder, as we’ve mentioned. The encoder in this case is a convolutional neural network that produces a set of feature vectors, each of which corresponds to a part of the image, or a feature of the image. To be more exact, the paper used a VGGNet convolutional network trained on ImageNet, and the annotations were created from its feature map. This feature volume has dimensions of 14 x 14 x 512, meaning that it has 512 features, each with dimensions of 14 x 14. To create our annotation vectors, we need to flatten each feature, turning it from 14 x 14 to 196 x 1. So, this is simply reshaping the matrix. After we reshape, we end up with a matrix of 196 x 512. So, we have 512 features, each one of them a vector of 196 numbers. This is our set of annotation vectors. We can proceed to use it just like we’ve used the context in the previous videos, where we score each of these features and then merge them to produce our attention context vector. The decoder is a recurrent neural network, which uses attention to focus on the appropriate annotation vector at each time step. 
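The reshaping step described above can be sketched in a few lines of NumPy. This is just an illustration of the 14 x 14 x 512 to 196 x 512 reshape; the variable names are our own, and the feature volume here is random rather than the output of an actual VGG network:

```python
import numpy as np

# Stand-in for the CNN encoder's output: a 14 x 14 spatial grid
# with 512 feature channels, as in the VGG feature map above.
features = np.random.rand(14, 14, 512)

# Flatten the two spatial dimensions: each of the 512 feature maps
# becomes a vector of 196 numbers, giving a 196 x 512 matrix
# of annotation vectors.
annotations = features.reshape(196, 512)

print(annotations.shape)  # (196, 512)
```

Each row of this matrix corresponds to one spatial location of the image, which is what lets the attention mechanism focus on different image regions at each decoding step.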
We plug this into the attention process we’ve outlined before and that’s our image captioning model. Be sure to check the text below the video for some very exciting applications in computer vision for attention.
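As a minimal sketch of that scoring-and-merging step, here is one way to compute an attention context vector from the annotation matrix. Note the simplification: we score each annotation vector with a dot product against the decoder's hidden state, whereas the paper learns a small network to produce the scores; the shapes and names are illustrative:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(annotations, hidden_state):
    """Score each annotation vector against the decoder's hidden state,
    normalize the scores into attention weights, and merge the
    annotation vectors into a single attention context vector."""
    scores = annotations @ hidden_state   # one score per image region: (196,)
    weights = softmax(scores)             # attention weights, sum to 1
    return weights @ annotations          # weighted sum of regions: (512,)

# 196 annotation vectors of 512 numbers each, plus a 512-dim decoder state.
annotations = np.random.rand(196, 512)
hidden = np.random.rand(512)
context = attention_context(annotations, hidden)
print(context.shape)  # (512,)
```

The decoder would recompute this context vector at every time step from its current hidden state, which is what makes the attended region shift from the bird's wings to the water as the caption unfolds.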