Hello, I’m Jay, and in this lesson, we’ll be talking about one of the most important innovations in deep learning in the last few years: attention. Attention started out in the field of computer vision as an attempt to mimic human perception. This is a quote from a paper on Visual Attention from 2014. It says, “One important property of human perception is that one does not tend to process a scene in its entirety all at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it’s needed, and then combine information from different fixations over time to build up an internal representation of the entire scene, guiding future eye movement and decision-making.” What that means is that when we look at a scene in our daily lives, our brains do not just process a visual snapshot all at once. Instead we selectively focus on different parts of the image, and we sequentially collect and process that visual information over time. Say, for example, you’re shopping in a shopping mall. Now, the footage that we’re going to see here is from an eye-tracking device. If you haven’t seen those before, it’s a device that records both what’s in front of you and your eye movement. Then, we can overlay these two recordings so we can have an idea of where you were looking at each moment in the video. So, what we’re seeing here is footage from a person wearing the eye-tracking device, where the orange circle is highlighting where the person is looking at each moment. So, we can see attention in general visual perception, but you can also see it in reading, as we process text one word at a time. This type of device is used, for example, in user experience testing.
This is useful if you want to build a user interface for an app or website and want to check whether the most important things are actually grabbing the attention of the users. Since that’s how humans tend to understand a visual picture, sequentially over time, the idea from computer vision researchers was to adopt a method that does the same for computer vision models. In machine learning, attention methods give us a mechanism for adding selective focus to a machine learning model, typically one that does its processing sequentially. Attention is a concept that powers some of the best-performing models spanning both natural language processing and computer vision. These models include neural machine translation, image captioning, speech recognition, and text summarization, as well as others. Take image classification and captioning as an example. Before the use of attention, convolutional neural networks were able to classify images by looking at the whole image and outputting a class label. But not all of the image is necessary to produce that classification; only some of the pixels are needed to identify a bird, and attention came out of the desire to attend to these most important pixels. Not only that, but attention also improved our ability to describe images with full sentences by focusing on different parts of the image as we generate our output sentence. Attention achieved its rise to fame, however, from how useful it became in tasks like neural machine translation. As sequence-to-sequence models started to exhibit impressive results, they were held back by certain limitations that made it difficult for them to process long sentences, for example. Classic sequence-to-sequence models, without attention, have to look at the original sentence that you want to translate one time and then use that entire input to produce every single small output word.
Attention, however, allows the model to look at the small, relevant parts of the input as you generate the output over time. When attention was incorporated into sequence-to-sequence models, they became the state of the art in neural machine translation. This is what led Google to adopt neural machine translation with attention as the translation engine for Google Translate at the end of 2016. In this lesson, we’ll look at how attention works and how and where it can be applied.
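
To make the idea of "looking at the small, relevant parts of the input" a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration, not the method used by any particular system mentioned here. It assumes dot-product scoring (one of several possible scoring functions), and the names `attend`, `softmax`, `query`, and `encoder_states` are hypothetical, as are the toy vectors.

```python
import math

def softmax(scores):
    # Turn raw scores into positive weights that sum to 1.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    # Assumption: score each input position with a dot product
    # between the query vector and that position's state vector.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    # The context vector is the weighted sum of the input states,
    # so positions with higher weights contribute more.
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three input positions, each a 2-dimensional state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]  # hypothetical state of the model at one output step

weights, context = attend(query, encoder_states)
print(weights)  # one weight per input position
print(context)  # summary of the input, focused on the relevant positions
```

In this toy setup, the first input position gets the smallest weight because it has nothing in common with the query, so it contributes least to the context vector. A model without attention would instead use the same single summary of the input at every output step.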