Now, let’s look at the attention decoder and how it works at a very high level. At every time step, an attention decoder pays attention to the appropriate part of the input sequence using the context vector. How does the attention decoder know which part of the input sequence to focus on at each step? That is learned during the training phase, and the decoder doesn’t just march mechanically from the first word to the second to the third. It can learn more sophisticated behavior.
Let’s look at an example of translating a French sentence into English. Say we have this input sentence in French, we’ve passed it to our encoder, and now we’re ready to look at each step of the decoding phase. In the first step, the attention decoder pays attention to the first part of the sentence. This is a trained model, so the lighter a square is, the more attention the model gave to that particular word. So it pays attention to the first word and outputs the first English word. In the second step, it pays attention to the second word in the input sequence and translates that word as well. It goes on sequentially like this for about four steps, and it produces a reasonable English translation so far.
Then something different happens in the fifth step. When we’re generating the fifth word of the output, the attention actually jumps two words ahead to translate européenne. We have zone, économique, européenne, and on the English side these words are not going to be in the same order. So européenne is translated as European, then in the next step the decoder focuses on the word before that, économique, and outputs economic, and then it focuses on zone and outputs area. This is a case where the order of the words in French does not match the order they would take in English, and the model was able to learn that just from the training dataset. The rest of the sentence goes on pretty much sequentially.
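
To make the reordering concrete, here is a minimal sketch of how you could read an alignment off an attention matrix. The weights below are invented for illustration and do not come from a trained model, and the names `attention`, `most_attended`, and `alignment` are hypothetical. Each row is one decoding step, and the largest weight in the row plays the role of the lightest square.

```python
# Toy illustration only: these weights are made up, not taken from a trained model.
# Accents are omitted to keep the sketch ASCII-only.
source_words = ["zone", "economique", "europeenne"]
target_words = ["European", "economic", "area"]

# attention[t][s] = weight the decoder puts on source word s
# while producing target word t. Each row sums to 1.
attention = [
    [0.05, 0.15, 0.80],  # producing "European": mostly looks at "europeenne"
    [0.10, 0.80, 0.10],  # producing "economic": mostly looks at "economique"
    [0.85, 0.10, 0.05],  # producing "area": mostly looks at "zone"
]


def most_attended(weights):
    """Return the index of the largest weight in one row."""
    return max(range(len(weights)), key=lambda i: weights[i])


# One source index per decoding step.
alignment = [most_attended(row) for row in attention]

for target, src_index in zip(target_words, alignment):
    print(f"{target!r} attends mostly to {source_words[src_index]!r}")
```

With these made-up weights the alignment runs backwards over the three French words, which is the same pattern as in the example.
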
So, this is a really cool example of how attention lets these models focus on the right parts of the input at the right moments, based on the dataset they were trained on.
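
As a rough sketch of what "focusing" means in code, one decoding step can score every encoder state, turn the scores into weights with a softmax, and take the weighted sum as the context vector. This assumes dot-product scoring, which is only one of several options and is not something stated above. The function names and the tiny two-dimensional vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step.

    Assumption: scores are plain dot products between the decoder state
    and each encoder state. Other scoring functions exist.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Weighted sum of the encoder states, one dimension at a time.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Made-up vectors: one encoder state per input word.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = context_vector(decoder_state, encoder_states)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

In a real model, the encoder and decoder states are learned during training, which is where the behavior of jumping around the input comes from.
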