Since the two main Attention papers were published in 2014 and 2015, Attention has been an active area of research. While those two mechanisms continue to be commonly used, there have been significant developments over the years. In this video, we will look at one of these developments, published in a paper titled Attention Is All You Need. This paper noted that the complexity of encoder-decoder models with Attention can be reduced by adopting a new type of model that uses only Attention, with no RNNs. They called this new model the Transformer. In two of their experiments on machine translation tasks, the model proved superior in quality while requiring significantly less time to train.
The Transformer takes a sequence as input and generates a sequence, just like the sequence-to-sequence models we’ve seen so far. The difference here, however, is that it does not take the inputs one by one, as an RNN does. It can process all of them together in parallel. Perhaps each element is processed by a separate GPU, if we want. It then produces the output one element at a time, but again without using an RNN.
The Transformer model also breaks down into an encoder and a decoder. But instead of RNNs, they use feed-forward neural networks and a concept called self-attention. This combination allows the encoder and decoder to work without RNNs, which vastly improves speed, since it allows parallelization of processing that was not possible with RNNs. The Transformer contains a stack of identical encoder layers and a stack of identical decoder layers. Six is the number the paper proposes for each. Let’s focus on the encoder layer and look at it more closely. Each encoder layer contains two sublayers: a multi-headed self-attention layer and a feed-forward layer. As you might notice, this Attention component is completely on the encoder side, as opposed to being a decoder component like the previous Attention mechanisms we’ve seen.
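To make the two sublayers of an encoder layer a little more concrete, here is a minimal sketch in plain Python. It is a simplification and not the paper's implementation: it uses a single attention head instead of multiple heads, it leaves out the residual connections and layer normalization found in the paper, and every name in it (`self_attention`, `feed_forward`, `encoder_layer`, the `params` dictionary and its weight matrices) is hypothetical. The weights are set to the identity only so that the example runs without any training.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: every position scores every position."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))                 # scaling by sqrt of key size
    k_t = [list(col) for col in zip(*k)]         # transpose of the keys
    scores = matmul(q, k_t)                      # one row of scores per position
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def feed_forward(x, w1, w2):
    """Position-wise feed-forward: the same two-layer network on each row."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)

def encoder_layer(x, params):
    """Sublayer 1 (self-attention) followed by sublayer 2 (feed-forward)."""
    attended, weights = self_attention(x, params["w_q"], params["w_k"], params["w_v"])
    return feed_forward(attended, params["w1"], params["w2"]), weights

# Three input elements of size 2, all handled in one call with no recurrence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {name: identity for name in ("w_q", "w_k", "w_v", "w1", "w2")}  # untrained placeholders
out, weights = encoder_layer(x, params)
```

Notice that nothing in `encoder_layer` depends on having processed an earlier position first. Each row of `weights` is computed independently of the others, which is what makes the parallel processing described above possible.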
This Attention component helps the encoder comprehend its inputs by focusing on other parts of the input sequence that are relevant to each input element it processes. This idea is an extension of earlier work on self-attention and how it can aid comprehension. In one paper, for example, this type of Attention is used in the context of machine reading, where the technique matched or outperformed the state of the art at that time in tasks like language modeling, sentiment analysis and natural language inference. They still used RNNs, but they augmented them with this idea that later became self-attention. The example they used in this machine reading paper shows where the trained model pays attention as it reads each word. So, for example, when the model reads the sentence using an LSTM, it learns which other parts of the input to pay attention to as it processes each word. The red is where it’s reading and the blue is where it’s paying attention as it reads that word. At each step, it reads a word and pays attention to the relevant previous words that would aid in comprehending that word. The structure of the Transformer, however, allows the encoder to focus not only on previous words in the input sequence, but also on words that appear later in the input sequence.
This, however, is not the only Attention component in the Transformer. The decoder contains two Attention components: one that allows it to focus on the relevant parts of the inputs, and another that only pays attention to previous decoder outputs. And there you have it: a high-level view of the components of the Transformer. We can see how extensively this model uses Attention, with three Attention components here. They don’t all work exactly the same way, but they all boil down pretty much to multiplicative attention, which we already understand.
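As a rough illustration of how the three Attention components can share one multiplicative scoring function, here is a small sketch in plain Python. It assumes the differences come down to where the queries, keys and values come from and whether a mask hides later positions. The function names (`attention`, `causal_mask`) and the toy vectors are made up for this example, and learned projections and multiple heads are left out to keep it short.

```python
import math

def attention(queries, keys, values, mask=None):
    """Scaled multiplicative (dot-product) attention over lists of vectors.

    mask[i][j] == False means query i is not allowed to look at key j.
    Assumes every query is allowed to see at least one key.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        if mask is not None:
            scores = [s if ok else float("-inf") for s, ok in zip(scores, mask[i])]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]   # masked scores become 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        size = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(size)])
        all_weights.append(weights)
    return outputs, all_weights

def causal_mask(n):
    """Position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy input representations
decoder_states = [[0.5, 0.5], [1.0, 0.0]]               # toy outputs produced so far

# 1. Encoder self-attention: each input looks at earlier and later inputs.
_, enc_w = attention(encoder_states, encoder_states, encoder_states)

# 2. Decoder self-attention: each output looks only at previous decoder outputs.
_, dec_w = attention(decoder_states, decoder_states, decoder_states,
                     causal_mask(len(decoder_states)))

# 3. Encoder-decoder attention: decoder queries over the encoder's representations.
context, cross_w = attention(decoder_states, encoder_states, encoder_states)
```

In this sketch the first row of `dec_w` puts no weight on the second decoder position, because the mask hides it, while every row of `enc_w` spreads weight over all three inputs.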