Let’s look at how self-attention works in a little more detail. Say we have these words that we want our encoder to read and create a representation of. As always, we begin by embedding them into vectors. Since the transformer gives us a lot of flexibility for parallelization, this example assumes we’re looking at the processor or GPU tasked with encoding the second word of the input sequence.

The first step is to compare the embeddings. We score the embedding of the word we’re currently reading or encoding against each of the other words in the input sequence, so we get one score per word. Then we scale each score: we divide by the square root of the dimension of the keys. We’re using a toy dimension of four here, so we divide by two. We apply a softmax to these scaled scores, and then we multiply each softmax score by the corresponding embedding, which sets how strongly each of these vectors is expressed. The embedding of the current word just stays largely as it is. We add the weighted vectors up, and that produces the self-attention context vector, if that’s what we’d like to call it.

This is the image the authors of the paper showed when they first presented it at the NIPS conference. Again, we’re looking at the second word. So we have the words, then their embeddings, and then the vectors of those embeddings. The current word is compared, or scored, against each of the other words in the input sequence. Each score is then multiplied by the embedding of the relevant word, and all of these are added up. In this graphic, we did not score the current word; we scored only the other words. After we add them up, we just pass the result up to the feedforward neural network.

If we implement it like this, however, we’d see that the model mainly focuses on other similar words, because we’re judging similarity only by the embeddings of the words. So there’s a small modification we need to make here: we need to create queries out of each embedding.
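The simplified first pass just described, scoring each embedding against the others, scaling by the square root of the key dimension, applying a softmax, and summing the weighted embeddings, can be sketched in plain Python. The embedding numbers below are made up purely for illustration:

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(embeddings, i):
    """Context vector for position i: score the i-th embedding against
    every embedding, scale by sqrt(d_k), softmax, then take the
    softmax-weighted sum of the embeddings themselves."""
    d_k = len(embeddings[0])  # toy key dimension (4), so we scale by sqrt(4) = 2
    scores = [sum(a * b for a, b in zip(embeddings[i], e)) for e in embeddings]
    weights = softmax([s / math.sqrt(d_k) for s in scores])
    # Weighted sum of the embedding vectors, component by component.
    return [sum(w * e[j] for w, e in zip(weights, embeddings))
            for j in range(d_k)]

# Toy 4-dimensional embeddings for a three-word sequence (invented numbers).
emb = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 2.0, 0.0, 2.0],
       [1.0, 1.0, 1.0, 1.0]]
context = simple_self_attention(emb, 1)  # encode the second word
```

Note that without any learned projections, the weights here depend only on raw embedding similarity, which is exactly the limitation the query/key modification addresses next.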
We do that by just multiplying by a query matrix, or, put another way, by passing the embedding through a query feedforward layer. We also create keys, using another, separate key matrix. Now we can calculate the scores again. We have our embeddings, and we create the queries. We’re only processing the second word here, so that’s the word we create the query for. Then we have our keys. The scoring compares the query against each key; that’s where we get these numbers, 40 and then 26. We scale, apply the softmax, and then multiply each softmax score by the corresponding embedding. Adding all of these together gets us the self-attention context vector.

This is an acceptable way of doing it, but there is one more variation we need to look at. So, these are our embeddings. We have our queries, which are made by multiplying each embedding by the Q matrix, which is learned during training. We have our keys, which are created by multiplying the embeddings by the K matrix, and we have our values, which are produced the same way by multiplying by the V matrix, which is also learned during training.

This is a graphic from the authors’ NIPS presentation as well, where they outline how to create the key, the query, and the value. So, this is the embedding: we multiply it by V to get the value, by Q to get the query, and by K to get the key.

So, the final form of self-attention, as presented in this paper, is this: we have our embeddings, and we’ve calculated our values, keys, and queries. We score the queries against the keys, and then the resulting softmax scores are multiplied by the values. These we add up and pass to the feedforward neural network.

This has been a very high-level view of the model and a discussion of the self-attention concept. In the text below the video, we’ll link to the paper and some implementations of it, if you’re interested in going a little bit deeper into the transformer.
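The final form can be sketched the same way. The Q, K, and V matrices below are placeholder identity matrices standing in for the learned projections (in the real model they come from training), and the embedding numbers are again invented for illustration:

```python
import math

def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(M, v):
    # Multiply matrix M (a list of rows) by vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def self_attention(embeddings, Wq, Wk, Wv, i):
    """Context vector for position i in the final form: project the
    embeddings into queries, keys, and values, score query i against
    every key, scale by sqrt(d_k), softmax, and take the
    softmax-weighted sum of the values."""
    q = matvec(Wq, embeddings[i])
    keys = [matvec(Wk, e) for e in embeddings]
    values = [matvec(Wv, e) for e in embeddings]
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Identity matrices as stand-ins for the learned Q, K, and V projections.
I4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
emb = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 2.0, 0.0, 2.0],
       [1.0, 1.0, 1.0, 1.0]]
context = self_attention(emb, I4, I4, I4, 1)  # encode the second word
```

With identity projections this reduces to the simplified version; what the learned Q, K, and V matrices add is the ability to score and mix the words in a space the training process shapes, rather than in raw embedding space.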