Now that we’ve taken a high-level look at how attention works in a sequence-to-sequence model, let’s look into it in more detail. We’ll use machine translation as the example, since that’s the application the main papers on attention tackled. But whatever we do here translates to other applications as well. It’s important to note that there are a few varieties of attention algorithms. We’ll be looking at a simple one here.

Let’s start from the Encoder. In this example, the Encoder is a recurrent neural network. When creating an RNN, we have to declare the number of hidden units in the RNN cell. This applies whether we have a vanilla RNN, an LSTM, or a GRU cell.

Before we start feeding our input sequence words to the Encoder, they have to pass through an embedding process, which translates each word into a vector. Here we can see the vector representing each of these words. Now, this is a toy embedding of size four, just for the purpose of easier visualization. In real-world applications, a size like 200 or 300 is more appropriate. We’ll continue to use these color-coded boxes to represent the vectors, just so we don’t have a lot of numbers plastered all over the screen.

Now that we have our words and their embeddings, we’re ready to feed them into our Encoder. Feeding the first word into the first time step of the RNN produces the first hidden state. This is what’s called an unrolled view of the RNN, where we can see the RNN at each time step. We’ll hold on to this state, and the RNN continues on to the next time step. It takes the second word and passes it to the RNN at the second time step, and then it does the same with the third word. Now that we have processed the entire input sequence, we’re ready to pass the hidden states to the attention decoder.
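To make the Encoder steps above a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under several assumptions that are not part of the walkthrough: the three-word sentence, the hidden size of 3, the random (untrained) weights, and the helper names `rnn_step` and `encode` are all made up for this example, and a vanilla RNN cell stands in for whichever cell a real model would use. The one thing it tries to show is that we keep the hidden state from every time step rather than only the last one.

```python
import math
import random

EMBEDDING_SIZE = 4  # toy embedding size, as in the walkthrough
HIDDEN_SIZE = 3     # number of hidden units in the RNN cell (arbitrary, assumed)

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def random_matrix(rows, cols):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical three-word input sentence with a random toy embedding per word.
sentence = ["comment", "allez", "vous"]
embeddings = {
    word: [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_SIZE)]
    for word in sentence
}

W_xh = random_matrix(HIDDEN_SIZE, EMBEDDING_SIZE)  # input-to-hidden weights
W_hh = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)     # hidden-to-hidden weights
b_h = [0.0] * HIDDEN_SIZE                          # hidden bias


def rnn_step(x, h_prev):
    """One vanilla RNN time step: h = tanh(W_xh x + W_hh h_prev + b_h)."""
    h = []
    for i in range(HIDDEN_SIZE):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(EMBEDDING_SIZE))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(HIDDEN_SIZE))
        h.append(math.tanh(total))
    return h


def encode(words):
    """Run the unrolled RNN and hold on to the hidden state of every step."""
    h = [0.0] * HIDDEN_SIZE  # initial hidden state
    hidden_states = []
    for word in words:
        h = rnn_step(embeddings[word], h)
        hidden_states.append(h)
    return hidden_states


encoder_states = encode(sentence)
for t, state in enumerate(encoder_states, start=1):
    print(f"h{t} =", [round(v, 3) for v in state])
```

The list `encoder_states` holds one vector per input word, and it is this whole list, not just its last entry, that would be handed to the attention decoder. In practice we would reach for a deep learning library's embedding and RNN layers instead of hand-written loops; the loops are spelled out here only to mirror the unrolled view.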