A sequence-to-sequence model with attention works in the following way. First, the encoder processes the input sequence one word at a time, just like the model without attention, producing a hidden state and using that hidden state in the next step. Next, the model passes a context vector to the decoder, but unlike the context vector in the model without attention, this one is not just the final hidden state: it is all of the hidden states. This gives us flexibility in the context size, so longer sequences can have longer context vectors that better capture the information from the input sequence. One additional point that is important for the intuition of attention is that each hidden state is associated most strongly with the word that was just processed when it was produced. The first hidden state is output after processing the first word, so it captures the essence of the first word the most; when we focus on that vector, we are focusing on that word the most. The same goes for the second hidden state and the second word, and for the third hidden state and the third word, even though that third and last vector also incorporates a little bit of everything that preceded it.
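The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the full model: the encoder hidden states are random stand-ins, and the dot-product scoring rule and all shapes are assumptions chosen for clarity. The key point it shows is that the decoder receives and weighs all encoder hidden states, not just the final one.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, hidden_size = 4, 8  # 4 input words, 8-dim hidden states (illustrative)

# Encoder: one hidden state per input word. In a real model these come from
# an RNN; here they are random placeholders.
encoder_states = rng.normal(size=(seq_len, hidden_size))

# Decoder hidden state at the current decoding step (also a placeholder).
decoder_state = rng.normal(size=(hidden_size,))

# Dot-product attention: score each encoder hidden state against the
# decoder state. One score per input word.
scores = encoder_states @ decoder_state              # shape: (seq_len,)

# Softmax turns the scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()      # shape: (seq_len,)

# Context vector: a weighted mix of ALL encoder hidden states. A longer
# input simply contributes more hidden states, rather than being squeezed
# into a single final state.
context = weights @ encoder_states                   # shape: (hidden_size,)

print(weights.shape, context.shape)
```

Because the first hidden state mostly encodes the first word, a large weight on it means the decoder is attending mostly to that word at this step.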