Welcome back. In this video, we'll briefly recap how sequence-to-sequence models work. A sequence-to-sequence model takes in an input that is a sequence of items, and it produces another sequence of items as an output. In a machine translation application, the input sequence is a series of words in one language, and the output is the translation in another language. In text summarization, the input is a long sequence of words, and the output is a short one.

A sequence-to-sequence model usually consists of an encoder and a decoder. It works by having the encoder first process all of the inputs, turning them into a single representation, typically a single vector. This is called the context vector, and it contains whatever information the encoder was able to capture from the input sequence. This vector is then sent to the decoder, which uses it to formulate an output sequence. In machine translation scenarios, the encoder and decoder are both recurrent neural networks, typically LSTMs in practice, and the context vector is a vector of numbers encoding the information that the encoder captured from the input sequence. In real-world scenarios, this vector can have a length of 256, 512, or more. As a visual representation, we'll start showing the hidden states as this vector of length four. Just think of the brightness of each cell as corresponding to how high or low its value is.

Let's look at our basic example again, but this time we'll watch the hidden states of the encoder as they develop. In the first step, we process the first word and generate the first hidden state. In the second step, we take the second word and the first hidden state as inputs to the RNN, and produce a second hidden state. In the third step, we process the last word and generate the last hidden state. This last hidden state is the context vector we send to the decoder.

Now, here is the limitation of sequence-to-sequence models. The encoder is confined to sending a single vector, no matter how long or short the input sequence is. Choosing a reasonable size for this vector leaves the model struggling with long input sequences. Now, one could say, let's just use a very large number of hidden units in the encoder, so that the context vector is very large. But then your model overfits on short sequences, and you take a performance hit as you increase the number of parameters. This is the problem that attention solves.
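To make the step-by-step walkthrough concrete, here is a minimal sketch of the encoder loop in PyTorch. This is not the video's actual code: the names (`embed`, `encoder_rnn`), the tiny vocabulary, the token ids, and the hidden size of four (chosen to match the length-four vector in the visuals) are all illustrative assumptions.

```python
# A minimal sketch of the encoder walkthrough above, using PyTorch.
# All names, sizes, and token ids are illustrative assumptions, not the video's code.
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 10     # hypothetical tiny vocabulary
embedding_dim = 4
hidden_size = 4     # matches the length-four vector in the visuals; real models use 256, 512, or more

embed = nn.Embedding(vocab_size, embedding_dim)
encoder_rnn = nn.LSTMCell(embedding_dim, hidden_size)

# A three-word input sentence, as token ids.
tokens = torch.tensor([1, 2, 3])

# Start from zero hidden and cell states.
h = torch.zeros(1, hidden_size)
c = torch.zeros(1, hidden_size)

# Process one word per step; each step takes the current word and the
# previous hidden state, and produces the next hidden state.
for t in tokens:
    x = embed(t).unsqueeze(0)      # the current word's embedding
    h, c = encoder_rnn(x, (h, c))  # the RNN update
    print(h)                       # watch the hidden state develop

context = h  # the last hidden state is the context vector sent to the decoder
```

Notice that this loop is the whole story of the encoder: whatever the input length, everything has to be squeezed into that final `context` tensor, which is exactly the bottleneck described above.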
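Continuing the same sketch (it reuses `embed`, `context`, and the sizes defined above), a decoder can consume that context vector as its initial hidden state and unroll its own RNN to emit the output sequence. The start-token id, the greedy argmax decoding, and the fixed three-step output length are again assumptions for illustration, not the video's code.

```python
# Continuing the sketch: a decoder seeded with the encoder's context vector.
decoder_rnn = nn.LSTMCell(embedding_dim, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)  # hidden state -> word scores

dec_h = context                       # the context vector initializes the decoder
dec_c = torch.zeros(1, hidden_size)
prev = embed(torch.tensor(0)).unsqueeze(0)  # hypothetical <start> token, id 0

output_ids = []
for _ in range(3):                    # fixed output length, just for illustration
    dec_h, dec_c = decoder_rnn(prev, (dec_h, dec_c))
    next_id = out_proj(dec_h).argmax(dim=-1)  # greedily pick the next word
    output_ids.append(next_id.item())
    prev = embed(next_id)             # feed the prediction back in as the next input

print(output_ids)                     # the (untrained) output sequence
```

The key design point mirrored here is that `context` is the decoder's only view of the input: once the encoder loop finishes, the decoder never sees the input words again, which is what attention will later change.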