Let’s look more closely at how sequence-to-sequence models work. We’ll start with a high-level look and then go deeper and deeper. Here are our two recurrent nets. The one on the left is called the encoder. It reads the input sequence, then hands over what it has understood to the RNN on the right, which we call the decoder, and the decoder generates the output sequence. The quote unquote “understanding” that gets handed over is a fixed-size tensor called a state, or sometimes the context. C is for context here. So no matter how short or long the inputs and outputs are, the context remains the same size that was declared when we built the model in the beginning.

So at this high level, inference works by handing inputs to the encoder. The encoder summarizes what it understood into a context variable, or state, and hands it over to the decoder, which then proceeds to generate the output sequence.

Now if we go a level deeper, we begin to see that since the encoder and decoder are both RNNs, they naturally have loops, and that’s what allows them to process these sequences of inputs and outputs. Let’s take an example. Say our model is a chatbot, and we want to ask it, “How are you,” question mark. So first, we have to tokenize that input and break it down into four tokens. And since it has four elements, it will take the RNN four timesteps to read in this entire sequence. Each time, it reads an input, does a transformation on its hidden state, then sends that hidden state out to the next timestep. The clock symbol indicates that we’re moving from one timestep to the next.

One useful way to represent the flow of data through an RNN is by “unrolling,” quote unquote, the RNN. That is, graphing it to show each timestep as a separate cell, even though, in practice, it’s just the same cell, only processing a new input and taking over the hidden state from the previous timestep.
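The encoder loop described above can be sketched in a few lines. This is a minimal NumPy sketch, not a real implementation: it assumes a plain tanh RNN cell, and the vocabulary, weight names, and sizes are all made up for illustration. The point is just that the same cell runs once per token, and whatever the input length, the final hidden state (the context) has the fixed size we chose up front.

```python
import numpy as np

# Toy vocabulary: our four tokens from "How are you ?"
tokens = ["how", "are", "you", "?"]
vocab = {tok: i for i, tok in enumerate(tokens)}

hidden_size = 8          # hyperparameter: size of the hidden state / context
vocab_size = len(vocab)

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden

def one_hot(idx):
    v = np.zeros(vocab_size)
    v[idx] = 1.0
    return v

# "Unrolled" encoder: one timestep per token, hidden state carried forward.
h = np.zeros(hidden_size)
for tok in tokens:
    x = one_hot(vocab[tok])
    h = np.tanh(W_xh @ x + W_hh @ h)   # same cell each step, new input

context = h   # fixed-size context, regardless of how long the input was
print(context.shape)   # always (hidden_size,)
```

Notice that a two-token or a twenty-token input would produce a context of exactly the same shape; only the number of loop iterations changes.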
So, what’s a hidden state, you may ask. In the simplest scenario, you can think of it as a number of hidden units inside the cell. In practice, it’s more likely to be the hidden state inside a long short-term memory cell, an LSTM. The size of the hidden state is another hyperparameter we can set when we build the model. The bigger the hidden state, the more capacity the model has to learn, look at patterns, and try to understand them, but the more resource-intensive the model will be to train or deploy, in terms of processing and memory demands. It’s the trade-off you usually face with models in general.

A similar process happens on the decoder side as well. We begin by feeding it the context generated by the encoder, and it generates the output element by element. If we unroll the decoder, just like we did earlier with the encoder, we can see that we actually feed back every element that it outputs. This allows it to be more coherent, as each timestep sort of remembers what the previous timestep has committed to. In the next video, we’ll go another level deeper into some of the internals of the architecture.
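That feed-back loop on the decoder side can be sketched the same way. Again a hedged NumPy toy, assuming a plain tanh cell and greedy decoding; the weights, the start-token id, and the output length here are all invented for illustration. The thing to see is on the last two lines of the loop: the token the decoder just produced becomes its input at the next timestep.

```python
import numpy as np

hidden_size, out_vocab = 8, 5
rng = np.random.default_rng(1)
W_xh = rng.normal(scale=0.1, size=(hidden_size, out_vocab))   # fed-back token -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size)) # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(out_vocab, hidden_size))   # hidden -> output scores

def one_hot(idx, n):
    v = np.zeros(n)
    v[idx] = 1.0
    return v

# Stand-in for the fixed-size context the encoder hands over.
context = rng.normal(size=hidden_size)

h = context
prev = 0                        # assumed start-of-sequence token id
outputs = []
for _ in range(4):              # generate four output tokens
    x = one_hot(prev, out_vocab)
    h = np.tanh(W_xh @ x + W_hh @ h)
    scores = W_hy @ h           # unnormalized scores over the output vocabulary
    prev = int(np.argmax(scores))  # pick a token...
    outputs.append(prev)           # ...and feed it back in at the next step

print(outputs)   # a sequence of output token ids
```

Because each step’s input is the previous step’s output, every timestep “remembers what the previous timestep has committed to,” which is exactly the coherence property described above.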