In feedforward neural networks, the output at any time is a function of the current input and the weights alone. We assume that the inputs are independent of each other, so the sequence carries no significance, and we can train the system by randomly drawing input and target pairs. In RNNs, the output at time t depends not only on the current input and weights, but also on previous inputs, such as the inputs at time t minus 1, t minus 2, and so on.

So how can we visualize that? Let’s look at our neural network again. We have an input x, an output y, and a hidden layer s. Remember s? It stands for state, a term we use when the system has memory. Wx represents the weight matrix connecting the inputs to the state layer, Wy represents the weight matrix connecting the state to the output, and Ws represents the weight matrix connecting the state from the previous timestep to the state in the next timestep. Notice that the state s is fed back into the system. In every single timestep, the system looks the same. This is what we call the folded model.

Since the input is spread over time and we perform the same task for every element in the sequence, we can unfold the model in time and represent it the following way. Just as an example, you can see that the output at time t plus 2 depends on the input at time t plus 2, on the weight matrices, and also on all previous inputs.

Let’s use the following definitions: x of t is the input vector at time t, y of t is the output vector at time t, and s of t is the hidden state vector at time t. In feedforward neural networks, we use an activation function to obtain the hidden layer h, and all we need are the inputs and the weight matrix connecting the inputs to the hidden layer. In RNNs, we use an activation function to obtain s, but with a slight twist.
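To make the state update concrete, here is a minimal NumPy sketch of a single recurrent step. The dimensions (3 input features, 4 state units), the random weights, and the choice of tanh as the activation function are all illustrative assumptions, not values from the lesson:

```python
import numpy as np

# Hypothetical sizes: 3 input features, 4 hidden (state) units.
rng = np.random.default_rng(0)
Wx = rng.normal(size=(3, 4))  # weights: input -> state
Ws = rng.normal(size=(4, 4))  # weights: previous state -> state

def step(x_t, s_prev):
    """One timestep: the new state depends on the current input AND the previous state."""
    return np.tanh(x_t @ Wx + s_prev @ Ws)

s = np.zeros(4)                       # initial state, before any input is seen
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 input vectors
    s = step(x_t, s)                  # s accumulates information across timesteps

print(s.shape)  # (4,)
```

Because the final s was produced by feeding each new state back into the next step, it is a function of the entire input sequence, which is exactly the "memory" described above.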
The input to the activation function is now the sum of two terms: one, the product of the inputs and their corresponding weight matrix Wx, and two, the product of the previous activation values and their corresponding weight matrix Ws. The output vector is calculated exactly as in feedforward neural networks. It can be a linear combination of the inputs to each output node with a corresponding weight matrix Wy, or, for example, a softmax function of that same linear combination.

Notice that RNNs share the same parameters across timesteps. So although intuitively RNNs appear more complicated than feedforward neural networks, since they have memory, in practice the number of parameters that need to be learned remains modest.

The unfolding scheme we mentioned is generic and can be adjusted according to the neural network architecture we aim to build. We can decide how many inputs and outputs we need. For example, in sentiment analysis, we can have many inputs and a single output that spans the spectrum from happy to sad. Another example is time series prediction, where we may have many inputs and many outputs which are not necessarily aligned.

RNNs can be stacked like Lego bricks, just as feedforward neural networks can. For example, the output of a first RNN layer, say, vector y, can become the input to a second layer whose output is vector o. Each layer operates independently of the other layers, since architecturally it doesn’t matter where the inputs come from or where the outputs are headed.
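Putting the state update and the output step together, the following sketch unrolls a small RNN over a sequence and applies a softmax at each timestep. The sizes (3 inputs, 4 state units, 2 output classes) and the random weights are assumptions chosen for illustration:

```python
import numpy as np

# Hypothetical sizes: 3 inputs, 4 state units, 2 output classes.
rng = np.random.default_rng(1)
Wx = rng.normal(size=(3, 4))  # input -> state
Ws = rng.normal(size=(4, 4))  # previous state -> state
Wy = rng.normal(size=(4, 2))  # state -> output

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def run(xs):
    """Unrolled RNN: the SAME Wx, Ws, Wy are reused at every timestep."""
    s = np.zeros(4)
    ys = []
    for x_t in xs:
        s = np.tanh(x_t @ Wx + s @ Ws)  # state update (sum of the two products)
        ys.append(softmax(s @ Wy))      # output: softmax of a linear combination
    return ys

ys = run(rng.normal(size=(6, 3)))
print(len(ys))  # 6 outputs, one per timestep; each softmax sums to 1
```

Note that only three weight matrices exist no matter how long the sequence is, which is the parameter sharing mentioned above. Stacking works the same way: the sequence of outputs ys here could serve as the input sequence to a second RNN layer with its own weight matrices.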