# 18 – 19 RNN BPTT A V6 Final

Hopefully, you are now feeling more confident and have a deeper conceptual understanding of RNNs. But how do we train such networks? How can we find a good set of weights that would minimize the error? You will see that our training framework is similar to what we’ve seen before, with a slight change in the backpropagation algorithm. When training RNNs, we use what we call backpropagation through time.

For simplicity, let’s agree that from this point on, whenever I refer to partial derivatives, I will simply say derivatives, as I will need to refer to them quite often. And before I start, just a small reminder: if you feel that taking notes has been working for you, please continue to do so.

To better understand backpropagation through time, we need a few notational definitions. We’ve seen them in a previous video, but let’s emphasize them again. The state vector s_t is given by applying an activation function, a hyperbolic tangent for example, to the sum of two products: the input vector x_t with the weight matrix Wx, and the previous state vector s_{t-1} with the weight matrix Ws. The output at time t is simply the product of the state vector s_t with the weight matrix Wy, unless you’re also applying a softmax function, for example. And the loss function, the squared error E_t, equals the square of the difference between the desired output and the network output at time t. In previous lessons, you’ve seen other error functions, such as the cross-entropy loss. But for consistency, we will stay with the same error we’ve used so far in this lesson.

In backpropagation through time, we don’t train the system at a specific time t independently. Instead, we train the system at a specific time t while also taking into account everything that has happened before. For example, assume that we are at timestep t = 3.
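
Before working through that example, it may help to collect the three definitions above in equation form. The symbols are a notational assumption, since the description gives them only in words: $\Phi$ stands for the activation function (for instance $\tanh$), $d_t$ for the desired output, and bias terms are left out because none are mentioned.

$$
\begin{aligned}
s_t &= \Phi\left(x_t W_x + s_{t-1} W_s\right) \\
y_t &= s_t W_y \\
E_t &= \left(d_t - y_t\right)^2
\end{aligned}
$$
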
Our squared error remains, as before, the square of the difference between the desired output and the network output, in this case at time t = 3. This is the folded scheme at this particular time. To update each weight matrix, we need to find the derivatives of the loss function at time t = 3 with respect to each of the weight matrices. In other words, we will modify each matrix using gradient descent, just as we did before in feedforward neural networks. But in addition, we also need to consider the previous timesteps. To better understand how to continue with this process of backpropagation through time, we will unfold this model. All this, in the next video.
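
As a rough illustration of the t = 3 example, here is a minimal sketch that assumes a scalar RNN, meaning every weight, input, and state is a single number rather than a matrix or vector. It estimates the derivatives of E_3 with respect to each weight by finite differences, not by the backpropagation through time formulas, and then takes one gradient descent step. The function names, weight values, inputs, target, and learning rate are all hypothetical, chosen only for illustration.

```python
import math

def forward(wx, ws, wy, xs, s0=0.0):
    """Run the scalar RNN over the inputs xs and return the outputs y_t."""
    s = s0
    ys = []
    for x in xs:
        s = math.tanh(wx * x + ws * s)  # state: s_t = tanh(x_t*Wx + s_{t-1}*Ws)
        ys.append(wy * s)               # output: y_t = s_t*Wy (no softmax here)
    return ys

def error_at_3(weights, xs, d3):
    """Squared error E_3 = (d_3 - y_3)^2 at timestep t = 3."""
    wx, ws, wy = weights
    y3 = forward(wx, ws, wy, xs[:3])[-1]
    return (d3 - y3) ** 2

def numerical_gradient(weights, xs, d3, eps=1e-6):
    """Central-difference estimates of dE_3/dWx, dE_3/dWs, dE_3/dWy."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += eps
        down = list(weights)
        down[i] -= eps
        diff = error_at_3(up, xs, d3) - error_at_3(down, xs, d3)
        grads.append(diff / (2 * eps))
    return grads

# Hypothetical values, chosen only for illustration.
weights = [0.5, 0.8, 1.2]  # Wx, Ws, Wy
xs = [1.0, 0.5, -0.3]      # x_1, x_2, x_3
d3 = 0.4                   # desired output at t = 3
alpha = 0.1                # learning rate

before = error_at_3(weights, xs, d3)
grads = numerical_gradient(weights, xs, d3)
weights = [w - alpha * g for w, g in zip(weights, grads)]  # one gradient descent step
after = error_at_3(weights, xs, d3)
print("E_3 before:", before, "E_3 after:", after)
```

Because s_3 is computed from s_2, and s_2 from s_1, changing Wx or Ws alters the state at every earlier timestep as well, so the derivatives of E_3 with respect to those weights include contributions from t = 1 and t = 2. Finite differences pick this up automatically, but they need two extra forward passes per weight, which is one reason an analytic procedure is preferred in practice.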