RNN BPTT

So, let’s unfold the model in time, clean up the sketch a bit, and focus on the third time step. In this model, we have three weight matrices that we want to modify: the weight matrix Wx linking the network inputs to the state, or hidden layer, the weight matrix Ws connecting one state to the next, and the weight matrix Wy connecting the state to the output.

Let’s start with adjusting Wy, which is the most straightforward to obtain. At time T equals 3, the derivative of the squared error with respect to Wy is found by a simple one-step chain rule: it equals the derivative of the squared error with respect to the output, multiplied by the derivative of the output with respect to the weight matrix Wy. As always, these derivatives are calculated with respect to each element of the weight matrix.

To adjust the other two matrices, we will need to use Backpropagation Through Time, and it doesn’t matter which one we choose to adjust first. Let’s choose to focus on the weight matrix Ws, the weight matrix connecting one state to the next, and remove anything we don’t need from the sketch. At first glance, it may seem that when finding the derivative with respect to Ws, we only need to consider state S3. That way, the gradient at time step T equals 3 would simply equal the derivative of the squared error with respect to the output, multiplied by the derivative of the output with respect to S3, multiplied by the derivative of S3 with respect to the matrix Ws.

But S3 also depends on the previous states, S2 and S1, so we can’t really stop there. We also need to take into account what happened before and add that contribution to our calculation. So, we will continue calculating the gradient, knowing that we need to accumulate the contributions originating from each of the previous states. When we consider S2, we have the following path contributing to the backpropagation: we can clearly see that S3 depends on S2, giving us the derivative calculation, using the chain rule, all the way back to the derivative of S2 with respect to the matrix Ws. But we’re not done yet; we need to go one more step back, to the first state S1, giving us the calculation, again by the chain rule, all the way back to the derivative of S1 with respect to the matrix Ws.

So, let’s look again at the accumulated gradient we now have by using Backpropagation Through Time, which we calculated considering all the state vectors we have: state vector S3, state vector S2, and state vector S1. Generally speaking, we consider multiple time steps back, and we need a general framework to define Backpropagation Through Time for the purpose of adjusting Ws. So, what’s next? In our next video, we will focus on adjusting the weight matrix Wx.
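For reference, the gradients described above can be written out as equations. This is only a sketch, under the assumption of a simple recurrent cell in which the state is s_t = f(Wx x_t + Ws s_{t-1}), the output is y̅_t = g(Wy s_t), and E_t is the squared error at time t; the activation functions f and g are placeholders and are not named in the narration.

\[
\frac{\partial E_3}{\partial W_y}
  = \frac{\partial E_3}{\partial \bar{y}_3}\,
    \frac{\partial \bar{y}_3}{\partial W_y}
\]

\[
\frac{\partial E_3}{\partial W_s}
  = \frac{\partial E_3}{\partial \bar{y}_3}\,
    \frac{\partial \bar{y}_3}{\partial s_3}\,
    \frac{\partial s_3}{\partial W_s}
  + \frac{\partial E_3}{\partial \bar{y}_3}\,
    \frac{\partial \bar{y}_3}{\partial s_3}\,
    \frac{\partial s_3}{\partial s_2}\,
    \frac{\partial s_2}{\partial W_s}
  + \frac{\partial E_3}{\partial \bar{y}_3}\,
    \frac{\partial \bar{y}_3}{\partial s_3}\,
    \frac{\partial s_3}{\partial s_2}\,
    \frac{\partial s_2}{\partial s_1}\,
    \frac{\partial s_1}{\partial W_s}
\]

The general framework over N time steps collapses these accumulated terms into a single sum:

\[
\frac{\partial E_N}{\partial W_s}
  = \sum_{i=1}^{N}
    \frac{\partial E_N}{\partial \bar{y}_N}\,
    \frac{\partial \bar{y}_N}{\partial s_i}\,
    \frac{\partial s_i}{\partial W_s}
\]

Here each ∂s_i/∂Ws is the direct partial derivative, with the earlier state s_{i-1} held fixed, so the dependence on previous states appears through the chain of ∂s_t/∂s_{t-1} factors, which is exactly the accumulation walked through above.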
