19 – 21 RNN BPTT C V7 Final

We still have to adjust Wx, the weight matrix connecting the input layer to the hidden, or state, layer. Let’s simplify the sketch and leave only what we need. You will see that the process we follow to adjust Wx is very similar to the one we used when updating Ws. Having said that, let’s go over the process in detail.

If we look at timestep t = 3, the error with respect to the matrix Wx depends not only on vector S3 but also on S2 and its predecessor S1, which are all affected by the same matrix Wx. At first glance, it may seem that we need to consider only vector S3. In that case the derivative at timestep t = 3, by using the chain rule of course, simply equals the derivative of the squared error with respect to the output y3, multiplied by the derivative of the output with respect to S3, and finally multiplied by the derivative of S3 with respect to the matrix Wx.

But let’s go back for a bit. As we said, S3 also depends on S2 and S1, which are all affected by the same matrix Wx. So the gradient that we’re looking for is not only the product of the three derivatives we just saw; it is the accumulation of all of the contributions originating from each of the previous states.

So let’s consider the previous state, S2. Again, by using the chain rule, we can follow the path from the output y3 through S3 to S2, and from S2 to Wx, which gives us an additional contribution to the overall gradient.

But we are not done yet. We have one more state, S1, to consider, and we will add its contribution to the overall accumulated gradient. Starting from the output and backpropagating through S3 and S2 to the first state, S1, we get one more component of the overall gradient.

Let’s look again at the accumulated gradient we have using backpropagation through time, which we calculated by considering all the state vectors we have: S3, S2, and S1. This is the complete gradient needed for the purpose of correctly updating the matrix Wx.
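Written out, the three contributions described above can be sketched as the equation below. The symbol E_3 for the squared error at timestep t = 3 is notation assumed here for illustration; the lowercase s_1, s_2, s_3 and y_3 stand for the state vectors S1, S2, S3 and the output y3 from the text. The first term is the direct path through S3, the second goes back one step to S2, and the third goes back two steps to S1.

```latex
\frac{\partial E_3}{\partial W_x}
  = \frac{\partial E_3}{\partial y_3}
    \frac{\partial y_3}{\partial s_3}
    \frac{\partial s_3}{\partial W_x}
  + \frac{\partial E_3}{\partial y_3}
    \frac{\partial y_3}{\partial s_3}
    \frac{\partial s_3}{\partial s_2}
    \frac{\partial s_2}{\partial W_x}
  + \frac{\partial E_3}{\partial y_3}
    \frac{\partial y_3}{\partial s_3}
    \frac{\partial s_3}{\partial s_2}
    \frac{\partial s_2}{\partial s_1}
    \frac{\partial s_1}{\partial W_x}
```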
Generally speaking, we need to consider multiple past timesteps, not just three as in this example, so we need a general framework that defines backpropagation through time for the purpose of updating Wx.
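To make the idea of accumulating over an arbitrary number of timesteps concrete, here is a minimal NumPy sketch. It is not taken from the lecture: the tanh state update, the linear output layer, the 1/2 factor in the squared error, and all function and variable names (forward, loss, bptt_grad_Wx, numerical_grad_Wx) are assumptions made for illustration. The loop walks backward from the last state and adds one contribution to the Wx gradient per timestep, and the result is compared against a finite-difference estimate.

```python
import numpy as np


def forward(Wx, Ws, Wy, xs, s0):
    """Run the RNN over all inputs; return every state and the final output."""
    states = [s0]
    for x in xs:
        # Assumed state update: s_t = tanh(Wx x_t + Ws s_{t-1})
        states.append(np.tanh(Wx @ x + Ws @ states[-1]))
    y = Wy @ states[-1]  # assumed linear output layer
    return states, y


def loss(Wx, Ws, Wy, xs, s0, d):
    """Squared error at the last timestep (the 1/2 factor is an assumption)."""
    _, y = forward(Wx, Ws, Wy, xs, s0)
    return 0.5 * np.sum((d - y) ** 2)


def bptt_grad_Wx(Wx, Ws, Wy, xs, s0, d):
    """Accumulate the gradient of the loss w.r.t. Wx over all timesteps."""
    states, y = forward(Wx, Ws, Wy, xs, s0)
    grad = np.zeros_like(Wx)
    delta = Wy.T @ (y - d)  # derivative of the error w.r.t. the last state
    for t in range(len(xs), 0, -1):
        # Backpropagate through tanh at step t (states[t] is s_t).
        dpre = delta * (1.0 - states[t] ** 2)
        # Contribution of s_t's direct dependence on Wx (xs[t - 1] is x_t).
        grad += np.outer(dpre, xs[t - 1])
        # Pass the error one step further back, to s_{t-1}.
        delta = Ws.T @ dpre
    return grad


def numerical_grad_Wx(Wx, Ws, Wy, xs, s0, d, eps=1e-6):
    """Central finite-difference estimate, used only as a sanity check."""
    grad = np.zeros_like(Wx)
    for i in range(Wx.shape[0]):
        for j in range(Wx.shape[1]):
            W_plus, W_minus = Wx.copy(), Wx.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (loss(W_plus, Ws, Wy, xs, s0, d)
                          - loss(W_minus, Ws, Wy, xs, s0, d)) / (2 * eps)
    return grad


# Small random problem; the sizes and the number of timesteps are arbitrary.
rng = np.random.default_rng(0)
n_in, n_state, n_out, n_steps = 4, 3, 2, 5
Wx = rng.normal(scale=0.5, size=(n_state, n_in))
Ws = rng.normal(scale=0.5, size=(n_state, n_state))
Wy = rng.normal(scale=0.5, size=(n_out, n_state))
xs = [rng.normal(size=n_in) for _ in range(n_steps)]
s0 = np.zeros(n_state)
d = rng.normal(size=n_out)  # desired output at the last timestep

analytic = bptt_grad_Wx(Wx, Ws, Wy, xs, s0, d)
numeric = numerical_grad_Wx(Wx, Ws, Wy, xs, s0, d)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

Setting n_steps to 3 reproduces the three-state case discussed above: the loop then adds exactly three terms, one each for S3, S2, and S1.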
