20 – RNN Summary

To summarize what we’ve discussed, we now understand that in RNNs the current state depends on the inputs as well as on the previous states, passed through an activation function such as the hyperbolic tangent, the sigmoid, or the ReLU. The current output is a simple linear combination of the current state elements with the corresponding weight matrix, and we can also apply a softmax function to the outputs. This is the folded form of the RNN scheme. When we have only one hidden layer, we need to update three weight matrices: Wy, connecting the state to the output; Ws, connecting the state to itself; and Wx, connecting the input to the state. This is the unfolded model, and it is the one we have mostly been using (there is a short code sketch of this forward pass at the end of this summary).

Let’s look at the gradient calculations again. Calculating the gradient of the squared error, or what we call the loss function, with respect to the weight matrix Wy was straightforward. (You can of course choose other error functions.) When calculating the gradients with respect to Ws and Wx, we need to be more careful and consider what happened in previous time steps, accumulating all of these contributions. For these calculations we used a process called Backpropagation Through Time.

As we discussed earlier, when using backpropagation we can choose to use mini-batches, and we can do the same in RNNs. Updating the weights in Backpropagation Through Time can also be performed periodically, in batches, as opposed to once for every input sample. As a reminder, we calculate the gradient at each step but do not update the weights right away; we can choose to update them once every fixed number of steps, for example 20. This helps reduce the complexity of the training process and can also remove noise from the weight updates, since averaging a set of noisy samples tends to yield a less noisy value.

You may ask yourselves: what happens when we have many time steps, and not just the few we had in our previous example? Remember the Backpropagation Through Time example? We had only three time steps. We may also have more than one hidden layer. So what happens then? Can we simply accumulate the gradient contributions from each of these time steps? The simple answer is no, we can’t. Studies have shown that for up to a small number of time steps, say eight or ten, we can use this method effectively. If we backpropagate further, the gradient becomes too small. This is known as the Vanishing Gradient Problem, where the contribution of the information decays geometrically over time; in other words, temporal dependencies that span many time steps, for example more than eight or ten, will effectively be discarded by the network. So how do we address the Vanishing Gradient Problem? Long Short-Term Memory cells, or LSTMs for short, were invented specifically to address this issue, and they will be the topic of our next set of videos.

The other scenario to be aware of is that of exploding gradients, in which the value of the gradient grows uncontrollably. Luckily, a simple scheme called Gradient Clipping practically resolves this issue. Basically, what we do is check at each time step whether the gradient exceeds a certain threshold; if it does, we normalize it. Normalizing means that we penalize very large gradients more strongly than those that are only slightly above our threshold. Clipping large gradients this way helps avoid the Exploding Gradient Problem.
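As a quick reference, here is a minimal NumPy sketch of the forward pass summarized above, written in terms of the same three weight matrices Wx, Ws, and Wy. The function names, the dimensions, and the choice of tanh for the state activation and softmax for the output are illustrative assumptions rather than a fixed prescription.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D vector.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(x_seq, s0, Wx, Ws, Wy):
    """Unfolded forward pass of a single-hidden-layer RNN (illustrative sketch).

    x_seq : list of input vectors, one per time step
    s0    : initial state vector
    Wx    : input-to-state weights,  shape (state_dim, input_dim)
    Ws    : state-to-state weights,  shape (state_dim, state_dim)
    Wy    : state-to-output weights, shape (output_dim, state_dim)
    """
    s = s0
    states, outputs = [], []
    for x in x_seq:
        # The current state depends on the current input and the previous state.
        s = np.tanh(Wx @ x + Ws @ s)
        # The output is a linear combination of the state, here passed through softmax.
        y = softmax(Wy @ s)
        states.append(s)
        outputs.append(y)
    return states, outputs
```

During training, the states collected here are what Backpropagation Through Time walks backwards over when accumulating the gradient contributions for Ws and Wx.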
You can find more information about Gradient Clipping in the text following this video.
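To make the clipping step concrete, here is a minimal sketch of norm-based gradient clipping along the lines described above; the helper name clip_gradient and the threshold value of 5.0 are illustrative assumptions.

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds a chosen threshold (illustrative)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        # Scale the gradient back down so its norm equals the threshold.
        grad = grad * (threshold / norm)
    return grad
```

Because the gradient is rescaled by threshold / norm, the further it exceeds the threshold, the more strongly it is shrunk, which matches the idea of penalizing very large gradients more than those only slightly above the threshold.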
