2 – Learning Rate

The learning rate is the most important hyperparameter. Even if you apply models that other people built to your own data set, you’ll probably have to try a number of different values for the learning rate to get the model to train properly. If you took care to normalize the inputs to your model, then a good starting point is usually 0.01. And these are the usual suspects of learning rates. If you try one and your model doesn’t train, you can try the others from this list. Which of the others should you try? That depends on the behavior of the training error. To better understand this, we’ll need to look at the intuition behind the learning rate.
Earlier in the course, we saw that when we use gradient descent to train a neural network model, the training task boils down to decreasing the error value calculated by the loss function as much as we can. During a learning step, we do a forward pass through the model, calculate the loss, then find the gradient. Let’s assume the simplest case, in which our model has only one weight. The gradient will tell us which way to nudge the current weight so that our predictions become more accurate.
To visualize the dynamics of the learning rate, let us plot the value of the weight versus the error value we obtain by using it to calculate a prediction for a random training data point. In the perfect case, and if we zoom in to the correct part of the curve, the relationship between the weight and error values looks like this idealized U-shape. Choosing a random weight and calculating the error value would give us a point like this one on the curve. Note that we don’t know what the curve looks like when we start. The only way to know that is to calculate the error at every weight point. We’re only drawing it here to clarify these dynamics. Calculating the gradient would tell us which direction to go to decrease the error.
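As a minimal sketch of the learning step described above (forward pass, loss, gradient) in the single-weight case, assume a hypothetical model `prediction = w * x` with a squared-error loss. Neither the model nor the loss is specified in the lecture, and the function names and numbers here are illustrative only.

```python
def forward(w, x):
    """Forward pass of a one-weight model (assumed form: w * x)."""
    return w * x

def loss(w, x, y):
    """Squared error between the prediction and the target (assumed loss)."""
    return (forward(w, x) - y) ** 2

def gradient(w, x, y):
    """Derivative of the squared error with respect to w."""
    return 2 * (forward(w, x) - y) * x

# One training data point; for this point the best weight is 3.0.
x, y = 2.0, 6.0
w = 0.5  # an arbitrary starting weight

g = gradient(w, x, y)
# A negative gradient means the error falls as w grows, so we should increase w.
direction = "increase" if g < 0 else "decrease"
print(f"error={loss(w, x, y)}, gradient={g}, so we should {direction} the weight")
```

The sign of the gradient gives the direction to move, and its magnitude grows the farther the weight sits from the bottom of the curve.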
If we do the calculation correctly, the gradient will point out which direction to go, meaning whether we should increase or decrease the current value of the weight. The learning rate is the multiplier we use to push the weight in the right direction. Now, if we had made a miraculously correct choice for our learning rate, then we’d land on the best weight after only one training step. If the learning rate we chose was smaller than the ideal rate, then that’s OK. Our model can continue to learn until it finds a good value for the weight. So each training step, it’ll take a step closer, until it lands on that best weight there. If, however, the learning rate had been too small, then our training error would be decreasing, but very slowly, and we might go through hundreds or thousands of training steps without reaching the best value for our model. And it’s clear that in cases like this, what we need to do is to increase the learning rate.
Another case is when we choose a learning rate that is larger than the ideal learning rate. Our updated value would overshoot the ideal weight value. And then on the next update, it would overshoot the other way, but it will keep getting closer, and it would probably converge to a reasonable value. Where this becomes problematic, though, is when we choose a learning rate that is much larger than the ideal rate, more than twice as much. In this case, we will see the weight taking a large step that not only overshoots the ideal weight, but actually gets farther and farther from the best error that we can get at every step. A contributor to this divergence is the gradient. The gradient contributes not only a direction but also a value that corresponds to the slope of the line tangent to the curve at that point. The higher the point is on the curve, the steeper the slope and the larger the value of the gradient. So this makes the problem of the large learning rate even worse.
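The cases above can be reproduced with a small sketch, assuming an idealized error curve `E(w) = (w - best)**2` with `best = 3.0` (an illustrative choice, not from the lecture). For this particular quadratic curve the ideal learning rate is 0.5, and anything above twice that diverges. On other curves the threshold will differ.

```python
def train(lr, w=0.0, steps=20, best=3.0):
    """Run gradient descent on E(w) = (w - best)**2 and record the error."""
    errors = [(w - best) ** 2]
    for _ in range(steps):
        grad = 2 * (w - best)  # slope of the error curve at w
        w = w - lr * grad      # step against the gradient, scaled by lr
        errors.append((w - best) ** 2)
    return w, errors

IDEAL_LR = 0.5  # for this curve, one step with lr = 0.5 lands on `best`

for label, lr in [
    ("much too small", 0.001),
    ("smaller than ideal", 0.1),
    ("ideal", IDEAL_LR),
    ("larger than ideal", 0.9),
    ("more than twice ideal", 1.1),
]:
    w, errors = train(lr)
    print(f"{label:>22}: lr={lr:<5} final w={w:.4f} final error={errors[-1]:.4g}")
```

Watching whether the error shrinks slowly, shrinks while bouncing from side to side, or grows at every step is the signal for raising or lowering the learning rate.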
So if our training error is actually increasing, we might want to try decreasing the learning rate and see what happens. These are the general cases you’ll come across when tuning your learning rate. But note that this is a simple example with only one parameter and an ideal convex error curve. Things are more complicated in the real world, as I’m sure you’ve seen. Your models are likely to have hundreds or thousands of parameters, each with its own error curve that changes as the values of the other weights change. And the learning rate has to shepherd all of them to the best values that produce the least error. To make matters even more difficult for us, we don’t actually have any guarantees that the error curves will be clean U-shapes. They might in fact be more complex shapes with local minima that the learning algorithm can mistake for the best values and converge on. This figure is obviously an oversimplification, because the error curve of three parameters versus the error value is actually a surface in four-dimensional space and hard to visualize.
So, think of the complexity that these incredible algorithms can help us overcome. It’s easy to be intimidated by a thousand or a million weights, each with its own error curve that depends on all the other values, plus each weight having a random starting value and our diligent learning rate pushing them left and right in order to fit our training data and find the best model. But this is a monster you have slain over and over again in the exercises and projects of this nanodegree. I just wanted to take a second and reflect on the incredible power we wield armed with these brilliant algorithms.
Now that we’ve looked at the intuition behind the learning rate and the indications that the training error gives us to help tune it, let’s look at one specific case we can often face when tuning the learning rate. Think of the case where we chose a reasonable learning rate.
It manages to decrease the error, but only up to a point, after which it’s unable to descend any further, even though it hasn’t reached the bottom yet. It would be stuck oscillating between values that still have a better error value than when we started training, but are not the best values possible for the model. This scenario is where it’s useful to have our training algorithm decrease the learning rate throughout the training process. This is a technique called learning rate decay. Intuitive ways to do this can be by decreasing the learning rate linearly, or in steps: say, decrease it by half every five epochs, like this example here. You can also decrease the learning rate exponentially. So, for example, we’d multiply the learning rate by 0.1 every 8 epochs.
In addition to simply decreasing the learning rate, there are more clever learning algorithms that have an adaptive learning rate. These algorithms adjust the learning rate based on what the learning algorithm knows about the problem and the data that it’s seen so far. This means not only decreasing the learning rate when needed, but also increasing it when it appears to be too low. Below this video, you’ll find some instructions for using an adaptive learning algorithm in TensorFlow.
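The two decay examples mentioned above (halving every five epochs, and multiplying by 0.1 every 8 epochs) can be sketched as a simple schedule function. This is plain Python for illustration, not a TensorFlow API. The name `step_decay` and the starting rate of 0.01 are assumptions made here.

```python
def step_decay(initial_lr, factor, every, epoch):
    """Learning rate after multiplying by `factor` once per `every` epochs."""
    return initial_lr * factor ** (epoch // every)

INITIAL_LR = 0.01  # assumed starting point, matching the suggestion earlier in the lecture

for epoch in range(0, 17, 4):
    halved = step_decay(INITIAL_LR, 0.5, 5, epoch)  # halve every 5 epochs
    tenth = step_decay(INITIAL_LR, 0.1, 8, epoch)   # multiply by 0.1 every 8 epochs
    print(f"epoch {epoch:2d}: halving schedule={halved:.6f}, x0.1 schedule={tenth:.6f}")
```

In a training loop, the value returned for the current epoch would be used as the learning rate for every update in that epoch.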
