So, here’s another way to solve a local minimum problem. The idea is to walk a bit fast with momentum and determination in a way that if you get stuck in a local minimum, you can, sort of, power through and get over the hump to look for a lower minimum. So let’s look at what normal gradient descent does. It gets us all the way here. No problem. Now, we want to go over the hump but by now the gradient is zero or too small, so it won’t give us a good step. What if we look at the previous ones? What about say the average of the last few steps. If we take the average, this will takes us in direction and push us a bit towards the hump. Now the average seems a bit drastic since the step we made 10 steps ago is much less relevant than the step we last made. So, we can say, for example, the average of the last three or four steps. Even better, we can weight each step so that the previous step matters a lot and the steps before that matter less and less. Here is where we introduce momentum. Momentum is a constant beta between 0 and 1 that attaches to the steps as follows: the previous step gets multiplied by 1, the one before, by beta, the one before, by beta squared, the one before, by beta cubed, etc. In this way, the steps that happened a long time ago will matter less than the ones that happened recently. We can see that that gets us over the hump. But now, once we get to the global minimum, it’ll still be pushing us away a bit but not as much. This may seem vague, but the algorithms that use momentum seem to work really well in practice.