There are many optimizers in Keras, that we encourage you to explore further, in this link, or in this excellent blog post. These optimizers use a combination of the tricks above, plus a few others. Some of the most common are:
This is Stochastic Gradient Descent. It uses the following parameters:
- Learning rate.
- Momentum (This takes the weighted average of the previous steps, in order to get a bit of momentum and go over bumps, as a way to not get stuck in local minima).
- Nesterov Momentum (This slows down the gradient when it’s close to the solution).
Adam (Adaptive Moment Estimation) uses a more complicated exponential decay that consists of not just considering the average (first moment), but also the variance (second moment) of the previous steps.
RMSProp (RMS stands for Root Mean Squared Error) decreases the learning rate by dividing it by an exponentially decaying average of squared gradients.