The first problem we’re going to address is the overestimation of action values that Q-learning is prone to. Let’s look back at the update rule for Q-learning with function approximation, and focus on the TD target. Here the max operation is necessary to find the best possible value we could get from the next state. To understand this better, let’s rewrite the target and expand the max operation. It is just a more efficient way of saying that we want to obtain the Q-value for the state S’, using the action that results in the maximum Q-value among all possible actions from that state.

When we write it this way, we can see that it’s possible for the argmax operation to make a mistake, especially in the early stages. Why? Because the Q-values are still evolving, and we may not have gathered enough information to figure out the best action. The accuracy of our Q-values depends a lot on which actions have been tried and which neighboring states have been explored. In fact, it has been shown that this results in an overestimation of Q-values, since we always pick the maximum among a set of noisy numbers.

So maybe we shouldn’t blindly trust these values. What can we do to make our estimation more robust? One idea that has been shown to work very well in practice is called Double Q-Learning, where we select the best action using one set of parameters w, but evaluate it using a different set of parameters w’. It’s basically like having two separate function approximators that must agree on the best action. If w picks an action that is not the best according to w’, then the Q-value returned is not that high. In the long run, this prevents the algorithm from propagating incidental high rewards that may have been obtained by chance and don’t reflect long-term returns. Now you may be thinking, where do we get this second set of parameters from?
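To see why taking a max over noisy numbers overestimates, here is a minimal NumPy sketch (not from the lecture, just an illustration). We pretend every action’s true Q-value is zero, add noise to get two independent estimates, and compare the single-estimator target (max over one noisy set) with the double-estimator target (select the argmax with one set, evaluate it with the other). All names and the noise model are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: every action's true Q-value is 0, but our estimates
# are noisy. Max over noisy estimates is biased upward; the true max is 0.
n_actions, n_trials = 10, 10_000

# Two independent noisy estimates of the same (all-zero) true values.
q1 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
q2 = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Single estimator: select AND evaluate with the same noisy values.
single = q1.max(axis=1).mean()

# Double estimator: select the argmax with q1, evaluate it with q2.
best = q1.argmax(axis=1)
double = q2[np.arange(n_trials), best].mean()

print(f"single estimator: {single:+.3f}")  # well above the true value 0
print(f"double estimator: {double:+.3f}")  # close to the true value 0
```

Because q2’s noise is independent of which action q1 happened to inflate, the double estimate is unbiased here, while the single estimate inherits the upward bias of max-of-noise.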
In the original formulation of Double Q-Learning, you would basically maintain two value functions and randomly choose one of them to update at each step, using the other only to evaluate actions. But when using DQNs with fixed Q targets, we already have an alternate set of parameters. Remember w-minus? It turns out that since w-minus is kept frozen for a while, it is different enough from w to be reused for this purpose. And that’s it: this simple modification keeps Q-values in check, preventing them from exploding in the early stages of learning or fluctuating later on. The resulting policies have also been shown to perform significantly better than those of vanilla DQNs.