The Deep Q-Network algorithm has generated a lot of buzz around deep RL since 2013. It's more or less an online version of the neural fitted Q-iteration work from 2005 by Martin Riedmiller, which introduced training of a Q-value function represented by a multilayer perceptron. DQN adds a few very useful tweaks, though. The first addition is the use of a rolling history of past data via a replay pool. By sampling from the replay pool, the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations. This also has the advantage that each step of experience is potentially used in many weight updates. The other big idea is the use of a target network to represent the old Q-function, which is used to compute the targets in the loss during training. Why not use a single network? The issue is that at each step of training the Q-values change, so the value estimates can easily spiral out of control. These additions enable RL agents to converge more reliably during training.
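To make the two mechanisms concrete, here is a minimal sketch of a replay pool and a hard target-network sync in plain Python. All names (`ReplayBuffer`, `sync_target`) are illustrative, and the "network parameters" are just dictionaries standing in for real weight tensors; this is not the DQN authors' implementation, only the shape of the idea.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size rolling pool of past transitions.

    Sampling uniformly at random breaks the temporal correlation of
    consecutive experience and lets each transition contribute to
    many weight updates (illustrative sketch, not the original code).
    """

    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition once full,
        # giving the "rolling history" behavior.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch: the update is averaged over many
        # previous behavior states, smoothing out learning.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target(online_params, target_params):
    """Hard-copy the online network's weights into the target network.

    Between syncs the target network is frozen, so the targets in the
    loss do not shift at every gradient step. Here "params" are plain
    dicts standing in for weight tensors (hypothetical representation).
    """
    target_params.clear()
    target_params.update(online_params)
```

In a training loop you would push each transition into the buffer, sample minibatches to compute the TD loss against the frozen target network, and call `sync_target` only every fixed number of steps.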