6 – Deep Q-Learning Algorithm

We’re now ready to take a look at the Deep Q-Learning algorithm and implement it on our own. There are two main processes that are interleaved in this algorithm. One is where we sample the environment by performing actions and store the observed experience tuples in a replay memory. The other is where we select a small batch of tuples from this memory, randomly, and learn from that batch using a gradient descent update step. These two processes are not directly dependent on each other. So, you could perform multiple sampling steps and then one learning step, or even multiple learning steps with different random batches.
The rest of the algorithm is designed to support these steps. In the beginning, you need to initialize an empty replay memory D. Note that memory is finite, so you may want to use something like a circular queue that retains the N most recent experience tuples. Then, you also need to initialize the parameters, or weights, w of your neural network. There are certain best practices that you can use. For instance, sample the weights randomly from a normal distribution with variance equal to two divided by the number of inputs to each neuron. These initialization methods are typically available in modern deep learning libraries like Keras and TensorFlow, so you won’t need to implement them yourself. To use the fixed Q-targets technique, you need a second set of parameters w⁻, which you can initialize to w.
Now, remember that this specific algorithm was designed to work with video games. So, for each episode, and each time step t within that episode, you observe a raw screen image or input frame x_t, which you need to convert to grayscale, crop to a square size, and so on. Also, in order to capture temporal relationships, you can stack a few input frames to build each state vector. Let’s denote this pre-processing and stacking operation by the function phi, which takes a sequence of frames and produces some combined representation.
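To make the replay memory D described above more concrete, here is a minimal sketch in Python. It assumes a `collections.deque` with a `maxlen` as the circular queue and uniform random sampling for the batches. The names `ReplayMemory`, `Experience`, and the `done` flag are hypothetical, chosen for illustration, and are not taken from the DQN paper.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one experience tuple (names are illustrative).
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayMemory:
    """Finite memory that retains only the N most recent experience tuples."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item once it is full,
        # which gives the circular-queue behaviour.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Sampling process: store away one observed experience tuple.
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Learning process: pick a small batch uniformly at random,
        # without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Usage: capacity 3, so only the 3 most recent of 5 tuples survive.
memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.store(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = memory.sample(batch_size=2)
```

Plain integers stand in for states here. In the video game setting, each state would be the stacked, pre-processed frames produced by phi.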
Note that if we want to stack, say, four frames, we’ll have to do something special for the first three time steps. For instance, we can treat those missing frames as blank, or just use copies of the first frame, or we can just skip storing the experience tuples till we get a complete sequence. In practice, you won’t be able to run the learning step immediately. You will need to wait till you have a sufficient number of tuples in memory. Note that we do not clear out the memory after each episode. This enables us to recall and build batches of experiences from across episodes.
There are many other techniques and optimizations that are used in the DQN paper, such as reward clipping, error clipping, storing past actions as part of the state vector, dealing with terminal states, decaying epsilon over time, et cetera. I encourage you to read the paper, especially the methods section, before trying to implement the algorithm yourself. Note that you may need to choose which techniques you apply and adapt them for different types of environments.
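As a rough illustration of the frame-stacking and warm-up points above, the sketch below takes one of the options mentioned: padding the missing history with copies of the first frame. `FrameStacker` and `ready_to_learn` are hypothetical helpers, and the rule of waiting until memory holds at least one full batch is an assumption, not a value from the paper.

```python
from collections import deque


class FrameStacker:
    """Hypothetical helper that builds a state from the last k frames."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Start of an episode: fill the missing history with copies
        # of the first frame so the state is complete from step one.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return tuple(self.frames)

    def step(self, frame):
        # Later steps: the oldest frame drops out automatically.
        self.frames.append(frame)
        return tuple(self.frames)


def ready_to_learn(num_stored, batch_size):
    # Assumed warm-up rule: learn only once a full batch can be drawn.
    return num_stored >= batch_size


# Usage: strings stand in for pre-processed frames.
stacker = FrameStacker(k=4)
s0 = stacker.reset("x1")
s1 = stacker.step("x2")
```

In a real agent, each frame would be an image array that has already been converted to grayscale and cropped, and the stacker would be reset at the start of every episode while the replay memory is left intact.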