All right. The next issue we’ll look at is related to experience replay. Recall the basic idea behind it: we interact with the environment to collect experience tuples, save them in a buffer, and then later randomly sample a batch to learn from. This helps us break the correlation between consecutive experiences and stabilizes our learning algorithm. So far so good. But some of these experiences may be more important for learning than others, and these important experiences might occur infrequently. If we sample batches uniformly, then these experiences have a very small chance of getting selected, and since buffers are limited in capacity in practice, older important experiences may get lost. This is where the idea of prioritized experience replay comes in. But what criterion should we use to assign a priority to each tuple? One approach is to use the TD error, delta. The bigger the error, the more we expect to learn from that tuple. So, let’s take the magnitude of this error as a measure of priority and store it along with each corresponding tuple in the replay buffer. When creating batches, we can use this value to compute a sampling probability: select any tuple i with a probability equal to its priority value p_i, normalized by the sum of all priority values in the replay buffer. When a tuple is picked, we can update its priority with a newly computed TD error using the latest Q-values. This seems to work fairly well and has been shown to reduce the number of batch updates needed to learn a value function. Still, there are a couple of things we can improve. First, note that if the TD error is zero, then the priority value of the tuple, and hence its probability of being picked, will also be zero. But a zero or very low TD error doesn’t necessarily mean we have nothing more to learn from such a tuple; it might be that our estimate was close simply because of the limited samples we had visited up to that point.
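To make these mechanics concrete, here is a minimal sketch of a priority-based replay buffer in Python. The names (`PrioritizedReplayBuffer`, `Experience`) are illustrative, not from any particular library, and a real implementation would typically use a sum-tree so sampling doesn’t require recomputing the full sum of priorities each time.

```python
import random
from collections import namedtuple

# Illustrative experience tuple: (state, action, reward, next_state, done)
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class PrioritizedReplayBuffer:
    """Sketch of a replay buffer that samples tuples in proportion to |TD error|."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []      # stored experience tuples
        self.priorities = []  # |TD error| for each stored tuple

    def add(self, experience, td_error):
        # Evict the oldest tuple once the buffer is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(experience)
        self.priorities.append(abs(td_error))

    def sample(self, batch_size):
        # P(i) = p_i / sum_k p_k  -- sample in proportion to priority.
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idxs = random.choices(range(len(self.buffer)), weights=probs, k=batch_size)
        return idxs, [self.buffer[i] for i in idxs]

    def update_priority(self, idx, td_error):
        # After replaying tuple idx, refresh its priority with the latest TD error.
        self.priorities[idx] = abs(td_error)
```

The `sample` method returns the sampled indices alongside the batch so the caller can feed freshly computed TD errors back through `update_priority`.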
So, to prevent such tuples from being starved for selection, we can add a small constant e to every priority value. Another issue along similar lines is that greedily using these priority values may lead to a small subset of experiences being replayed over and over, resulting in overfitting to that subset. To avoid this, we can reintroduce some element of uniform random sampling. This adds another hyperparameter, a, which we use to redefine the sampling probability as priority p_i raised to the power a, divided by the sum of all priorities p_k, each raised to the power a. We can control how much we want to use priorities versus randomness by varying this parameter, where a equals zero corresponds to pure uniform randomness and a equals one uses only priorities. When we use prioritized experience replay, we have to make one adjustment to our update rule. Remember that our original Q-learning update is derived from an expectation over all experiences. When using a stochastic update rule, the way we sample these experiences must match the underlying distribution they came from. This is preserved when we sample experience tuples uniformly from the replay buffer, but it is violated when we use non-uniform sampling, for example, using priorities. The Q-values we learn will be biased according to these priority values, which we only wanted to use for sampling. To correct for this bias, we need to introduce an importance-sampling weight equal to one over N, where N is the size of the replay buffer, times one over the sampling probability P(i). We can add another hyperparameter, b, and raise each importance-sampling weight to the power b to control how much these weights affect learning. In fact, these weights are more important towards the end of learning, when your Q-values begin to converge. So, you can increase b from a low value towards one over time.
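The two refinements above, the smoothed and exponentiated sampling probabilities and the importance-sampling weights, can be sketched as two small functions. The values a = 0.6 and b = 0.4 are just illustrative defaults, not prescribed by this lesson; normalizing the weights by their maximum is a common stabilization trick so that weights only ever scale updates down.

```python
def sampling_probabilities(td_errors, a=0.6, eps=1e-3):
    # p_i = (|delta_i| + e) ** a, then normalize:
    # P(i) = p_i ** a / sum_k p_k ** a
    scaled = [(abs(d) + eps) ** a for d in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, b=0.4):
    # w_i = (1/N * 1/P(i)) ** b, then divide by max(w) so no weight exceeds 1.
    n = len(probs)
    weights = [(1.0 / (n * p)) ** b for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]
```

Setting a = 0 makes every scaled priority equal, recovering uniform sampling, while b is typically annealed from its starting value up to 1 as training progresses.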
Again, these details may be hard to understand at first, but each small improvement can contribute a lot towards more stable and efficient learning. So, make sure you give the prioritized experience replay paper a good read.