5 – Fixed Q-Targets

Experience replay helps us address one type of correlation: the one between consecutive experience tuples. There is another kind of correlation that Q-learning is susceptible to.

Q-learning is a form of Temporal Difference, or TD, learning, right? Here, R plus gamma times the maximum possible value from the next state is called the TD target, and our goal is to reduce the difference between this target and the currently predicted Q-value. This difference is the TD error.

Now, the TD target here is supposed to be a replacement for the true value function q_π(S, A), which is unknown to us. We originally used q_π to define a squared error loss, and differentiated that with respect to w to get our gradient descent update rule. Now, q_π does not depend on our function approximation or its parameters, which results in a simple derivative and a simple update rule. But our TD target does depend on these parameters, which means that simply replacing the true value function q_π with a target like this is mathematically incorrect.

We can get away with it in practice because every update results in only a small change to the parameters, so we're generally moving in the right direction. If we set alpha equal to one and leapt towards the target, we'd likely overshoot and land in the wrong place. Also, this is less of a concern when we use a lookup table or a dictionary, since Q-values are stored separately for each state-action pair. But it can affect learning significantly when we use function approximation, where all the Q-values are intrinsically tied together through the function parameters.

You may be thinking, "Doesn't experience replay take care of this problem?" Well, it addresses a similar but slightly different issue. There, we broke the correlation effects between consecutive experience tuples by sampling them randomly, out of order. Here, the correlation is between the target and the parameters we are changing. This is like chasing a moving target, literally. In fact, it's worse.
It's like trying to train a donkey to walk straight by sitting on it and dangling a carrot in front. Yes, the donkey might step forward, and the carrot usually moves further away too, always staying a little out of reach. But, contrary to popular belief, this doesn't quite work as you would expect. The carrot is much more likely to bounce around randomly, throwing the donkey off with every jerky step. Each action affects the next position of the target in a very complicated and unpredictable manner. You shouldn't be surprised if the donkey gets frustrated, jumps around on the spot, and gives up. Instead, you should get off the donkey, stand in one place, and dangle the carrot from there. Once the donkey reaches that spot, move a few steps ahead, dangle another carrot, and repeat. What you're essentially doing is decoupling the target's position from the donkey's actions, giving it a more stable learning environment.

We can do pretty much the same thing in Q-learning by fixing the function parameters used to generate our target. The fixed parameters, indicated by w minus (w⁻), are basically a copy of w that we don't change during the learning step. In practice, we copy w into w minus and use it to generate targets while changing w for a certain number of learning steps. Then we update w minus with the latest w, learn again for a number of steps, and so on. This decouples the target from the parameters, makes the learning algorithm much more stable, and makes it less likely to diverge or fall into oscillations.
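
To make the copy-and-hold idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the lecture's own implementation: it assumes a linear approximator (one weight row per action, dotted with a state feature vector), and the names `q_hat`, `td_target`, `learn`, and `sync_every`, as well as the toy replay buffer, are all hypothetical. The target is computed from the frozen copy `w_minus`, while only `w` is changed at each learning step.

```python
import copy
import random


def q_hat(weights, features, action):
    """Linear approximation: dot product of the action's weight row and the state features."""
    return sum(w * x for w, x in zip(weights[action], features))


def td_target(reward, next_features, done, target_weights, gamma):
    """R + gamma * max_a q_hat(S', a, w_minus); just R when the episode has ended."""
    if done:
        return reward
    n_actions = len(target_weights)
    best_next = max(q_hat(target_weights, next_features, a) for a in range(n_actions))
    return reward + gamma * best_next


def learn(replay, n_actions, n_features, steps=2000, sync_every=20,
          alpha=0.05, gamma=0.9, seed=0):
    """Q-learning with fixed targets over a replay buffer of
    (features, action, reward, next_features, done) tuples."""
    rng = random.Random(seed)
    w = [[0.0] * n_features for _ in range(n_actions)]
    w_minus = copy.deepcopy(w)  # frozen copy used only to generate targets

    for step in range(1, steps + 1):
        features, action, reward, next_features, done = rng.choice(replay)

        # The target uses w_minus, so it does not move when w is updated.
        target = td_target(reward, next_features, done, w_minus, gamma)
        td_error = target - q_hat(w, features, action)

        # For a linear q_hat, the gradient with respect to w[action] is the feature vector.
        for i, x in enumerate(features):
            w[action][i] += alpha * td_error * x

        # Every sync_every learning steps, refresh the frozen copy with the latest w.
        if step % sync_every == 0:
            w_minus = copy.deepcopy(w)

    return w, w_minus


if __name__ == "__main__":
    # Toy buffer (hypothetical): two one-hot states, two actions.
    s0, s1 = [1.0, 0.0], [0.0, 1.0]
    replay = [
        (s0, 0, 0.0, s1, False),  # s0 --a0--> s1, no reward
        (s1, 0, 1.0, s1, True),   # s1 --a0--> terminal, reward 1
        (s1, 1, 0.0, s1, True),   # s1 --a1--> terminal, reward 0
    ]
    w, w_minus = learn(replay, n_actions=2, n_features=2)
    print("q(s1, a0) =", q_hat(w, s1, 0))
    print("q(s0, a0) =", q_hat(w, s0, 0))
```

In this toy problem the estimate for `(s1, a0)` should settle near the reward of 1, and the estimate for `(s0, a0)` near gamma times that value. The choice of `sync_every` is a trade-off assumed here for illustration: a larger value keeps the target still for longer, at the cost of learning from staler targets.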
