4 – Experience Replay

The idea of experience replay and its application to training neural networks for reinforcement learning isn’t new. It was originally proposed to make more efficient use of observed experiences. Consider the basic online Q-learning algorithm, where we interact with the environment and at each time step obtain a (state, action, reward, next state) tuple, learn from it, and then discard it, moving on to the next tuple in the following time step. This seems a little wasteful. We could possibly learn more from these experience tuples if we stored them somewhere. Moreover, some states are pretty rare to come by and some actions can be pretty costly, so it would be nice to recall such experiences. That is exactly what a replay buffer allows us to do. We store each experience tuple in this buffer as we interact with the environment, and then sample a small batch of tuples from it in order to learn. As a result, we are able to learn from individual tuples multiple times, recall rare occurrences, and in general make better use of our experience.
But there is another critical problem that experience replay can help with, and this is what DQN takes advantage of. If you think about the experiences being obtained, you realize that every action A_t affects the next state S_{t+1} in some way, which means that a sequence of experience tuples can be highly correlated. A naive Q-learning approach that learns from each of these experiences in sequential order runs the risk of getting swayed by the effects of this correlation. With experience replay, we can sample from this buffer at random. It doesn’t have to be in the same sequence as we stored the tuples. This helps break the correlation and ultimately prevents action values from oscillating or diverging catastrophically.
Now, this might be a little hard to understand, so let’s look at an example. I’m learning to play tennis and I like to practice rallying against the wall.
I’m more confident with my forehand shot than my backhand, and I can hit the ball fairly straight. So the ball keeps coming back around the same area to my right, and I keep hitting forehand shots. Now, if I were an online Q-learning agent learning as I play, this is what I might pick up: when the ball comes to my right, I should hit with my forehand, less certainly at first but with increasing confidence as I repeatedly hit the ball. Great! I’m learning to play forehand pretty well, but I’m not exploring the rest of the state space. This could be addressed by using an epsilon-greedy policy, acting randomly with a small probability. So I try different combinations of states and actions, and sometimes I make mistakes, but I eventually figure out the best overall policy: use a forehand shot when the ball comes to my right and a backhand when it comes to my left.
Perfect. This learning strategy seems to work well with a Q-table, where we have assumed a highly simplified state space with just two discrete states. But when we consider a continuous state space, things can fall apart. Let’s see how. First, the ball can actually come anywhere between the extreme left and the extreme right. If I tried to discretize this range into small buckets, I would have too many possibilities. What if I end up learning a policy with holes in it, that is, states or situations that I may not have visited during practice? Instead, it makes more sense to use a function approximator, like a linear combination of RBF kernels or a Q-network, that can generalize my learning across the space. Now, every time the ball comes to my right and I successfully hit a forehand shot, my value function changes slightly. It becomes more positive around the exact region where the ball came, but it also raises the value of the forehand shot in general across the state space. The effect is less pronounced away from the exact spot, but over time it can add up.
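To make that spill-over effect concrete, here is a minimal sketch, not taken from the lecture, of a forehand value function built as a linear combination of Gaussian RBF kernels. The ball position range [-1, 1], the kernel centers and width, the learning rate, and the target value of 1.0 for a successful shot are all assumptions chosen for illustration.

```python
import math

# Hypothetical setup: ball position x in [-1, 1] (left to right); the value
# of the forehand shot is a weighted sum of Gaussian RBF kernels.
CENTERS = [-1.0 + 0.25 * i for i in range(9)]  # assumed kernel centers
WIDTH = 0.5                                     # assumed kernel width


def features(x):
    """Gaussian RBF activations for ball position x."""
    return [math.exp(-((x - c) ** 2) / (2 * WIDTH ** 2)) for c in CENTERS]


def value(weights, x):
    """Approximate value of the forehand shot at position x."""
    return sum(w * f for w, f in zip(weights, features(x)))


def update(weights, x, target, lr=0.1):
    """One gradient step moving value(x) toward target; returns new weights."""
    error = target - value(weights, x)
    return [w + lr * error * f for w, f in zip(weights, features(x))]


weights = [0.0] * len(CENTERS)
# Twenty successful forehands in a row, always with the ball at x = 0.6.
for _ in range(20):
    weights = update(weights, 0.6, target=1.0)

near = value(weights, 0.6)   # where the ball actually came
far = value(weights, -0.6)   # left side, never visited during practice
print(f"forehand value on the right: {near:.3f}, on the left: {far:.3f}")
```

In this toy setting the forehand value on the left ends up smaller than on the right but clearly above zero, even though no shot was ever played there. That is the generalization described above working against us when all the samples come from one region.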
And that’s exactly what happens when I try to learn while playing, processing each experience tuple in order. For instance, if my forehand shot is fairly straight, I will likely get the ball back around the same spot. This produces a state very similar to the previous one, so I use my forehand again, and if it is successful, it reinforces my belief that forehand is a good choice. I can easily get trapped in this cycle. Ultimately, if I don’t see many examples of the ball coming to my left for a while, then the value of the forehand shot can become greater than that of the backhand across the entire state space. My policy would then be to choose forehand regardless of where I see the ball coming. Disaster.
Okay, how can we fix this? The first thing I should do is stop learning while practicing. This time is best spent trying out different shots, playing a little randomly, and thus exploring the state space. It then becomes important to remember my interactions: what shots worked well in what situations, et cetera. When I take a break, or when I am back home and resting, that’s a good time to recall these experiences and learn from them. The main advantage is that now I have a more comprehensive set of examples: some where the ball came to my right, some where it came to my left, some forehand shots, some backhand. I can generalize patterns from across these examples, recalling them in whatever order I please. This helps me avoid being fixated on one region of the state space or reinforcing the same action over and over. After a round of learning, I can go back to playing with my updated value function, again collect a bunch of experiences, and then learn from them in a batch. In this way, experience replay can help us learn a more robust policy, one that is not affected by the inherent correlation present in the sequence of observed experience tuples. If you think about it, this approach is basically building a database of samples and then learning a mapping from them.
In that sense, experience replay helps us reduce the reinforcement learning problem, or at least the value learning portion of it, to a supervised learning scenario. That’s clever. We can then apply other models, learning techniques, and best practices developed in the supervised learning literature to reinforcement learning. We can even improve upon this idea, for example, by prioritizing experience tuples that are rare or more important.
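The replay buffer described in this section can be sketched in a few lines of Python. This is one possible shape and not the implementation used in DQN: the class name `ReplayBuffer`, the `capacity` limit (old tuples are dropped when the buffer is full), and the `seed` parameter are assumptions made for illustration. The tuple fields follow the (state, action, reward, next state) layout from the text.

```python
import random
from collections import deque, namedtuple

# One experience tuple, with the four fields named in the text.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size store of experience tuples with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # Assumption: a bounded buffer; the oldest tuples are dropped first.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        """Store one tuple as we interact with the environment."""
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a batch uniformly at random, ignoring the order of arrival."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Usage: store 50 consecutive (and therefore correlated) toy experiences,
# then draw a small batch whose order no longer follows the time sequence.
buffer = ReplayBuffer(capacity=1000, seed=0)
for t in range(50):
    buffer.add(state=t, action=t % 2, reward=0.0, next_state=t + 1)

batch = buffer.sample(8)
print([e.state for e in batch])
```

Sampling here is uniform, so every stored tuple is equally likely to be recalled. The prioritization idea mentioned above would replace that uniform draw with one weighted toward rare or more important tuples.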
