8 – PPO Part 2: Clipping Policy Updates

In this video, we will learn how to clip the surrogate function to ensure that the new policy remains close to the old one. So what, really, is the problem with updating our policy while ignoring the fact that the approximations may not be valid anymore? The problem is that it can lead to a really bad policy. Let’s see that in action. Say we have some policy parameterized by theta prime shown on the left, with the reward function shown on the right. The current policy is labeled by the red dot, and the goal is to update the policy toward the optimal one, denoted by the green star. To update the policy, we can compute a surrogate function, shown by the red curve. This function approximates the reward pretty well around the current policy, but it diverges far away from the current policy. So, if we continually update the policy by performing gradient ascent, we might get something like this, and then this. The big problem here is that at some point we hit a cliff, where the policy changes by a large amount. From the perspective of the surrogate function, the average reward is really great, but in reality the actual reward is really bad. The worst part is that the policy is now stuck at a deep and flat bottom, so future updates won’t be able to bring the policy back up, and we’re stuck with a really bad policy.

How do we fix this? Wouldn’t it be great if we could somehow stop the gradient ascent so that our policy doesn’t fall off the cliff? Well, here’s an idea: what if we just flatten the surrogate function, like this? What would a policy update look like then? Starting with the current policy, now labeled by the blue dot, we can apply gradient ascent. The updates remain the same until we hit the plateau. Because the flattened surrogate function has zero gradient there, the policy update stops. Keep in mind that right now we’re only showing a two-dimensional figure with one theta prime direction. In most cases, there are thousands of parameters in the policy, so there could be high-dimensional cliffs in many different directions, and applying this clipping mathematically will automatically take care of all of them.

How do we formalize this clipping procedure, then? We can write the clipped surrogate function like this. The formula looks a little complicated, but the idea is pretty simple, so let’s dissect it by looking at one specific term in the sum and setting the future reward to 1 to make things easier. First, we start with the original surrogate function, shown in red, which is basically just the ratio pi theta prime over pi theta. The black dot shows the location where the current policy is the same as the old one. Now, we want to make sure the two policies are similar, or that the ratio is close to 1. So we can choose a small epsilon, usually 0.1 or 0.2, and apply the clip function to force the ratio to stay within a small epsilon window around 1, like this. But we actually only want to clip the top part, not the bottom part, of the function. To do that, we can compare the clipped function to the original one and take the minimum, shown here in blue. Taking the minimum also ensures that the clipped surrogate function is always less than or equal to the original one. This way, when we maximize the clipped surrogate function, we’re also indirectly maximizing the original surrogate function. So in a sense, we have a more conservative optimization procedure. Comparing all the curves together looks something like this.
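As a rough sketch of how this clipped surrogate term could look in code (a minimal PyTorch example; the function name, tensor inputs, and the use of log-probabilities are illustrative assumptions, not taken from the video):

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, future_rewards, epsilon=0.2):
    # Ratio pi_theta'(a|s) / pi_theta(a|s), computed from log-probabilities
    # for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Original surrogate term: ratio times the estimated future reward.
    surrogate = ratio * future_rewards

    # Clipped version: the ratio is forced into the [1 - epsilon, 1 + epsilon] window.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * future_rewards

    # The element-wise minimum keeps the clipped surrogate at or below the
    # original one, which is the conservative choice described above.
    return torch.min(surrogate, clipped).mean()
```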
So, that’s it. The essence of the PPO algorithm is simply computing the clipped surrogate function and then performing multiple gradient-ascent updates on it.
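A hedged sketch of that update loop, assuming the clipped_surrogate function above, a `policy` object with a hypothetical `log_prob(states, actions)` method, and a batch of trajectories already collected with the old policy:

```python
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# old_log_probs come from the policy that collected the trajectories and are
# treated as constants, so only the new policy receives gradients.
old_log_probs = old_log_probs.detach()

for _ in range(4):  # several update steps on the same batch of trajectories
    new_log_probs = policy.log_prob(states, actions)  # hypothetical helper
    # Negate the surrogate so that minimizing the loss performs gradient ascent.
    loss = -clipped_surrogate(new_log_probs, old_log_probs, future_rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```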
