7 – PPO Part 1_ The Surrogate Function

In this video, we’ll learn how to use important sampling in the context of policy gradient, which will lead us to the surrogate function. Say we’re trying to update our policy, Pi Theta Prime. To do that, we need to estimate a gradient, g, but we only have trajectories generated by an older policy, Pi Theta. How do we compute the gradient then? Mathematically, we could utilize important sampling, and the answer is just what a normal policy gradient would be, times a re-weighting factor. We can rearrange these equations, and the re-weighting factor is just the product of all the policy across each step. Notice that I’ve picked out the parts at time step t here. We can rearrange the equation a little bit more, and notice that we can cancel the terms on the left, but still we are left with a product of policies at different times, denoted by the dot dot dot. Can we somehow simplify this expression further? Well, this is where the idea of proximal policy comes in. If the old and current policy is close enough to each other, all these factors will be pretty close to one. Then perhaps we can ignore them. So, now the equation simplifies even further. It looks very similar to the old policy gradient. In fact, if the current policy is the same as the old policy, we would have exactly the same vanilla policy gradient, but remember, this expression is different because we are comparing two different policies. Now that we have the approximate form of the gradient, we can think of it as the gradient of a new object called the surrogate function. So, using this new gradient, we can perform gradient ascent to update our policy, which we can think of as directly maximizing the surrogate function. But there’s still one important issue we have not addressed yet. If we keep re-using old trajectories and updating our policy, at some point the new policy might become different enough from the old one, so all the approximations we made could become invalid. We need to find a way to make sure this doesn’t happen. Let’s see how in part two.

%d 블로거가 이것을 좋아합니다: