6 – Importance Sampling

In this video, we’ll learn how to use importance sampling to reuse trajectories. This will improve the efficiency of policy-based methods. Let’s get started.

Let’s go back to the REINFORCE algorithm. We start with a policy, π_θ. Then, using that policy, we generate some trajectories. Afterward, using those trajectories, we compute a policy gradient, g, and update θ to θ'. Now, we have a new policy, π_θ'. At this point, the trajectories we’ve just generated are simply thrown away. If we want to update our policy again, we would need to generate new trajectories once more, using the updated policy. You might ask, “Why is all this necessary?” Well, it’s because we need to compute the gradient for the current policy, and to do that, the trajectories need to be representative of the current policy.

But this sounds a little wasteful. What if we could somehow recycle the old trajectories by modifying them, so that they become representative of the new policy? Then, instead of throwing them away, we could reuse the recycled trajectories to compute gradients and update our policy again and again, which would make updating the policy a lot more efficient. So, how exactly would that work? This is where importance sampling comes in.

Let’s look at the trajectories we generated using the policy π_θ. Each trajectory had a probability P(τ; θ) of being sampled. Now, just by chance, the same trajectory can be sampled under the new policy, but with a slightly different probability, P(τ; θ'). Imagine we want to compute the average of some quantity, say f(τ), under the new policy. The easiest way to do this is to simply generate trajectories from the new policy, compute f(τ) for each one, and average them up. Mathematically, this is equivalent to summing f(τ) over all possible trajectories, weighted by the probability factor P(τ; θ'). Now, we can modify this equation by multiplying and dividing by the same number, P(τ; θ), and then rearrange the terms like this. It doesn’t look like we’ve done much, but written in this way, we can reinterpret the first factor as the probability of sampling under the old policy, with an extra re-weighting factor, P(τ; θ') / P(τ; θ), multiplying f(τ). Intuitively, this tells us we can use trajectories sampled from the old policy to compute averages under the new policy, as long as we add this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one. The same trick has been used frequently in statistics, where a re-weighting factor is included to correct bias in surveys and voting predictions. So, in order to reuse old samples for computing gradients for the new policy, all we need to do is include this extra re-weighting factor.

Now, let’s take a closer look at the re-weighting factor. Because each trajectory contains many steps, its probability is a chain of products of the policy’s probabilities at each step, like this. This formula is a little complicated, but there’s actually a bigger problem. When some of the policy probabilities get close to zero, the re-weighting factor can easily become very close to zero or, worse, very close to one over zero, which diverges to infinity. When this happens, the re-weighting trick becomes very unreliable. So, in practice, we always want to make sure that the re-weighting factor is not too far from one when we utilize importance sampling.
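To make the rearrangement described above concrete, here is the identity written out in standard notation. The second display also spells out the per-step structure of the trajectory probability; the initial-state and transition terms are not mentioned in the narration, but including them shows why only policy ratios survive in the re-weighting factor (the environment terms cancel):

```latex
\mathbb{E}_{\tau \sim \pi_{\theta'}}\big[f(\tau)\big]
  = \sum_{\tau} P(\tau;\theta')\, f(\tau)
  = \sum_{\tau} P(\tau;\theta)\,\frac{P(\tau;\theta')}{P(\tau;\theta)}\, f(\tau)
  = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\frac{P(\tau;\theta')}{P(\tau;\theta)}\, f(\tau)\right]

% The trajectory probability is a chain of products over the steps of the trajectory,
%   P(\tau;\theta) = p(s_0)\,\prod_{t} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),
% so the environment terms cancel in the re-weighting factor:
\frac{P(\tau;\theta')}{P(\tau;\theta)}
  = \prod_{t} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}
```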
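The re-weighting trick can be seen in action on a toy example. The sketch below is not part of any RL algorithm; it just uses a made-up four-outcome distribution to show that samples drawn only from the “old” distribution, once re-weighted, recover the average under the “new” distribution:

```python
import numpy as np

# Toy illustration of importance sampling (all names and numbers are made up).
rng = np.random.default_rng(0)

outcomes = np.array([0.0, 1.0, 2.0, 3.0])   # values of f(tau) for four "trajectories"
p_old = np.array([0.4, 0.3, 0.2, 0.1])      # sampling probabilities under the old policy
p_new = np.array([0.1, 0.2, 0.3, 0.4])      # sampling probabilities under the new policy

# Sample only from the OLD distribution, as if reusing old trajectories.
idx = rng.choice(len(outcomes), size=100_000, p=p_old)
samples = outcomes[idx]

# Re-weighting factor: how under- or over-represented each sample is under the new policy.
weights = p_new[idx] / p_old[idx]

naive_estimate = samples.mean()            # estimates the OLD-policy average (biased for our purpose)
is_estimate = (weights * samples).mean()   # importance-sampled estimate of the NEW-policy average
true_new_average = (p_new * outcomes).sum()

print(f"naive (old samples, no re-weighting): {naive_estimate:.3f}")
print(f"importance-sampled estimate:          {is_estimate:.3f}")
print(f"true average under the new policy:    {true_new_average:.3f}")
```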
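Finally, here is a minimal sketch of how the per-trajectory re-weighting factor might be computed in practice, assuming you have stored the per-step action log-probabilities under the old policy and can re-evaluate them under the new policy. The function name, the clipping value, and the example numbers are hypothetical; the point is that working in log space and keeping the factor near one addresses the numerical issues mentioned in the narration:

```python
import numpy as np

def reweighting_factor(new_log_probs, old_log_probs, clip=10.0):
    """Re-weighting factor for one trajectory from per-step policy log-probabilities."""
    # The factor is a chain of products of per-step policy ratios; multiplying many
    # small (or large) numbers directly under- or over-flows, so sum in log space.
    log_ratio = np.sum(new_log_probs) - np.sum(old_log_probs)
    # Guard against the ratio diverging toward infinity (or collapsing to zero)
    # when some policy probabilities are close to zero.
    log_ratio = np.clip(log_ratio, -clip, clip)
    return np.exp(log_ratio)

# Example with made-up numbers for a 3-step trajectory.
old_log_probs = np.log([0.50, 0.40, 0.60])   # probabilities under the policy that generated the data
new_log_probs = np.log([0.45, 0.50, 0.55])   # probabilities under the updated policy, same actions
w = reweighting_factor(new_log_probs, old_log_probs)
print(f"re-weighting factor: {w:.3f}")       # close to 1 here, so the trick is reliable
```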
