9 – TLPPO Summary V1

After learning about importance sampling and the clipped surrogate function, we can finally summarize the PPO algorithm. First, we collect some trajectories based on a policy pi theta, and initialize theta prime to be equal to theta. Next, we compute the gradient of the clipped surrogate function using the collected trajectories. Then we update theta prime using gradient ascent. After that, we can repeat steps two and three, re-computing the clipped surrogate function and updating the policy again and again without generating new trajectories, although typically steps two and three are only repeated a few times. Then we go back to step one, generate more trajectories, and repeat all the previous steps. And that's it. Now that you've learned what Proximal Policy Optimization is, let's put it into action and implement it.
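Here is a minimal sketch of the loop described above, assuming a discrete-action PyTorch policy network `policy` that maps states to action logits, and that the trajectories from step one have already been turned into tensors of states, actions, old log-probabilities (under theta), and advantage estimates. The names `clipped_surrogate`, `ppo_update`, and `sgd_epochs` are illustrative, not part of any particular library.

```python
import torch


def clipped_surrogate(policy, old_log_probs, states, actions, advantages,
                      epsilon=0.2):
    """Clipped surrogate objective, averaged over a batch of time steps."""
    logits = policy(states)
    new_log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Importance-sampling ratio pi_theta'(a|s) / pi_theta(a|s)
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the pessimistic (minimum) of the clipped and unclipped terms
    return torch.min(ratio * advantages, clipped * advantages).mean()


def ppo_update(policy, optimizer, states, actions, old_log_probs, advantages,
               sgd_epochs=4, epsilon=0.2):
    """Steps two and three: repeat gradient ascent on the same batch
    of trajectories a few times, without collecting new data."""
    for _ in range(sgd_epochs):
        # Gradient ascent on the surrogate = descent on its negation
        loss = -clipped_surrogate(policy, old_log_probs, states, actions,
                                  advantages, epsilon)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In an outer loop you would collect fresh trajectories with the current policy (step one), call `ppo_update` on them, and repeat; `old_log_probs` must be recorded (and detached) before the update so the ratio compares theta prime against the policy that generated the data.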
