9 – TLPPO Summary V1
After learning importance sampling and the clipped surrogate function, we can finally summarize the PPO algorithm. First, we collect some trajectories using some policy pi_theta, and initialize theta' to be equal to theta. Next, we compute the gradient of the clipped surrogate function using the collected trajectories. Then we update theta' …
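The clipped surrogate objective at the heart of these steps can be sketched as follows. This is a minimal NumPy sketch, not the course's implementation; the function name, the batch layout, and the epsilon value are illustrative assumptions.

```python
import numpy as np

def clipped_surrogate(ratios, advantages, eps=0.2):
    """PPO clipped surrogate objective (illustrative sketch).

    ratios: pi_theta'(a|s) / pi_theta(a|s) for each sampled step,
            i.e. the importance-sampling ratio between the updated
            and the data-collecting policy.
    advantages: advantage estimates for the same steps.
    """
    unclipped = ratios * advantages
    # Clip the ratio to [1 - eps, 1 + eps] so a single large ratio
    # cannot push theta' far from theta in one update.
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    # Element-wise minimum of the two terms, averaged over the batch.
    return np.mean(np.minimum(unclipped, clipped))

# Toy check: with a positive advantage, a ratio above 1 + eps is
# capped at 1 + eps, so the objective stops rewarding that step.
r = np.array([1.5, 1.0, 0.5])
adv = np.array([1.0, 1.0, 1.0])
print(clipped_surrogate(r, adv, eps=0.2))  # → 0.9
```

In practice one maximizes this objective with gradient ascent on theta' for a few epochs over the collected trajectories, then sets theta to theta' and collects new trajectories.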