9 – PPO Summary

After learning importance sampling and the clipped surrogate function, we can finally summarize the PPO algorithm. First, we collect some trajectories based on some policy pi theta, and initialize theta prime to be equal to theta. Next, we compute the gradient of the clipped surrogate function using the collected trajectories. Then we update theta prime … Read more
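The loop sketched above (collect trajectories with pi theta, ascend the clipped surrogate a few times, then copy theta prime back into theta) can be illustrated on a toy two-armed bandit. Everything here (the bandit, the numerical gradient, the learning rate) is illustrative, not from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(theta):
    """Softmax policy over two arms with logits [theta, 0]."""
    z = np.exp([theta, 0.0])
    return z / z.sum()

def collect(theta, n=256):
    """Sample actions from pi_theta; arm 0 pays reward 1, arm 1 pays 0."""
    probs = pi(theta)
    actions = rng.choice(2, size=n, p=probs)
    rewards = (actions == 0).astype(float)
    return actions, rewards, probs[actions]

def clipped_surrogate(theta_new, actions, rewards, p_old, eps=0.2):
    """PPO objective: mean of min(ratio * R, clip(ratio) * R)."""
    ratio = pi(theta_new)[actions] / p_old
    return np.mean(np.minimum(ratio * rewards,
                              np.clip(ratio, 1 - eps, 1 + eps) * rewards))

theta = 0.0
for _ in range(20):                      # outer loop: fresh trajectories
    actions, rewards, p_old = collect(theta)
    theta_new = theta
    for _ in range(4):                   # inner loop: reuse the same data
        h = 1e-5                         # numerical gradient of the surrogate
        g = (clipped_surrogate(theta_new + h, actions, rewards, p_old)
             - clipped_surrogate(theta_new - h, actions, rewards, p_old)) / (2 * h)
        theta_new += 1.0 * g             # gradient ascent step
    theta = theta_new                    # old policy <- new policy

# After training, pi(theta)[0] should be close to 1: the policy
# has learned to prefer the rewarding arm.
```

The inner loop is the point of PPO: the same batch of trajectories is reused for several updates, with clipping keeping theta prime from drifting too far from the theta that generated the data.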

8 – PPO Part 2_ Clipping Policy Updates

In this video, we will learn how to clip the surrogate function to ensure that the new policy remains close to the old one. So, really, what is the problem with updating our policy while ignoring the fact that the approximations may no longer be valid? Such an update can basically lead to a really bad … Read more
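The clipping itself is a one-liner. A minimal sketch of the clipped objective for a single (state, action) sample, with an illustrative epsilon of 0.2:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clip objective for one sample:
    min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, ratios above 1+eps earn no extra objective,
# so the incentive to push the policy further in that direction vanishes.
print(clipped_surrogate(1.5, advantage=1.0))   # 1.2  (capped at 1+eps)
print(clipped_surrogate(0.9, advantage=1.0))   # 0.9  (inside the clip range)
# The min always picks the more pessimistic value:
print(clipped_surrogate(0.5, advantage=-1.0))  # -0.8 (clip floor applies)
```

Taking the minimum of the clipped and unclipped terms is what makes the bound pessimistic: the objective never rewards moving the new policy far from the old one.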

7 – PPO Part 1_ The Surrogate Function

In this video, we’ll learn how to use importance sampling in the context of policy gradients, which will lead us to the surrogate function. Say we’re trying to update our policy, Pi Theta Prime. To do that, we need to estimate a gradient, g, but we only have trajectories generated by an older policy, Pi … Read more
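The resulting surrogate weights each old-trajectory reward by the ratio of new-policy to old-policy probabilities. A small sketch with two illustrative softmax policies (the logits, actions, and rewards below are made up for the example):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta_old = np.array([0.0, 0.0, 0.0])   # pi_theta  (collected the data)
theta_new = np.array([0.2, -0.1, 0.0])  # pi_theta' (being evaluated)

actions = np.array([0, 1, 2, 0])        # actions taken under the old policy
rewards = np.array([1.0, -1.0, 0.5, 1.0])

p_old = softmax(theta_old)[actions]
p_new = softmax(theta_new)[actions]
ratio = p_new / p_old                   # re-weighting factor from importance sampling

# Surrogate objective: average of ratio * reward over the old trajectories.
L_surrogate = np.mean(ratio * rewards)
print(round(L_surrogate, 3))            # 0.489
```

Differentiating this surrogate with respect to theta prime recovers the gradient g without generating any new trajectories.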

6 – Importance Sampling

In this video, we’ll learn how to utilize importance sampling to reuse trajectories. This will improve the efficiency of policy-based methods. Let’s get started. Let’s go back to the REINFORCE algorithm. We start with a policy pi_theta. Then, using that policy, we generate some trajectories. Afterward, using those trajectories, we compute a policy gradient, … Read more
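The core trick, outside of any RL context, is estimating an expectation under one distribution using samples from another, by weighting each sample with the density ratio. A minimal sketch with two Gaussians (the distributions and the target function are chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu):
    """Density of a unit-variance Gaussian centered at mu."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Target distribution p = Normal(1, 1); sampling distribution q = Normal(0, 1).
f = lambda x: x ** 2                         # quantity whose expectation we want
x = rng.normal(0.0, 1.0, 100_000)            # samples drawn from q, not p
w = normal_pdf(x, 1.0) / normal_pdf(x, 0.0)  # importance weights p(x)/q(x)
estimate = np.mean(w * f(x))

# True value: E_p[x^2] = Var + mean^2 = 1 + 1 = 2; the estimate lands close.
```

In the RL setting, p and q become the new and old policies' trajectory probabilities, which is exactly the ratio that appears in the surrogate function.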

4 – Credit Assignment

In this video, we’ll learn how to modify the reward function so that we can better differentiate good versus bad actions within a trajectory. Going back to the gradient estimate, we can take a closer look at the total reward R, which is just the sum of the rewards at each step. Now, let’s think about … Read more
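A common way to assign credit is to replace the total reward R with the "reward-to-go": each action is credited only with the rewards that come after it. A small sketch (the optional discount factor gamma is illustrative):

```python
def future_rewards(rewards, gamma=1.0):
    """Reward-to-go: for each step, sum the (discounted) rewards
    from that step onward, computed in one backward pass."""
    out = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

print(future_rewards([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```

The first action still sees the full return, but later actions are no longer credited for rewards that happened before they were taken.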

3 – Noise Reduction

Here, we will explain why the policy gradient is noisy, and discuss ways to reduce this noise. The way we optimize the policy is by maximizing the average reward, U Theta. To do that, we use gradient ascent. Mathematically, the gradient is given by an average of the terms in the parentheses, over all the … Read more
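Two standard noise-reduction tools go together in practice: averaging the gradient over many trajectories collected in parallel, and normalizing the batch of rewards so the learning signal stays well-scaled. A minimal sketch of the normalization step (the simulated rewards are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# One total reward per trajectory, as if 8 agents ran in parallel.
# Averaging over many trajectories reduces gradient variance;
# normalizing (subtract mean, divide by std) centers and rescales
# the batch so no single large reward dominates the update.
rewards = rng.normal(5.0, 2.0, size=(8,))
normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# After normalization the batch has mean ~0 and std ~1, so roughly
# half the trajectories push the policy one way and half the other.
```

The small 1e-8 in the denominator is a common guard against dividing by zero when all trajectories happen to earn the same reward.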

1 – Instructor Introduction

Hi, I’m Tim. Nice to meet you all. I’m your new instructor for the deep reinforcement learning course. Before joining the Udacity team, I was a postdoctoral researcher in particle physics at UC Berkeley. In fundamental physics, we borrow many tools from machine learning and AI to help us model complex interactions and derive … Read more