7 – M2L3 02 V2

You may be wondering why we need to find optimal policies directly when value-based methods seem to work so well. There are three arguments we’ll consider: simplicity, stochastic policies, and continuous action spaces. Remember that in value-based methods like Q-learning, we invented this idea of a value function as an intermediate step towards finding …

6 – M3 L2 C07 V3

So far, you’ve learned about a couple of different algorithms that we can use to optimize the weights of the policy network. Hill climbing begins with a best guess for the weights, then it adds a little bit of noise to propose one new policy that might perform better. Steepest-ascent hill climbing does a little …
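
To make the contrast concrete, here is a minimal sketch of one update step for each variant. It assumes a hypothetical `evaluate(weights)` helper that runs one episode with the given policy weights and returns the episode return; the noise scale and the number of candidates are illustrative choices, not values from the lesson.

```python
import numpy as np

def hill_climbing_step(best_weights, best_return, evaluate, noise_scale=0.1):
    """Propose a single noisy candidate and keep it only if it does at least as well."""
    candidate = best_weights + noise_scale * np.random.randn(*best_weights.shape)
    candidate_return = evaluate(candidate)
    if candidate_return >= best_return:
        return candidate, candidate_return
    return best_weights, best_return

def steepest_ascent_step(best_weights, best_return, evaluate,
                         noise_scale=0.1, num_candidates=8):
    """Propose several noisy candidates at once and keep the best of the batch."""
    candidates = [best_weights + noise_scale * np.random.randn(*best_weights.shape)
                  for _ in range(num_candidates)]
    returns = [evaluate(c) for c in candidates]
    best_idx = int(np.argmax(returns))
    if returns[best_idx] >= best_return:
        return candidates[best_idx], returns[best_idx]
    return best_weights, best_return
```

The only difference is how many neighboring policies are proposed per step: plain hill climbing looks at one, steepest-ascent looks at several and adopts the best of them.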

5 – M2L3 04 V1

With an objective function in hand, we can now think about finding a policy that maximizes it. An objective function can be quite complex. Think of it as a surface with many peaks and valleys. Here it’s a function of two parameters, with the height indicating the policy’s objective value J(θ). But the same …
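
One way to picture how that surface gets evaluated at a single point θ is the sketch below. It assumes a hypothetical `run_episode(theta)` helper that plays one episode with the policy defined by theta and returns its return; averaging over episodes gives a Monte Carlo estimate of J(θ) at that point.

```python
import numpy as np

def estimate_objective(theta, run_episode, num_episodes=20):
    """Monte Carlo estimate of J(theta): the average return obtained when
    acting with the policy parameterized by theta (helper names are illustrative)."""
    returns = [run_episode(theta) for _ in range(num_episodes)]
    return np.mean(returns)
```

Evaluating this estimate over a grid of two-dimensional θ values is what produces the peaks-and-valleys surface described above.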

4 – M3 L2 C05 V1

Now that we have a mental picture of how the hill climbing algorithm should work, we’re ready to dig into the pseudo-code. So remember, we begin with an initially random set of weights θ. We’ll collect an episode with the policy that corresponds to those weights, and then record the return, which we’ll denote by …
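
A rough Python sketch of that pseudo-code is below. It assumes a hypothetical `collect_episode(theta)` helper that runs one episode with the policy corresponding to the weights theta and returns its return G; the iteration count, noise scale, and tiny random initialization are illustrative.

```python
import numpy as np

def hill_climbing(collect_episode, weight_shape, num_iterations=1000,
                  noise_scale=0.1, seed=0):
    """Keep a running best set of weights; perturb it with Gaussian noise and
    adopt the perturbed weights only when the episode return improves."""
    rng = np.random.default_rng(seed)
    best_theta = 1e-4 * rng.standard_normal(weight_shape)  # initially random weights
    best_G = collect_episode(best_theta)                   # return with those weights
    for _ in range(num_iterations):
        theta = best_theta + noise_scale * rng.standard_normal(weight_shape)
        G = collect_episode(theta)                         # return of the candidate policy
        if G >= best_G:                                    # keep the candidate if it did better
            best_theta, best_G = theta, G
    return best_theta, best_G
```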

3 – M3 L2 C04 V3

So far, you’ve learned how to represent a policy as a neural network. This network takes the current environment state as input; then, if the environment has a discrete action space, it outputs the action probabilities that the agent uses to select its next action. In this video, we’ll describe an algorithm that the agent …
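
A minimal sketch of that interface is shown below, using a tiny linear policy in NumPy rather than a full neural network library (the lesson’s own implementation may use PyTorch); the function names are illustrative.

```python
import numpy as np

def softmax(x):
    """Convert raw network outputs into a probability distribution over actions."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def policy_forward(state, weights):
    """Tiny linear policy: state in, action probabilities out.
    `weights` has shape (state_dim, num_actions)."""
    return softmax(state @ weights)

def select_action(state, weights, rng=np.random.default_rng()):
    """Sample the next action from the probabilities the policy outputs."""
    probs = policy_forward(state, weights)
    return rng.choice(len(probs), p=probs)
```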

2 – M3 L2 C02 V1

So, first things first. How might we approach this idea of estimating an optimal policy? Let’s consider this cart-pole example. In this case, the agent has two possible actions: it can push the cart either left or right. So, at each time step, the agent picks one of these two options. We can construct …
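
The sketch below shows one way that setup might look in code: a simple linear policy mapping the four-dimensional cart-pole observation to probabilities over the two push actions. It assumes the classic Gym `CartPole-v0` API, where `reset()` returns an observation and `step()` returns `(obs, reward, done, info)`; treat the specifics as an assumption rather than the lesson’s exact implementation.

```python
import gym
import numpy as np

# Assumes the classic Gym API: reset() -> obs, step(a) -> (obs, reward, done, info).
env = gym.make('CartPole-v0')
state_dim = env.observation_space.shape[0]   # 4 numbers describing cart and pole
num_actions = env.action_space.n             # 2 actions: push left (0) or push right (1)

weights = 1e-4 * np.random.randn(state_dim, num_actions)  # parameters of the simple policy

state = env.reset()
total_reward, done = 0.0, False
while not done:
    logits = state @ weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # probability of pushing left vs. right
    action = np.random.choice(num_actions, p=probs)   # pick one of the two options
    state, reward, done, _ = env.step(action)
    total_reward += reward
```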

1 – M3 L2 C01 V2

Reinforcement learning is ultimately about learning an optimal policy from interaction with the environment. So far, we’ve been looking at value-based methods, where we first tried to find an estimate of the optimal action-value function. For small state spaces, this optimal value function was represented in a table with one row for each state …
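
To make the table picture concrete, here is a toy sketch; the states, actions, and values are made up purely for illustration.

```python
import numpy as np

# Toy Q-table: one row per state, one column per action (values are invented).
q_table = np.array([
    [0.1, 0.5],   # state 0: action 1 looks best
    [0.7, 0.2],   # state 1: action 0 looks best
    [0.3, 0.3],   # state 2: tie
])

# The greedy policy picks, in each state, the action with the largest estimated value.
greedy_policy = q_table.argmax(axis=1)   # -> array([1, 0, 0])
```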