# 1 – M3L3 C01 V3

Hello and welcome to this lesson on policy gradient methods. In the previous lesson, you learned all about policy-based methods. Remember, policy-based methods are a class of algorithms that search directly for the optimal policy without simultaneously maintaining value function estimates. You learned how to represent the policy as a neural network, and in that setting, the agent's goal is to find the network weights that maximize expected return. Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent. In this lesson, you'll learn all about the theory behind these methods before training your own agent with a policy gradient method. For now, to set the stage, we'll explore an example that will help us understand how these methods work.

First, remember the cart-pole example. This is a classic benchmark task where the goal is to balance a pole on a moving cart. To solve it, we can represent the policy with a neural network that takes the state as input. As output, it returns the probability of each potential action, and the agent can then sample from those probabilities to select an action.

But now let's consider a much more challenging task, where the goal is to teach a chicken to cross a road. At each time step, our chicken agent can move up, down, left, or right, and the goal is to reach the other side safely, so we have to avoid getting hit by a car or truck. Say there's a time limit of five seconds: if we make it safely to the other side in time, we win; otherwise, we lose.

Just as with the cart-pole example, we can represent the agent's policy with a neural network. It takes the game state as input and returns the probability that the chicken selects each possible action. In this case, the output layer will have four nodes, because there are four possible actions. Now, if we're training our agent to learn from raw pixels, a convolutional neural network is probably the best bet.
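To make the idea concrete, here is a minimal sketch of such a policy network in plain NumPy. The state size, hidden-layer size, and the simple feed-forward architecture are illustrative assumptions for this example, not details from the lesson; the one essential piece is the softmax output, which turns the network's raw scores into probabilities over the four actions that the agent can sample from.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 4-number state and the four actions
# (up, down, left, right). Real tasks may differ.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 4

# Randomly initialized weights, as the lesson describes.
W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def policy(state):
    """Return a probability for each of the four actions."""
    h = np.tanh(state @ W1)              # hidden layer
    logits = h @ W2                      # one score per action
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def act(state):
    """Sample an action from the policy's probabilities."""
    return rng.choice(N_ACTIONS, p=policy(state))

state = rng.normal(size=STATE_DIM)  # a made-up game state
probs = policy(state)
```

Whatever architecture sits in the middle, the interface stays the same: state in, action probabilities out.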
But we'll work with this simpler diagram that doesn't pick an architecture or network type. It's only meant to communicate that the state of the game is passed as input to a neural network. Maybe it's a convolutional neural network and maybe it's not; we'll discuss a method that's flexible and will work with either. As output, the network returns action probabilities that the agent uses to select its next action.

As in the previous lesson, our goal is to find the weights of the neural network that yield the optimal policy. We'll begin with an initially random set of weights and use the corresponding policy to interact with the environment. So, for instance, say the agent plays the game for a single round, or episode, and ends up making it to the other side safely and within the time limit. But then, when it plays the game for another episode, it chooses an unwise series of actions that leads to it losing the round. Then, in order to teach the agent to win, we'll give a reward of positive one if it won and negative one if it lost. Note that this reward is only delivered at the end of the game.

So far, everything we've discussed is along the lines of what we did in the previous lesson, when we talked about several policy-based methods. The difference, which we'll talk about in the next video, arises in how exactly the agent uses this reward signal to learn the weights of the optimal policy.
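The episode structure described above can be sketched in code. Everything here is a made-up toy stand-in for the road-crossing game (a one-dimensional road, a goal position, a uniform placeholder policy); the point it illustrates is from the lesson: the agent collects a full episode of states and actions, and the single +1 or -1 reward arrives only at the end.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: the chicken starts at position 0 and wins by reaching
# position GOAL within TIME_LIMIT steps. These numbers are illustrative.
GOAL, TIME_LIMIT = 3, 5
UP = 0  # in this toy, action 0 moves the chicken toward the goal

def uniform_policy(state):
    # Placeholder for the neural-network policy: equal probabilities
    # over the four actions, like an untrained random network.
    return np.full(4, 0.25)

def play_episode(policy):
    """Roll out one episode; the +1/-1 reward comes only at the end."""
    pos, trajectory, won = 0, [], False
    for _ in range(TIME_LIMIT):
        action = rng.choice(4, p=policy(pos))
        trajectory.append((pos, action))
        if action == UP:
            pos += 1
        if pos >= GOAL:
            won = True
            break
    reward = 1.0 if won else -1.0  # delivered once, at episode's end
    return trajectory, reward

trajectory, reward = play_episode(uniform_policy)
```

How the agent turns that single end-of-episode reward into weight updates is exactly the question the next video takes up.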