# 3 – M3 L2 C04 V3

So far, you’ve learned how to represent a policy as a neural network. This network takes the current environment state as input and, if the environment has a discrete action space, outputs the action probabilities that the agent uses to select its next action. In this video, we’ll describe an algorithm that the agent can use to gradually improve the network weights.

Remember that the agent’s goal is always to maximize expected return. Let’s denote the expected return as capital J. We’ll refer to the set of weights in the neural network as Theta, and there’s some mathematical relationship between Theta and the expected return J. This is because Theta encodes the policy, which makes some actions more likely than others; the actions then influence the rewards, and we sum up the rewards to get the return. That sounds potentially quite complicated, but the main idea is that it’s possible to write the expected return J as a function of Theta, and that function looks like this. Now, this equation shouldn’t make sense to you yet, and we’ll come back to it later in the course. For now, just know that it exists, and that our goal is to find the values for Theta, or the values for the weights, that maximize J.

To understand this better, let’s draw a picture. Consider the case where the neural network has only two weights, Theta zero and Theta one. Then we can plot the expected return J as a function of the values of both weights; each weight gets its own axis. Once we have that function, we can use it to find the value of Theta, or the values of both weights, that maximizes expected return. Maybe you’re familiar with one algorithm that could be useful for this, called gradient ascent. Gradient ascent begins with an initial guess for the value of Theta that maximizes the function; then we evaluate the gradient at that point.
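As a concrete sketch of the two-weight picture just described, here is what a gradient-ascent loop looks like on a made-up surface. The toy function (a downward-opening paraboloid) and its gradient are assumptions chosen purely for illustration; they stand in for the real expected-return surface, whose gradient we don’t yet know how to compute.

```python
import numpy as np

def gradient_ascent(grad, theta, alpha=0.1, steps=100):
    """Repeatedly take a small step in the direction of the gradient."""
    for _ in range(steps):
        theta = theta + alpha * grad(theta)
    return theta

# Toy surface (an assumption for this sketch, not the real J):
#   J(theta) = -(theta0 - 1)^2 - (theta1 + 2)^2, maximized at (1, -2).
# Its gradient is -2 * (theta - optimum).
optimum = np.array([1.0, -2.0])
grad = lambda theta: -2.0 * (theta - optimum)

theta = gradient_ascent(grad, np.zeros(2))
print(np.round(theta, 3))  # approximately [1., -2.]
```

Each step moves the weights a little way uphill, so after enough steps they settle at the surface's maximum.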
Remember that the gradient points in the direction of steepest increase of the function, so we can take a small step in that direction, in the hope that we end up at a new value of Theta where the value of the function is a little bit larger. Then we repeat, evaluating the gradient and taking a step, until we eventually reach the maximum of the function.

Now, in order to do this, we have to take the gradient of the function. We’ll save that for later in the course; for now, we’ll discuss a different algorithm called hill climbing that’s a little bit simpler. As with gradient ascent, we begin with an initially random set of weights Theta. We collect a single episode with the policy that corresponds to those weights and record the return. This return is an estimate of what the surface looks like at that value of Theta. It’s not going to be a perfect estimate, because the return we just collected is unlikely to be equal to the expected return, but in practice the estimate often turns out to be good enough.

Then we’ll add a little bit of random noise to those weights to give us another set of candidate weights to try. To see how good those new weights are, we’ll use the policy they give us to again interact with the environment for an episode and add up the return. If the new weights give us more return than our current best estimate, we focus our attention on that new value of Theta, and then we just repeat, iteratively proposing new policies in the hope that they outperform the existing policy. When they don’t, we just go back to our last best guess for the optimal policy, and we iterate until we end up with the optimal policy. Later in this lesson, you’ll work with an implementation of this algorithm so you can see it working for yourself.
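The hill-climbing loop described above can be sketched in a few lines. This is not the lesson's implementation: `evaluate` here is a hypothetical stand-in for "run one episode and sum the rewards" (a noisy reading of a toy return surface), where a real agent would instead interact with an environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(theta):
    # Hypothetical stand-in for collecting one episode's return: a noisy
    # reading of a toy surface peaked at theta = (1, -2). The noise models
    # the fact that a single episode's return is an imperfect estimate.
    true_return = -np.sum((theta - np.array([1.0, -2.0])) ** 2)
    return true_return + rng.normal(scale=0.05)

def hill_climb(theta, noise_scale=0.1, iterations=2000):
    best_return = evaluate(theta)
    for _ in range(iterations):
        # Add a little random noise to the best weights so far...
        candidate = theta + noise_scale * rng.normal(size=theta.shape)
        candidate_return = evaluate(candidate)
        # ...and keep the candidate only if it earned more return.
        if candidate_return > best_return:
            theta, best_return = candidate, candidate_return
    return theta

theta = hill_climb(np.zeros(2))
print(np.round(theta, 1))
```

Notice that no gradient is needed anywhere: the loop only compares returns, which is what makes hill climbing simpler than gradient ascent.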