# 5 – Policy Improvement

I hope you enjoyed implementing iterative policy evaluation in the first part of the mini project. Feel free to use your algorithm to evaluate a policy for any finite MDP of your choosing. You need not confine yourself to the Frozen Lake environment. Just remember that policy evaluation requires the agent to have full knowledge of the MDP. Then if you like, you can use the second part of the mini project to obtain the corresponding action value function. But for now, let’s continue our search for an optimal policy. Policy evaluation gets us partially there. After all, in order to figure out the best policy, it helps to be able to evaluate candidate policies. In this video, you’ll learn about policy improvement, an algorithm that uses the value function for a policy in order to propose a new policy that’s at least as good as the current one. But why would we want to do that? Well, immediately we can see how policy evaluation and policy improvement could go quite nicely together. Policy evaluation takes a policy and yields the value function. Then we can take that same value function and use policy improvement to get a potentially new and improved policy. Then it’s possible to take that new policy, plug it in to do policy evaluation again, then policy improvement over and over until we converge to an optimal policy. Let’s try to understand this better in the context of the grid world from earlier in the lesson. Remember that this was an episodic task where the only terminal state was the bottom right hand corner. We assume that the agent already knows everything about the environment. So it knows how transitions happen and how reward is decided. And in particular, it does not need to interact with the environment to get this information, but it still needs to determine the optimal policy. So say the agent starts off with an initial guess for the optimal policy. In a previous video, we saw that it made sense to start with the equal probable random policy, where in each state the agent randomly picks from a set of available actions. So based on this idea, the agent calculates the corresponding value function through iterative policy evaluation. We did this in an earlier video, where we saw that this was the value function. So now the question becomes, how might we design this policy improvement step to find a policy that’s at least as good as the current one? We’ll break policy improvement into two steps. The first step is to convert the state value function to an actual value function. You’ve already seen how to implement this. And when we only follow that same procedure on this small grid world, we get this action-value function. So now we need only focus on how to use this action-value function to obtain a policy that’s better than the equal probable random policy. So here’s the idea. For each state, we’ll just pick the action that maximizes the action-value function. So beginning with the state in the top left corner, one is greater than negative one, so the policy will be to go right, instead of down. Then five is greater than negative one, so the policy goes down here. And again, five is greater than negative one, so the policy goes right. You might be wondering at this point what would happen if we encountered a state with multiple choices of actions that maximize the action-value function. In this case you have options. You could arbitrarily pick one or construct a stochastic policy that assigns non-zero probability to any or all of them. We’ll go more into this soon. But let’s dig deeper into exactly why this should work. Remember that the value function corresponding to a policy just stores the expected return if the agent follows the policy for all time steps. And when we calculate the action-value function, we’re looking at what would happen if the agent at some time step t chose an action a and then henceforth followed the policy. And our method for constructing the next policy pi_prime looks at this action-value function and for each state determines that first action a that’s best for maximizing return. In this way, it follows that whatever expected return was associated to following the old policy pi for all time steps, the agent has higher expected return If it initially follows policy pi_prime and then henceforth follows policy pi. That said, it’s possible to prove not only is policy pi_prime better to follow for the first time step, it’s also better to follow for all time steps. In other words, it’s possible to prove that policy pi_prime is better than or equal to policy pi. And remember in order to prove this, we need to show that the value function for policy pi_prime is greater than or equal to the value function for policy pi for all states. If you’d like to see how to do this in more detail, you’re encouraged to check out the optional reading in the textbook. For now we can plug this into our algorithm for policy improvement where as you can see, the idea is that you’ll first calculate the action-value function from the state value function. Then, in order to construct a better policy for each state, we just need to pick one action that maximizes the action-value function. You’ll have the chance to implement this algorithm yourself soon.