So we’re working with this grid world example and looking for the best policy, the one that leads us to a goal state as quickly as possible. Let’s start with a very, very bad policy so that we can understand why it’s bad, and then work to improve it. Specifically, we’ll look at a policy where the agent visits every state in this very roundabout manner, and we can ignore the transitions that the agent will never take under the policy.
Now, towards understanding why this policy is bad, let’s calculate the cumulative reward that will result. If the agent starts in the top left corner of the world and follows this policy to get to the goal state, it just collects all of the reward along the way. So that’s negative one plus negative one, plus negative one again, and so on, and if I add up all the rewards along the way, I get negative six. Let’s say we’re not discounting, or in other words that the discount rate is one. We’ll keep track of this negative six, and remember that it represents the fact that if we start at the state in the top left corner, and then just follow the policy for all time steps, that results in a return of negative six.
But now, say the agent instead started one location over to the right. Then what return would be likely to follow under the same policy? Again, we just sum up all the rewards that the agent receives along the way, and when we do that, we get a return of negative five. Let’s also keep track of that. We can continue and do this for every state in the world. It makes sense to think of the goal state as resulting in a return of zero. After all, if the agent starts at the goal, the episode ends immediately and no reward is received.
In this way, no matter where the agent starts in the world, we have a way of keeping track of the return that follows. This way of analyzing this horrible policy will help us to improve it. But before we get into exactly how to do that, let’s attach a bit of notation and terminology to the process we just followed.
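The bookkeeping just described, summing the rewards from each starting state to the goal, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reward list is a hypothetical six-step path of negative-one rewards, chosen so that the first two totals match the negative six and negative five from the walkthrough, and it is not the actual grid. The function name `returns_along_path` and the `gamma` parameter are also made up for this sketch.

```python
def returns_along_path(rewards, gamma=1.0):
    """Return following each state on a fixed path, ending at the goal.

    rewards[i] is the reward collected on the step taken from the i-th
    state on the path. The result has one extra entry at the end for the
    goal state, whose return is zero because the episode ends there.
    """
    returns = [0.0]  # the goal state: no reward is received after it
    # Work backwards from the goal: G(state) = reward + gamma * G(next state)
    for reward in reversed(rewards):
        returns.append(reward + gamma * returns[-1])
    returns.reverse()
    return returns


# Hypothetical path (an assumption, not the lecture's grid): six steps,
# each with reward -1, and no discounting (gamma = 1).
rewards = [-1, -1, -1, -1, -1, -1]
values = returns_along_path(rewards)
print(values)  # [-6.0, -5.0, -4.0, -3.0, -2.0, -1.0, 0.0]
```

Working backwards from the goal means each state's return reuses the one already computed for the next state, so nothing has to be summed twice.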
You can think of this grid of numbers as a function of the environment state. For each state, it has a corresponding number, and we refer to this function as the state-value function. For each state, it yields the return that’s likely to follow if the agent starts in that state and then follows the policy for all time steps. It’s more common, though, to see it expressed with a bit more notation. Before I show you that notation, I warn you that it looks a bit complicated, but it’s equivalent to what we’ve already discussed. And here it is. The state-value function for a policy pi is a function of the environment state. For each state s, it tells us the expected discounted return if the agent starts in that state s and then uses the policy to choose its actions for all time steps. The state-value function will always correspond to a particular policy, so if we change the policy, we change the state-value function. We typically denote the function with a lowercase v, with the corresponding policy in the subscript.
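Written out, the definition just described is commonly stated as follows. The symbols S_t (the state at time step t), R_t (the reward at time step t), G_t (the return from time step t), and gamma (the discount rate) are the usual textbook conventions and are assumed here rather than taken from the lecture itself.

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right],
\qquad
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```

With a discount rate of one and a policy that always follows the same path, the expectation reduces to the plain sum of rewards from state s to the goal, which is exactly the grid of numbers built up earlier.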