Let’s build off the grid world example and investigate how we might determine the value function corresponding to a particular policy. To begin, we’ll enumerate the states: state S1 is the state in the top left corner, then S2, S3, and S4. Say we’re trying to evaluate the stochastic policy where the agent selects actions uniformly from the set of possible actions. For instance, in state S1, the agent either moves right or down, and it essentially decides by flipping a coin. That is, it moves right with 50 percent probability and otherwise moves down. In state S2, it moves left half of the time and moves down the other half of the time. The same goes for state S3, where the agent moves up with 50 percent probability and otherwise moves right. So, how about we determine the state value function corresponding to this policy? Let’s begin with recording what we know about state S1. Half the time, the agent moves right. In the event that the agent does move right, the expected return it collects can be calculated as negative one plus the value of the state S2 that it lands on. The other half of the time, the agent moves down, and the resultant expected return is negative three plus the value of the next state S3. So then, the value of state S1 can be calculated as the average of these two values. Since the agent chooses each of the potential actions with equal probability, you might recognize this equation as just the Bellman equation evaluated at state S1. Here we see that it lets us express the value of state S1 in terms of the values of its possible successor states. Continuing with state S2: if the agent moves left, the expected return is negative one plus the value of state S1, and if the agent moves down, the expected return is five plus the value of the terminal state S4. To get the value of state S2, we just take the average of these two values.
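As a quick sketch, the two Bellman equations just described can be written out in Python. The rewards and transitions are the ones quoted above; the function names are only illustrative:

```python
# Bellman equations at S1 and S2 under the equiprobable random policy
# (a sketch; rewards -1, -3, and 5 follow the numbers in the text,
# and the terminal state S4 always has value 0).

def bellman_s1(v2, v3):
    # Move right (reward -1, land in S2) or down (reward -3, land in S3),
    # each with probability 0.5; average the two expected returns.
    return 0.5 * (-1 + v2) + 0.5 * (-3 + v3)

def bellman_s2(v1, v4=0.0):
    # Move left (reward -1, back to S1) or down (reward 5, to terminal S4).
    return 0.5 * (-1 + v1) + 0.5 * (5 + v4)
```

Any candidate value function has to satisfy both equations simultaneously, which is what makes this a system of equations rather than a simple calculation.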
And here we’ve arrived at the Bellman equation again, but now for state S2. We can continue the trend for state S3, where we take into account that the agent can move up or right, and again take the average of the two values. Finally, the value of state S4 will always be zero, since it’s a terminal state: in the case that the agent starts at that state, the episode ends immediately and no reward is received. In this way, we see that we have one equation for each state, and we can directly solve this system of equations to get the value of each state. When you solve the system, you get these values, where the values of state S1 and of the terminal state are both zero, and each of the other two states has a value of two. For each state, we now have the expected return that’s likely to follow if the agent starts in that state and then follows the equiprobable random policy. The only problem here is that typically the state space is much larger, so directly solving the system of equations is more difficult. In that case, an iterative method for solving the system generally works better. So instead of solving the system directly, what we can do is start off with a guess for the value of each state. It doesn’t have to be any good, and what’s typically done is to set the value of each state to zero. Then we focus our attention on one state, say state S1, and we’d like to improve our guess for the value of that state. To do this, we’ll use the same Bellman equation from before, but we’ll adapt it so it works as an update rule. Here, the capital V denotes our current guess for the value function. We’ll use this update rule to obtain a new guess for the value of state S1. The idea is that we’ll update a guess with a guess, where the current guesses for the values of states S2 and S3 are used to obtain a new guess for the value of state S1.
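To make the direct approach concrete, here is a sketch that solves the three-equation system in Python, with v(S4) = 0 already substituted for the terminal state. The coefficient matrix is read off the equations described above, and the solver is plain Gaussian elimination so the example stays self-contained:

```python
# Rearranging the Bellman equations into linear form (v(S4) = 0 substituted):
#   v1 = 0.5*(-1 + v2) + 0.5*(-3 + v3)  ->   v1 - 0.5*v2 - 0.5*v3 = -2
#   v2 = 0.5*(-1 + v1) + 0.5*( 5 + 0 )  ->  -0.5*v1 + v2          =  2
#   v3 = 0.5*(-1 + v1) + 0.5*( 5 + 0 )  ->  -0.5*v1 + v3          =  2
A = [[ 1.0, -0.5, -0.5],
     [-0.5,  1.0,  0.0],
     [-0.5,  0.0,  1.0]]
b = [-2.0, 2.0, 2.0]

# Forward elimination.
n = len(A)
for i in range(n):
    for j in range(i + 1, n):
        factor = A[j][i] / A[i][i]
        for k in range(i, n):
            A[j][k] -= factor * A[i][k]
        b[j] -= factor * b[i]

# Back-substitution.
v = [0.0] * n
for i in range(n - 1, -1, -1):
    v[i] = (b[i] - sum(A[i][k] * v[k] for k in range(i + 1, n))) / A[i][i]

print(v)  # approximately [0.0, 2.0, 2.0]: v(S1) = 0, v(S2) = v(S3) = 2
```

This recovers the values stated in the text, but the cost of direct solving grows quickly with the number of states, which is exactly why the iterative method below is preferred in practice.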
So we plug in the estimates for the values of states S2 and S3, and when we do that, we get negative two, which is our new guess for the value of state S1, and we’ll record that new value. Then we’ll do the same for the second state, where we update the guess for the value of state S2 using the value of state S1. When we plug in the value, we get one as our new estimate for the value of state S2, and we’ll keep track of that. Then on to state S3: we update the value using our current estimate for the value of state S1. Our new estimate is one, which we can record. We won’t need to update the value of state S4, since it’s a terminal state, so it’ll always have a value of zero. But we can loop back to the first state and continue to proceed in this way, updating our guess based on the current guesses for the values of the other states. In this case, the value is updated to negative one, which we can record. Then we move on to states S2, S3, and so on, over and over, and it turns out that this iterative algorithm yields an estimate that converges to the true value function. This is the idea behind an algorithm known as Iterative Policy Evaluation, and we’ll go into much more detail soon.
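The sweeps described above can be sketched in Python as follows. The state names and the transition table are assumptions written down from the transcript’s description of the grid world; the updates are done in place, so each new guess immediately feeds the later updates within the same sweep, just as in the walkthrough:

```python
# Iterative policy evaluation for the four-state grid world (a sketch;
# each entry maps a state to its (probability, reward, next_state)
# outcomes under the equiprobable random policy, as described in the text).
transitions = {
    "S1": [(0.5, -1, "S2"), (0.5, -3, "S3")],  # right or down
    "S2": [(0.5, -1, "S1"), (0.5,  5, "S4")],  # left or down
    "S3": [(0.5, -1, "S1"), (0.5,  5, "S4")],  # up or right
}

# Initial guess: every state gets value zero (S4 is terminal and stays 0).
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0, "S4": 0.0}

for sweep in range(1000):
    delta = 0.0
    for s, outcomes in transitions.items():
        # "Update a guess with a guess": average of reward + current
        # estimate of the successor state's value, over both actions.
        new_v = sum(p * (r + V[s2]) for p, r, s2 in outcomes)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v  # in-place update, used by later states this sweep
    if delta < 1e-10:  # stop once the estimates have effectively converged
        break

print(V)  # converges to approximately {'S1': 0, 'S2': 2, 'S3': 2, 'S4': 0}
```

With the all-zero initial guess, the first sweep reproduces the numbers from the walkthrough: V(S1) becomes negative two, then V(S2) and V(S3) become one, and on the next sweep V(S1) is updated to negative one, with the estimates eventually converging to the true values of zero, two, two, and zero.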