So far we’ve been working with a simple Grid World example with four states. We assume the agent used the equiprobable random policy to interact with the environment. The agent collected two episodes, and now the question is: how exactly should the agent consolidate this information towards its goal of obtaining the optimal policy? Well, let’s see. The real question we want to ask is, for each state, which action is best? To answer that, maybe it makes sense to look at each state separately. To see this, let’s look at state two. In the first episode, we decided to go left, and the sum of the rewards collected from that point onward was eight. Then, we decided to go right and got a return of nine. In the next episode, we decided to go up and also ended up with a return of eight. When the agent selected to go right, still in the same state, it got a return of nine. So it looks like, since action right gives us more return than up or left, it should be better to go right, and this is the motivating idea behind the algorithm we’ll discuss in this lesson.

So far, this information was easy to get just because the problem is so small, but generally we’ll want to work with much bigger tasks with tons of actions and states, and so we need to formalize this. So what we’ll do is keep track of a table with one row for each non-terminal state and one column for each action. Then, for each episode, what we can do is store the return we got from selecting each action in each state. So for state two, when we selected action left, we got a return of eight, and when we selected action up, we also got a return of eight. In the event that there are multiple episodes where we selected the same action from a state, we’ll just take the average of the returns. So in this case, in state two, action right gave us a return of nine and then a return of nine again, and the average of nine and nine is nine.
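To make the bookkeeping concrete, here is a minimal sketch of that table in Python. It only uses the four returns for state two mentioned above. The `(state, action, return)` tuple layout and the names `observations` and `average_returns` are assumptions made for illustration, not something taken from the lesson.

```python
from collections import defaultdict

# (state, action, return) triples for state 2, read off the two episodes
# described above. The tuple layout is an assumption for illustration.
observations = [
    (2, "left", 8),   # episode 1
    (2, "right", 9),  # episode 1
    (2, "up", 8),     # episode 2
    (2, "right", 9),  # episode 2
]

def average_returns(observations):
    """Average the observed returns for each (state, action) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, action, ret in observations:
        totals[(state, action)] += ret
        counts[(state, action)] += 1
    # Pairs that were never visited simply have no entry yet.
    return {pair: totals[pair] / counts[pair] for pair in totals}

q_table = average_returns(observations)
print(q_table[(2, "right")])    # 9.0, the average of 9 and 9
print((2, "down") in q_table)   # False, no episode tried "down" in state 2
```

Keeping a running total and a count per pair means a new episode can be folded in without storing every past return.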
So this row is currently incomplete, but as long as we collect more episodes, we’ll eventually have a value here for action down, since the equiprobable random policy keeps trying every action. In the same way, we’ll fill in the table for the remaining state-action pairs. Once we’ve filled in the table, it makes sense that when the agent is trying to figure out which actions are the best for each state, it can refer to this table and just take the action that maximizes each row. So for state one it looks like action up is best, in state two we should select action right, and in state three action up. All of this yields this proposed policy, which we’ll denote by pi prime to distinguish it from our original policy pi. Now, as you can see, this is not the optimal policy, but what it does give us is a policy that is in some ways better than the random policy. And this small step of taking the random policy, using it to interact with the environment, and constructing a better policy will turn out to be quite important.
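Picking the action that maximizes each row can be sketched in a couple of lines of Python. This is only an illustration under assumptions: the nested-dictionary layout and the names `q_table`, `greedy_policy`, and `pi_prime` are hypothetical. Only the state-two values stated in the lesson are filled in, because the values for states one and three are not given here.

```python
# Estimated returns for state 2 only (left 8, up 8, right 9, as above).
# "down" is missing because no episode has tried it yet. States 1 and 3
# are left out since their table values are not stated in the lesson.
q_table = {
    2: {"left": 8.0, "up": 8.0, "right": 9.0},
}

def greedy_policy(q_table):
    """For each state, pick the action with the largest estimated return."""
    return {state: max(row, key=row.get) for state, row in q_table.items()}

pi_prime = greedy_policy(q_table)
print(pi_prime)  # {2: 'right'}
```

With a full table, the same function would return one action per non-terminal state, which is the pi prime described above.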