Say the agent interacts with the environment for four episodes. Then say we focus on one state-action pair in particular, which was visited in each episode, and we record the return obtained after visiting that pair in each episode. So in episode one, after the pair was visited, the agent got a return of two. In episode two, after the same pair was visited, the agent got a return of eight, and so on. Up until now, we've computed an estimated action value only after all of the episodes have finished. But what you are going to learn in this video is how to efficiently estimate the action values after each episode.

So let's clear the estimated values, and we'll build them gradually. The action value corresponding to the first episode is pretty simple: it's just the return that was obtained, which is two. Next, the action value after the second episode is just the average of two and eight, or five. Then the average of 2, 8, and 11 is seven, and the average of 2, 8, 11, and 3 is six.

So how can we populate these estimated action values in a computationally efficient way? Well, there's a really useful formula that I've pasted at the bottom of this slide. If you're interested in examining this equation in more detail, then please check out the optional instructor notes below the video. The way this equation works is that it calculates the estimated action value from the previous estimate and the most recently sampled return. We also have to keep track of the number of times we've visited the state-action pair. So in this case, we can plug in the values, and when we simplify all of that, we get five, an updated estimate for the action value. Moving on to the next episode, we can get the next estimated action value by just plugging in all of the values; when we simplify, we get seven. And we can recover the final estimate of six by plugging in the values one more time.
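The formula described here is the standard incremental-mean update, Q ← Q + (G − Q) / N, where G is the most recent sampled return and N is the visit count. A minimal sketch of the running calculation, using the four example returns from the video (variable names are my own, not from the lesson):

```python
# Incrementally estimate an action value from sampled returns,
# one episode at a time, using: Q_new = Q_old + (G - Q_old) / N
returns = [2, 8, 11, 3]  # returns from the four example episodes

Q = 0.0        # current action-value estimate
N = 0          # number of visits to the state-action pair
estimates = []
for G in returns:
    N += 1
    Q = Q + (G - Q) / N   # nudge the estimate toward the new return
    estimates.append(Q)

print(estimates)  # [2.0, 5.0, 7.0, 6.0]
```

Note that each update matches the running averages worked out on the slide (2, 5, 7, 6), but never needs to store or re-sum the full list of past returns.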
So, using this small example, we've seen that after each episode we can calculate a new action value estimate from the old estimate, the most recently sampled return, and the total number of visits to the state-action pair. As we've discussed, and as you'll see in the pseudocode below, performing the updates after each episode allows us to use the Q-table to update the policy after each episode, which makes our algorithm much more efficient.
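To make the episode-by-episode loop concrete, here is a hedged sketch of how the incremental update might feed a Q-table that the policy reads after every episode. The environment, episode format, and function names are hypothetical illustrations, not the course's actual pseudocode:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-table: (state, action) -> estimated action value
N = defaultdict(int)     # visit counts per state-action pair

def update_from_episode(episode, gamma=1.0):
    """Every-visit Monte Carlo update (a common variant): walk the episode
    backwards, accumulating the return G, and nudge each Q toward it."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[(state, action)] += 1
        Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

def greedy_action(state, actions):
    """Policy improvement step: pick the action with the highest estimate."""
    return max(actions, key=lambda a: Q[(state, a)])

# Toy episodes: each is a single step visiting state 's' with action 'a',
# so the reward equals the return (2, 8, 11, 3 as in the example).
for total in [2, 8, 11, 3]:
    update_from_episode([('s', 'a', total)])

print(Q[('s', 'a')])                 # 6.0
print(greedy_action('s', ['a', 'b']))  # 'a'
```

Because the Q-table is refreshed after every episode rather than after all episodes, the policy can start exploiting what it has learned immediately, which is the efficiency gain the lesson points to.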