Currently, your update step for policy evaluation looks a bit like this. You generate an episode, then for each state-action pair that was visited, you calculate the corresponding return that follows. Then, you use that return to get an updated estimate. We're going to look at this update step a bit closer with the aim of improving it.

You can think of it as first calculating the difference between the most recently sampled return and the corresponding value of the state-action pair. We denote that by delta-t, and you can think of it as an error term. After all, it's the difference between what we expected the return to be and what the return actually was. In the case that this error is positive, that means the return we received is more than what the value function expected. In this case, the action value is too low, so we use this update step to increase the estimate. On the other hand, if the error is negative, that means the return is lower than what the value function expected. So it makes sense to take this new evidence into account and decrease the estimate in the action-value function.

And exactly how much do we increase or decrease the estimate? Well, currently the algorithm changes it by an amount inversely proportional to the number of times we've already visited the state-action pair. So the first few times we visit the pair, the change is likely to be quite large. But at later time points, when the denominator of this fraction gets quite big, the changes get smaller and smaller. To understand why this would be the case, remember that this equation just calculates the average of all the sampled returns. So if you already have the average of 999 returns, then when you take into account the 1,000th return, it's not going to change the average much. With this in mind, we'll change the algorithm to instead use a constant step size, which I've denoted by alpha here.
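The two update rules described above can be sketched as follows. This is a minimal illustration, not code from the course; the function and variable names (Q, N, G) are my own shorthand for the action-value table, the visit counts, and a sampled return.

```python
def update_running_mean(Q, N, state, action, G):
    """Running-average MC update: step size 1/N keeps Q(s,a) equal to
    the average of all returns sampled so far for that pair."""
    N[(state, action)] = N.get((state, action), 0) + 1
    # The error term delta-t: sampled return minus current estimate.
    delta = G - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + delta / N[(state, action)]
    return Q, N

def update_constant_alpha(Q, state, action, G, alpha=0.1):
    """Constant step-size MC update: the change no longer shrinks with
    the visit count, so recent returns carry more weight."""
    delta = G - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * delta
    return Q
```

For example, feeding the returns 1.0 and then 3.0 into the running-mean version yields an estimate of 2.0, their average; the constant-alpha version with alpha = 0.5 instead lands at 1.75, pulled closer to the more recent return.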
This ensures that later returns are emphasized more than those that arrived earlier. In this way, the agent will mostly trust the most recent returns and gradually forget about those that came in the past. This is quite important because, remember, the policy is constantly changing, with every step becoming closer to optimal. So in fact, returns from later time steps are quite important for estimating the action values. I strongly encourage you to make this amendment to your algorithm for Monte Carlo policy evaluation.
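To see why a constant step size "forgets" old returns, note that after n updates the estimate is a weighted average in which the i-th return receives weight alpha * (1 - alpha)^(n - i), so older returns decay geometrically. A small numerical check of this claim (my own illustration, assuming the estimate starts at zero):

```python
def weight_on_return(alpha, n, i):
    """Weight that the n-th constant-alpha estimate places on the
    i-th sampled return (1-indexed), starting from an estimate of 0."""
    return alpha * (1 - alpha) ** (n - i)

alpha, n = 0.1, 5
weights = [weight_on_return(alpha, n, i) for i in range(1, n + 1)]
# The most recent return (i = n) gets the largest weight, alpha itself,
# and each older return's weight shrinks by a factor of (1 - alpha).
```

The weights, together with the weight (1 - alpha)^n left on the initial estimate, sum to exactly 1, which is what makes this a proper weighted average.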