So far in this lesson, we have discussed how the agent can take a bad policy, like the equiprobable random policy, use it to collect some episodes, and then consolidate the results to arrive at a better policy.
In the video in the previous concept, you saw that estimating the action-value function with a Q-table is an important intermediate step. We also refer to this as the prediction problem.
Prediction Problem: Given a policy, how might the agent estimate the value function for that policy?
We’ve been specifically interested in the action-value function, but the prediction problem also refers to approaches that can be used to estimate the state-value function. We refer to Monte Carlo (MC) approaches to the prediction problem as MC prediction methods.
As you have learned in the videos, in the algorithm for MC prediction, we begin by collecting many episodes with the policy. Then, we note that each entry in the Q-table corresponds to a particular state and action. To populate an entry, we use the return that followed when the agent was in that state, and chose the action. In the event that the agent has selected the same action many times from the same state, we need only average the returns.
Before we dig into the pseudocode, we note that there are two different versions of MC prediction, depending on how you decide to treat the special case where – in a single episode – the same action is selected from the same state many times. For more information, watch the video below.
As discussed in the video, we define every occurrence of a state in an episode as a visit to that state-action pair. And, in the event that a state-action pair is visited more than once in an episode, we have two options.
Option 1: Every-visit MC Prediction
Average the returns following all visits to each state-action pair, in all episodes.
Option 2: First-visit MC Prediction
For each episode, we only consider the first visit to the state-action pair. The pseudocode for this option can be found below.
Soon, you’ll have the chance to implement this algorithm yourself!
You will apply your code to OpenAI Gym’s BlackJack environment. Note that in the game of BlackJack, first-visit and every-visit MC return identical results!
First-visit or Every-visit?
Both the first-visit and every-visit method are guaranteed to converge to the true action-value function, as the number of visits to each state-action pair approaches infinity. (So, in other words, as long as the agent gets enough experience with each state-action pair, the value function estimate will be pretty close to the true value.) In the case of first-visit MC, convergence follows from the Law of Large Numbers, and the details are covered in section 5.1 of the textbook.
If you are interested in learning more about the difference between first-visit and every-visit MC methods, you are encouraged to read Section 3 of this paper. The results are summarized in Section 3.6. The authors show: