In our current algorithm for Monte Carlo control, we collect a large number of episodes to build the Q-table (as an estimate for the action-value function corresponding to the agent’s current policy). Then, after the values in the Q-table have converged, we use the table to come up with an improved policy.
Maybe it would be more efficient to update the Q-table after every episode. Then, the updated Q-table could be used to improve the policy. That new policy could then be used to generate the next episode, and so on.
So, how might we modify our code to accomplish this?
In this case, even though we update the policy before the values in the Q-table accurately approximate the action-value function, this lower-quality estimate still contains enough information to help us propose successively better policies. If you're curious to learn more, you can read Section 5.6 of the textbook.
The pseudocode can be found below.
There are two relevant tables:
- $Q$ – Q-table, with a row for each state and a column for each action. The entry corresponding to state $s$ and action $a$ is denoted $Q(s,a)$.
- $N$ – table that keeps track of the number of first visits we have made to each state-action pair.
The number of episodes the agent collects is equal to `num_episodes`.
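The two tables can be initialized as follows. This is a minimal sketch, assuming a discrete environment; the number of actions `nA` and the value of `num_episodes` are hypothetical placeholders:

```python
from collections import defaultdict
import numpy as np

nA = 4  # hypothetical number of actions in the environment

# Q-table: maps each state to an array of action-value estimates Q(s, a)
Q = defaultdict(lambda: np.zeros(nA))

# N-table: counts the number of first visits to each state-action pair
N = defaultdict(lambda: np.zeros(nA))

num_episodes = 5000  # hypothetical number of episodes to collect
```

Using a `defaultdict` means rows are created lazily, so we never need to enumerate the full state space up front.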
The algorithm proceeds by looping over the following steps:
- Step 1: The policy $\pi$ is improved to be $\epsilon$-greedy with respect to $Q$, and the agent uses $\pi$ to collect an episode.
- Step 2: $N$ is updated to count the total number of first visits to each state-action pair.
- Step 3: The estimates in $Q$ are updated to take into account the most recent information.
In this way, the agent is able to improve the policy after every episode!
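The three steps above can be sketched in Python. This is not the textbook's exact pseudocode, just one possible implementation under simple assumptions: an episode is a list of `(state, action, reward)` tuples, and the function names are illustrative:

```python
from collections import defaultdict
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    # Step 1: action probabilities for an epsilon-greedy policy
    # with respect to Q, for a single state
    nA = len(q_values)
    probs = np.full(nA, epsilon / nA)   # explore each action with prob epsilon/nA
    probs[np.argmax(q_values)] += 1.0 - epsilon  # extra mass on the greedy action
    return probs

def update_Q(episode, Q, N, gamma=1.0):
    # Steps 2 and 3: update first-visit counts and Q-estimates
    # from one collected episode of (state, action, reward) tuples
    states, actions, rewards = zip(*episode)
    discounts = np.array([gamma ** i for i in range(len(rewards) + 1)])
    visited = set()
    for i, (s, a) in enumerate(zip(states, actions)):
        if (s, a) not in visited:  # first visit to this state-action pair only
            visited.add((s, a))
            # discounted return following the first visit
            G = np.sum(rewards[i:] * discounts[:len(rewards) - i])
            N[s][a] += 1
            # incremental running-mean update of the estimate
            Q[s][a] += (G - Q[s][a]) / N[s][a]
    return Q, N
```

Each iteration of the outer loop would call `epsilon_greedy_probs` to sample actions while collecting an episode, then pass that episode to `update_Q`, so the policy improves after every single episode rather than after a long batch.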