So far, we’ve informally discussed how we might take a bad policy like the equiprobable random policy, use it to collect some episodes, and then use those episodes to come up with a better policy. Central to this idea is a table that stores the return obtained from visiting each state-action pair; that table can then be used to obtain a policy that’s better than the one we started with. In practice, this table is an estimate of how much return is likely to follow if the agent starts in a state, selects an action, and then uses the policy to select all future actions. Does this sound familiar? Well, let’s recall the definition of the action-value function for a policy. It specifies, for each state-action pair, the expected return that’s likely to follow if the agent starts in that state, selects that action, and then henceforth follows the policy. What’s important to note is that this is exactly what we estimate with the table. In particular, we use the table to estimate the action-value function corresponding to the equiprobable random policy. Remember that we denote the action-value function with a lowercase q, and it’s popular to refer to this table as a Q-table, with a capital Q. Now, this estimate isn’t that good, but the more samples we get, the better our estimate will become.
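To make this concrete, here is a minimal sketch in Python of building such a Q-table with every-visit Monte Carlo prediction under the equiprobable random policy. The tiny ToyEnv class, the function names, and the episode format are hypothetical stand-ins (not from the lecture) for any episodic environment with discrete states and actions; the point is just that averaging the sampled returns that follow each state-action pair gives an estimate of the action-value function, and the estimate improves as more episodes are collected.

```python
# A minimal sketch (not from the lecture): estimating a Q-table with
# every-visit Monte Carlo prediction for the equiprobable random policy.
import random
from collections import defaultdict


class ToyEnv:
    """Hypothetical chain environment: states 0, 1, 2 (terminal); actions 0, 1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0   # action 1 pays slightly better
        self.state += 1
        done = (self.state == 2)               # episode ends after two steps
        return self.state, reward, done


def generate_episode(env, n_actions):
    """Follow the equiprobable random policy and record (state, action, reward) triples."""
    episode, state, done = [], env.reset(), False
    while not done:
        action = random.randrange(n_actions)   # every action equally likely
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode


def mc_prediction_q(env, num_episodes, n_actions, gamma=1.0):
    """Every-visit MC estimate of the action-value function of the random policy."""
    returns_sum = defaultdict(float)   # total return observed after each (s, a)
    returns_count = defaultdict(int)   # number of visits to each (s, a)
    Q = defaultdict(float)             # the Q-table: average return per (s, a)
    for _ in range(num_episodes):
        episode = generate_episode(env, n_actions)
        G = 0.0
        # Work backwards so G accumulates the discounted return from each step onward.
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]
    return Q


Q = mc_prediction_q(ToyEnv(), num_episodes=5000, n_actions=2)
print(dict(Q))  # more episodes -> entries closer to the true action values
```

With enough episodes, the entries of Q settle near the true action values of the random policy in this toy environment; the same averaging scheme applies unchanged to larger state and action spaces.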