So far we’ve been working with the state value function for a policy. For each state s, it yields the expected discounted return if the agent starts in state s and then uses the policy to choose its actions for all time steps. You’ve seen a few examples and know how to calculate the state value function corresponding to a policy. In this concept, we’ll define a new type of value function known as the action-value function. This value function is denoted with a lowercase q instead of v. While the state values are a function of the environment state, the action values are a function of the environment state and the agent’s action. For each state s and action a, the action-value function yields the expected discounted return if the agent starts in state s, then chooses action a, and then uses the policy to choose its actions for all future time steps. Just like with the state value function, it will help your intuition if you calculate this yourself.

In the case of the state value function, we kept track of the value of each state with a number on the grid. We’ll do something similar with the action-value function, where we now need up to four values for each state, each corresponding to a different action. These four numbers correspond to the same state, but the one on top corresponds to action up, the one on the right corresponds to moving right before following the policy, and so on. You’ll see this soon. So, with the exception of the terminal state, I’ve broken up the figure to leave a space for keeping track of the value corresponding to each possible state and action.

Now let’s see if we can calculate some action values. We’ll begin with the state here. Let’s calculate the value corresponding to this state and action down. So the agent starts in this state, takes action down, and receives a reward of negative one. Then, for every time step in the future, it just follows the policy until it reaches the terminal state. We can then add up all the rewards that it encountered along the way, and when we do that, we get zero. So this zero corresponds to the action value for this state where the agent started and action down.

Let’s try calculating another action value, this time corresponding to this state and action up. So the agent starts in this state and takes action up, and it receives a reward of negative one. Then it just follows the policy for all future time steps. We add up the rewards it collected along the way, and this yields a cumulative reward of one. So this one corresponds to the action value for this state and action up.

You can continue and do this for every state-action pair that makes sense. And when you do this, you get this action-value function. I highly encourage you to calculate and check these values yourself. Before moving on, it’s important to revisit some information that you learned earlier. Remember that we have the notation v_star to refer to the optimal state value function. Similarly, we’ll refer to the optimal action-value function as q_star.
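In symbols, the action-value function for a policy π is q_π(s, a) = E_π[G_t | S_t = s, A_t = a], where G_t is the discounted return, just as v_π(s) = E_π[G_t | S_t = s]. The sketch below mirrors the hand calculation in this walkthrough for a deterministic environment: take the chosen action once, then follow the policy to the terminal state and add up the rewards. The grid layout, policy, helper names, and reward of negative one per step are hypothetical placeholders, not the exact figure from the lesson, so substitute your own.

# A minimal sketch of the hand calculation above, assuming a small
# deterministic grid world. The grid, policy, and rewards below are
# hypothetical stand-ins for the figure in the lesson.

def action_value(state, action, policy, step, reward, is_terminal, gamma=1.0):
    # q(s, a): take `action` once in `state`, then follow `policy`
    # until a terminal state, summing the (discounted) rewards.
    total, discount = 0.0, 1.0
    next_state = step(state, action)  # first step uses the chosen action
    total += discount * reward(state, action, next_state)
    state = next_state
    while not is_terminal(state):     # remaining steps follow the policy
        discount *= gamma
        a = policy[state]
        next_state = step(state, a)
        total += discount * reward(state, a, next_state)
        state = next_state
    return total

# Hypothetical 2x2 grid: states are (row, col) and (1, 1) is terminal.
policy = {(0, 0): "right", (0, 1): "down", (1, 0): "right"}
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    dr, dc = moves[a]
    return (min(max(s[0] + dr, 0), 1), min(max(s[1] + dc, 0), 1))

def reward(s, a, s_next):
    return -1  # assumed reward of negative one per step, as in the walkthrough

def is_terminal(s):
    return s == (1, 1)

print(action_value((0, 0), "down", policy, step, reward, is_terminal))  # prints -2.0 for this toy grid

Walking the agent through like this for every state-action pair fills in the grid of four numbers per state, and once every pair is filled in you have the complete action-value function for the policy.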