8 – Optimal Policies

Several concepts ago, I mentioned that we needed to define the action-value function before talking about how the agent could search for an optimal policy, and we will save most of the details for a later lesson. The main idea is this. The agent interacts with the environment. And from that interaction, it estimates the optimal action-value function. Then, the agent uses that value function to get the optimal policy. So this might all seem quite strange, but it will become much clearer in the next lesson when you implement this process yourself.

So for now, please bear with me and let’s continue to ignore the question of how the agent uses its experience to estimate the value function. In particular, let’s assume it already knows the optimal action-value function, but it doesn’t know the corresponding optimal policy. So how does it get the optimal policy? This is what we’ll explore in this video.

So we already have the optimal action-value function and you’ve seen some of the optimal policies already, but I’ve removed those hints here, so let’s try to reconstruct an optimal policy from the value function. It’s possible to show that for each state, we just need to pick the action that yields the highest expected return. So beginning with the state in the top left corner, the policy will go right instead of down, since 2 is larger than 0. Moving right, we see two values of 1 and one value of 3. 3 is the largest value, and so the policy will go right here. And we can continue in this way, always picking the action with the highest value. So 4 is greater than 2, and 5 is higher than 1 or 3. Next, 4 is the largest. We’ll skip over the state with three values of 1 because it’s not quite clear what to do here. But then 2 is larger than 0, and 5 is larger than 1.

So now back to the state with three values of 1. It turns out that to construct the optimal policy, we have our choice here. The agent could go up, down, or right, and all three choices would yield optimal policies. So let’s just say the policy decides to go right. And just like that, we’ve arrived at an optimal policy, and it’s worth summarizing what we’ve noticed here. If the agent has the optimal action-value function, it can quickly obtain an optimal policy, which is the solution to the MDP that we’re looking for. This brings us to the question of how the agent could find the optimal value function. This is in fact what we’ll explore for the remainder of this course.
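The procedure we just walked through, picking the highest-valued action in each state, can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation from the course: the state names `s0`, `s1`, `s2`, the helper names `optimal_actions` and `greedy_policy`, and the small table `q_star` are all assumptions made up for the example. Only the values in each row echo the three kinds of states from the video (2 versus 0, two 1s and a 3, and three 1s).

```python
def optimal_actions(action_values):
    """Return every action whose value equals the maximum for this state."""
    best = max(action_values.values())
    return [action for action, value in action_values.items() if value == best]


def greedy_policy(q_table):
    """Pick one highest-valued action per state.

    Ties are broken by taking the first tied action in the order the
    actions are listed; any tied action would be an equally valid choice.
    """
    return {state: optimal_actions(values)[0] for state, values in q_table.items()}


# Hypothetical optimal action values: state -> {action: value}.
q_star = {
    "s0": {"right": 2, "down": 0},             # 2 is larger than 0
    "s1": {"left": 1, "down": 1, "right": 3},  # 3 is the largest value
    "s2": {"right": 1, "up": 1, "down": 1},    # three-way tie
}

policy = greedy_policy(q_star)
print(policy)
print(optimal_actions(q_star["s2"]))
```

In the tied state, `optimal_actions` returns all three actions, which matches the observation that up, down, and right each lead to an optimal policy; `greedy_policy` simply settles on one of them.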
