8 – L611 Greedy Policies

So far, you’ve learned how an agent can take a policy like the equiprobable random policy, use that to interact with the environment, and then use that experience to populate the corresponding Q-table, and this Q-table is an estimate of that policy’s action-value function. So, now the question is, how can we use this in our search for an optimal policy? Well, we’ve already seen that to get a better policy, though not necessarily, and in fact probably not, the optimal one, we need only select, for each state, the action that maximizes the Q-table. Let’s call that new policy Pi-prime.

So, consider this: what if we replaced our old policy with this new policy, then estimated its value function, then used that new value function to get a better policy, and then continued alternating between these two steps over and over until we got successively better and better policies, in the hope that we converge to the optimal policy? It turns out that, unfortunately, this won’t work as it stands now, but we have almost all the tools to make it work. There’s really just one thing that we have to fix.

To discuss this fix, we have to introduce a bit of terminology. When we take a Q-table and use the action that maximizes each row to come up with a policy, we say that we are constructing the policy that’s greedy with respect to the Q-table, and that has some special notation that you can see at the bottom of this slide. We’ll plug this into the loop we started with, where the only thing that’s changed is that the notation is a bit fancier. We still begin with a starting policy, estimate its value function, and then get a new policy that’s greedy with respect to that value function. Then we have a new policy, and so on. Again, this proposed algorithm is so close to giving us the optimal policy, as long as we run it for long enough. But to fix it, we’ll need to slightly modify the step where we construct the greedy policy. We’ll discuss this in the next video.
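To make "greedy with respect to the Q-table" concrete, here is a minimal sketch of how you might read off the greedy policy pi-prime(s) = argmax over a of Q(s, a) from an estimated Q-table. The choice of Python with NumPy, the array shape (num_states, num_actions), and the example values are illustrative assumptions, not something specified in the lesson.

```python
import numpy as np

def greedy_policy(Q):
    """Construct the policy that is greedy with respect to a Q-table.

    Q is assumed to be a NumPy array of shape (num_states, num_actions),
    where Q[s, a] estimates the value of taking action a in state s.
    Returns a deterministic policy: for each state, the index of the
    action with the highest estimated value in that row.
    """
    return np.argmax(Q, axis=1)

# Hypothetical Q-table for 3 states and 2 actions (values made up for illustration).
Q = np.array([[1.0, 2.0],
              [4.0, 3.0],
              [0.5, 0.5]])

pi_prime = greedy_policy(Q)  # array([1, 0, 0]); ties go to the lowest-indexed action
```

In the loop described above, this argmax step plays the role of producing the new, greedy policy, which would then be evaluated to get an updated Q-table, and so on.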
