6 – Policy Iteration

At this point in the lesson, you’ve used policy evaluation to determine how good a policy is by calculating its value function. You’ve also used policy improvement, which takes the value function for a policy and constructs a new policy that’s better than or equal to the current one. I mentioned that it would make sense to combine these two algorithms into one that successively proposes better and better policies. The name for the algorithm that combines these two steps is policy iteration, and it’s our current focus.

The algorithm begins with an initial guess for the optimal policy. It makes sense to begin with the equiprobable random policy, where in each state every action is equally likely to be chosen. Then we use policy evaluation to obtain the corresponding value function. Next, we use policy improvement to obtain a better or equivalent policy. Then we just repeat this loop, policy evaluation followed by policy improvement, until we finally encounter a policy improvement step that doesn’t result in any change to the policy.

And what’s great is that in the case of a finite MDP, policy iteration is guaranteed to converge to the optimal policy. In the next concept, you’ll have the chance to combine all the code you’ve already written to finally help your agent use policy iteration to obtain an optimal policy.
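To make the loop concrete, here is a minimal sketch of policy iteration in plain Python. It is not the course's solution code. It assumes the MDP is given as a table `P[s][a]` holding `(probability, next_state, reward, done)` tuples, similar in shape to the `env.P` attribute of Gym's toy-text environments. The function names, the discount `gamma=0.9`, the threshold `theta`, and the three-state chain MDP at the bottom are all made up for illustration.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Estimate V for a stochastic policy, where policy[s][a] is a probability."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_s, reward, done in P[s][a]:
                    # Terminal transitions contribute no future value.
                    future = 0.0 if done else V[next_s]
                    v += action_prob * prob * (reward + gamma * future)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V


def policy_improvement(P, V, gamma=0.9):
    """Build a deterministic policy that is greedy with respect to V."""
    policy = []
    for s in range(len(P)):
        q = []
        for a in range(len(P[s])):
            q.append(sum(prob * (reward + gamma * (0.0 if done else V[next_s]))
                         for prob, next_s, reward, done in P[s][a]))
        best = q.index(max(q))  # ties go to the lowest action index
        policy.append([1.0 if a == best else 0.0 for a in range(len(q))])
    return policy


def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate evaluation and improvement until the policy stops changing."""
    n_actions = len(P[0])
    # Initial guess: the equiprobable random policy.
    policy = [[1.0 / n_actions] * n_actions for _ in range(len(P))]
    while True:
        V = policy_evaluation(P, policy, gamma, theta)
        new_policy = policy_improvement(P, V, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
# Stepping right from state 1 reaches terminal state 2 and pays reward 1.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

policy, V = policy_iteration(P)
print(policy)
print(V)
```

On this toy chain the loop stops after the second improvement step, with both non-terminal states choosing to move right. One design note: comparing policies for exact equality works here because ties are always broken the same way. On larger problems where several actions have nearly equal value, it can be safer to stop when the value function stops changing by more than a small tolerance.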
