7 – Truncated Policy Iteration

Congratulations! You’ve implemented your first algorithm that can solve an MDP. For the remainder of this lesson, we’ll look at some variations of this algorithm, and you’ll have the chance to implement all of them to compare their performance. We’ll begin by looking at this policy evaluation step. Remember that policy evaluation is an iterative algorithm where we rely on a Bellman update rule. The number of iterations needed for a single round of policy evaluation is decided according to some small positive number theta that you set. And the closer you wanted your approximate state by your function to the true value function, the smaller you had to make this hyperparameter theta. But now the question is, can we sacrifice some accuracy here? After all, who’s to say how long this policy evaluation algorithm should take. Smaller theta means it will take longer, but exactly how much longer will depend on your MDP. Maybe, instead of terminating the algorithm with this stopping criterion, we can give an absolute number of iterations that were willing to calculate. So we could modify this policy evaluation step to instead terminate after updating the values of each of the states, maybe once, twice, or some positive integer number of times. The idea is that if our goal is to get an optimal policy, we aren’t prevented from doing that by having a value function that’s a little bit off, so we don’t need to wait for a really good estimate of the value function before calculating an improved policy. For instance, say this is the optimal actual value function for a hypothetical MDP. One optimal policy then can be obtained by looking at each state separately, and then picking the action that maximizes the function. But say we don’t have this optimal action value function, say instead we’re working with a wildly incorrect estimate, where all the values are way off, but the relative values between many state action pairs are correct. Then if we use this technically incorrect value function to obtain a policy, that policy will in fact be the same optimal policy, but that’s just the main idea. We don’t need to have a perfect or even near perfect estimate of the value function to get an optimal policy.

%d 블로거가 이것을 좋아합니다: