8 – Value Iteration

So we have talked about policy iteration. We have also learned about truncated policy iteration, where the policy evaluation step is permitted only a limited number of sweeps through the state space. In other words, we limit the number of times the estimated value of each state is updated before proceeding to the policy improvement step. In this video, you will learn about value iteration, where the policy evaluation step is stopped after a single sweep. It is really as simple as that, but it turns out that you can simplify the underlying code, and we will derive that simpler form together over the remainder of this video.

So let us write down the pseudocode corresponding to each of these evaluation and improvement steps. The first thing to notice is that we can collapse these lines to remove a bit of redundancy. In particular, the action value from the first line can be plugged in where it appears in the second line. Now recall that this policy improvement step is followed by another policy evaluation step: pi prime is used as the new policy pi, which is then put through a single sweep of policy evaluation. So we can quickly see another bit of redundancy here. Instead of updating pi prime and then immediately using it as the new policy pi, we can work with the value function directly.

Next, we need to look at these two equations a bit more closely. When we do, we notice that they are nearly the same. In particular, everything that appears after the sum is nearly identical, with the exception of what appears here. It turns out that we can combine these two lines into a single line that accomplishes the same effect. Toward that, first note that pi of S is the argmax over all actions of a somewhat complicated expression, but we can think of that long expression as a function of the action A. Then V of S is assigned that same function, evaluated at pi of S. So if pi of S is set to the argmax, and V of S is set to the function evaluated at that argmax, then we can just directly set V of S to the max of the function over actions. Now we can plug this into our algorithm.

We begin with an initial guess of the value function. The algorithm loops over the state space and successively applies the update rule to get a better guess at the value function. After each update, we check to see if the value function estimate has converged, and if it has, we stop this part of the algorithm. After convergence, there is one last step to get the policy corresponding to the final value function. Once you are ready, follow the instructions in the next concept to implement this algorithm yourself.
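In symbols, the combined update described above is V(s) ← max over a of the sum over s' and r of p(s', r | s, a) [r + gamma * V(s')]. Below is a minimal sketch in Python of the algorithm as described: an initial guess, repeated sweeps with this update, a convergence check, and a final policy-extraction step. The names (value_iteration, transitions, gamma, theta) and the representation of the dynamics are illustrative assumptions, not taken from the video.

```python
# A minimal sketch of value iteration for a small finite MDP.
# The function names and data layout here are assumptions for illustration.

def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """transitions(s, a) -> list of (probability, next_state, reward) tuples."""
    V = {s: 0.0 for s in states}  # initial guess of the value function
    while True:
        delta = 0.0
        for s in states:  # one sweep over the state space
            old_v = V[s]
            # Combined evaluation/improvement update: take the max over actions.
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a))
                for a in actions
            )
            delta = max(delta, abs(old_v - V[s]))
        if delta < theta:  # stop once the value estimates have converged
            break

    # One last step: extract the greedy policy from the final value function.
    policy = {
        s: max(
            actions,
            key=lambda a: sum(
                p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a)
            ),
        )
        for s in states
    }
    return V, policy
```

The threshold theta plays the role of the convergence check mentioned above; making it smaller trades additional sweeps for a more accurate value estimate.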
