6 – Optimality

So far in this lesson, we’ve looked at a particular policy Pi and calculated its corresponding value function. In the quiz, you calculated the value function corresponding to a different policy, which we denoted by Pi-Prime. If you look at each of these value functions, you may notice a pattern or trend. Take the time now to compare them, and pause the video if you like. In reality, there are probably a great number of patterns in these numbers, but the most relevant trend for us right now is that when we look at any state in particular and compare the two value functions, the value function for policy Pi-Prime is always greater than or equal to the value function for policy Pi. For instance, 2 is greater than negative 6, 3 is greater than negative 5, 4 is greater than negative 4, and 1 is equal to 1. So this says that, for any state in the environment, it’s better to follow policy Pi-Prime. No matter where the agent starts in the grid world, their expected discounted return is larger, and remember that the goal of the agent is to maximize return. So a greater expected return makes for a better policy.

This motivates an important definition. By definition, we say that a policy Pi-Prime is better than or equal to a policy Pi if its state-value function is greater than or equal to that of policy Pi for all states. There are a couple of important things to note about this definition. The first is that if you take any two policies, it’s not necessarily the case that you’ll be able to decide which is better; in other words, it’s possible that they can’t be compared. That said, there will always be at least one policy that’s better than or equal to all other policies. We call this policy an optimal policy. It’s guaranteed to exist, but it may not be unique, and it’s important to note that an optimal policy is what the agent is searching for: it’s the solution to the MDP and the best strategy to accomplish its goal. Finally, all optimal policies have the same value function, which we denote by V-star. You might be wondering why it’s not written V sub Pi-star. The answer is by convention, and probably because it just looks nicer this way.

On that note, it turns out that the policy from the quiz is actually an optimal policy. This is because if you compare its value function to the value function of any other possible policy, it’s always at least as big. But it’s not the only optimal policy; for instance, here’s another policy that has the same value function. You might be wondering at this point how I found these optimal policies. Well, this example was simple enough that it was possible to make an educated guess just by staring at the dynamics, but that will often not be the case, as many of the MDPs we’ll look at will be far more complicated. So how might we determine an optimal policy for a much more complicated MDP? To answer this question, we’ll need to define another type of value function.
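To make the two definitions from this part of the lesson concrete, here they are written out symbolically. This is a sketch using conventional notation rather than anything shown on screen: lowercase v denotes a state-value function and S denotes the set of states.

\pi' \geq \pi \quad \Longleftrightarrow \quad v_{\pi'}(s) \geq v_{\pi}(s) \ \text{ for all } s \in \mathcal{S}

v_{*}(s) \doteq \max_{\pi} v_{\pi}(s) \ \text{ for all } s \in \mathcal{S}

The first line is the ordering on policies described above (a partial order, since two policies may be incomparable), and the second line is the shared value function V-star of every optimal policy.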
