# 7 – M2L3 02 V2

You may be wondering why we need to find optimal policies directly when value-based methods seem to work so well. There are three arguments we'll consider: simplicity, stochastic policies, and continuous action spaces.

Remember that in value-based methods like Q-learning, we invented the idea of a value function as an intermediate step toward finding the optimal policy. It helps us rephrase the problem in terms of something that is easier to understand and learn. But if our ultimate goal is to find that optimal policy, do we really need this value function? Can we directly estimate the optimal policy? What would such a policy look like? If we go with a deterministic approach, the policy simply needs to be a mapping, or function, from states to actions. In the stochastic case, it would be the conditional probability of each action given a certain state; we would then choose an action based on this probability distribution. This is simpler in the sense that we get directly to the problem at hand, and it also avoids having to store a bunch of additional data that may or may not always be useful. For example, large portions of the state space may have the same value. Formulating the policy in this manner lets us make such generalizations where possible and focus more on the complicated regions of the state space.

One of the main advantages of policy-based methods over value-based methods is that they can learn true stochastic policies. This is like picking a random number from a special kind of machine, one where the chances of each number being selected depend on some state variables that can be changed. In contrast, applying epsilon-greedy action selection to a value function does add some randomness, but it's a hack: flip a biased coin; if it's heads, follow the deterministic greedy policy; otherwise, pick an action uniformly at random. The underlying value function can still drive us toward certain actions more than others.
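To make the stochastic case concrete, here is a minimal sketch of a policy that maps state features directly to a probability distribution over actions, pi(a|s), and samples from it. The linear-softmax form and all the names here are illustrative assumptions, not a specific algorithm from the lesson.

```python
import numpy as np

def softmax_policy(state_features, weights):
    """Hypothetical linear-softmax policy: maps state features directly
    to a probability distribution over actions, pi(a|s)."""
    logits = state_features @ weights          # one logit per action
    logits -= logits.max()                     # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

rng = np.random.default_rng(0)
state = np.array([1.0, 0.5])                   # toy 2-feature state
weights = rng.normal(size=(2, 3))              # 2 features -> 3 actions
probs = softmax_policy(state, weights)         # a valid distribution over 3 actions
action = rng.choice(3, p=probs)                # choose an action by sampling
```

Unlike epsilon-greedy, the action probabilities themselves are what gets learned, so the policy can be as random or as deterministic as the state demands.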
Let’s see how this can be problematic. Say you’re learning to play rock-paper-scissors: scissors cut paper, paper covers rock, and rock breaks scissors. Your opponent reveals their move at the same time as you, so you can’t really use it to decide what to pick. In fact, it turns out that the optimal policy here is to choose an action uniformly at random. Anything else, such as a deterministic policy, or even a stochastic policy with some non-uniformity, can be exploited by the opponent.

Another situation where a stochastic policy helps is when we have aliased states, that is, two or more states that we perceive to be identical but are actually different. In this sense, the environment is partially observable, and such situations can arise more often than you might think. Consider this grid world, for instance, which consists of smooth white and rough gray cells. There is a banana in the bottom-middle cell, and a chili in each of the bottom-left and bottom-right corners. Obviously, agent George needs to find a reliable policy to get to the banana and avoid landing on the chilies, no matter what cell he starts in. But here’s the catch: all he can sense is whether the current cell is smooth or rough, and whether it has a wall on each side. Think of these as the only observations or features available from the environment; he can’t sense anything about neighboring cells either.

When George is in the top-middle cell, he can sense that it’s smooth and has only a single wall, to the north. So he can reliably take a step downward and reach the banana. When he’s in either the top-left or top-right cell, he can sense the side walls, so he can recognize which of the two extreme cells he is in and can learn to reliably step away from the chilies and toward the center. The trouble is, when he’s in one of the rough gray cells, he can’t tell which one he’s in; all the features are identical.
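We can verify the rock-paper-scissors claim with a few lines of arithmetic: against a uniform policy, an opponent's best response earns nothing, while any bias can be exploited. The payoff matrix and the example bias below are my own illustration, not from the lesson.

```python
import numpy as np

# Payoff TO THE OPPONENT: rows = our action, cols = opponent's action.
# Order: rock, paper, scissors. +1 means the opponent wins, -1 they lose.
PAYOFF = np.array([[ 0,  1, -1],
                   [-1,  0,  1],
                   [ 1, -1,  0]])

def best_response_value(our_policy):
    """Opponent's expected payoff when they best-respond to our mixed policy."""
    expected = our_policy @ PAYOFF   # opponent's expected payoff per pure response
    return expected.max()

uniform = np.array([1/3, 1/3, 1/3])
biased  = np.array([0.5, 0.25, 0.25])   # slightly favors rock

print(best_response_value(uniform))  # 0.0  -- uniform cannot be exploited
print(best_response_value(biased))   # 0.25 -- opponent counters with paper
```

The more our policy deviates from uniform, the more a best-responding opponent gains, which is exactly why a deterministic (or near-deterministic) policy fails here.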
If he’s using a value function representation, then the values of these cells are equal, since they both map to the same state, and the corresponding best action must also be identical. So, depending on his experiences, George might learn to go either right or left from both cells, resulting in an overall policy like this, which is fine except for this region: George will keep oscillating between these two cells and never get out. With a small epsilon in an epsilon-greedy policy, George might escape by chance, but that’s very inefficient and could take arbitrarily long. And if he kept a high epsilon, that might result in bad actions in other states. We can clearly see that the other value-based policy, going the opposite way, would not be ideal either. The best he can do is assign equal probability to moving left or right from these aliased states; that way, he’s much more likely to get out of the trap quickly. A value-based approach tends to learn a deterministic or near-deterministic policy, whereas a policy-based approach in this situation can learn the desired stochastic policy.

Our final reason for exploring policy-based methods is that they are well suited to continuous action spaces. When we use a value-based method, even with a function approximator, our output consists of a value for each action. If the action space is discrete and there are a finite number of actions, we can easily pick the action with the maximum value. But if the action space is continuous, then this max operation turns into an optimization problem itself. It’s like trying to find the global maximum of a continuous function, which is non-trivial, especially if the function is not convex. A similar issue exists in high-dimensional action spaces: lots of possible actions to evaluate. It would be nice if we could map a given state to an action directly.
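A tiny simulation makes the aliased-state argument tangible. Below, George walks the top row of a hypothetical five-cell corridor with the banana reachable from the center cell; the two gray cells are aliased, so one shared left-probability governs both. The layout and probabilities are my simplification of the grid world described above, not its exact geometry.

```python
import random

def escape_steps(start, left_prob, max_steps=10_000):
    """Steps for George to reach the goal cell (index 2) on a row
    [corner, GRAY, goal, GRAY, corner]. The gray cells (1 and 3) are
    aliased, so a single left_prob governs the action in both. Corners
    are distinguishable by their walls, so he always steps back toward
    the center from them."""
    pos = start
    for step in range(1, max_steps + 1):
        if pos == 2:
            return step
        if pos in (1, 3):                      # aliased gray cells: sample an action
            pos += -1 if random.random() < left_prob else 1
        elif pos == 0:                         # left corner: step right
            pos = 1
        else:                                  # right corner: step left
            pos = 3

random.seed(0)
trials = 1000
# Near-deterministic "go left" (like a greedy policy with small epsilon)
# traps George when he starts in the left gray cell; 50/50 does not.
near_determ = sum(escape_steps(1, left_prob=0.95) for _ in range(trials)) / trials
stochastic  = sum(escape_steps(1, left_prob=0.5)  for _ in range(trials)) / trials
print(near_determ, stochastic)   # near-deterministic is far slower on average
```

Starting in the left gray cell, "mostly go left" bounces George off the corner over and over, while the 50/50 policy escapes in a handful of steps on average.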
Even if the resulting policy is a bit more complex, it would significantly reduce the computation time needed to act, and that’s something a policy-based method can enable.
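As a sketch of that direct mapping, here is a hypothetical deterministic policy for a one-dimensional continuous action (say, a torque in [-2, 2]). The linear-plus-tanh form, the bounds, and every name here are illustrative assumptions; the point is only that acting costs a single forward pass instead of an argmax over infinitely many actions.

```python
import numpy as np

def continuous_policy(state, weights, action_low=-2.0, action_high=2.0):
    """Hypothetical direct state-to-action mapping for a continuous
    action space. One forward pass replaces the inner optimization
    over actions that a value-based method would need at every step."""
    raw = np.tanh(weights @ state)             # squash output into (-1, 1)
    # rescale into the environment's action bounds
    return action_low + (raw + 1.0) * 0.5 * (action_high - action_low)

rng = np.random.default_rng(1)
weights = rng.normal(size=(1, 3))              # 3 state features -> 1 action dim
state = np.array([0.2, -0.4, 1.0])
action = continuous_policy(state, weights)     # always within [-2, 2]
```

The same idea scales to multi-dimensional actions by widening the output, which is exactly where searching for a maximum of the value function becomes impractical.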