8 – Discounted Return

We’ve discussed how an agent might choose actions with the goal of maximizing expected return, but we need to dig a bit deeper. For instance, consider our puppy agent: how does he predict how much reward he could get at any point in the future? Puppies can live for decades. Can he really be expected to have just as clear an idea of how much reward he’ll get five years from now as he does of how much he’ll get right now? It’s also not entirely clear what the future holds, especially if the puppy is still learning, proposing and testing hypotheses, and changing his strategy. It’s unlikely that he’ll know one thousand time steps in advance what his reward potential is likely to be. In general, the puppy is likely to have a much better idea of what’s likely to happen in the near future than at distant time points.

Along these lines, should present reward carry the same weight as future reward? Maybe it makes more sense to value rewards that come sooner more highly, since those rewards are more predictable. That is, if I told you, “I’ll either definitely give you a marshmallow now, or probably give you one a day from now,” wouldn’t you prefer to just have it now? Whatever today’s marshmallow is worth to you, tomorrow’s marshmallow is probably only worth a percentage of that; say, 90 percent, or 10 percent less than today’s marshmallow. After all, there’s a chance you might not even get it. If we continue the trend with the day after tomorrow’s marshmallow, which is even less guaranteed, it makes sense that it’s worth even less.

This situation motivates the idea of discounting and discounted return. Remember that the goal of the agent is always to maximize cumulative reward, and towards this end, at an arbitrary time step t, it can choose the action that maximizes the return. Currently, each time step from t plus one onward has an equal say in how the agent should make decisions. What if instead we wanted time steps that occur earlier in time to have a much greater say? Then, instead of maximizing the original sum, the idea is that we’ll maximize a different sum, in which rewards that are farther along in time are multiplied by smaller values. We refer to this sum as the discounted return. By discounted, we mean that we’ll change the goal to care more about immediate rewards than about rewards received further in the future.

But how do we choose what values to use here? In practice, we define what’s called a discount rate, which is always denoted by the Greek letter gamma and is always a number between zero and one. As for the values, the first reward is left as it is, the second is multiplied by gamma, the third by gamma squared, then gamma to the third power, and so on. In this way, we get a nice decay where rewards that occur earlier in time are always multiplied by a larger number. It’s important to note that gamma is not something that’s learned by the agent; it’s something that you set to refine the goal that you have for the agent.

So how exactly might you set the value of gamma? Let’s begin by looking at what happens when we set gamma to one. We plug in one everywhere we see gamma, and we see that it yields the original, completely undiscounted return from the previous videos. And what about when gamma is set to zero? In this case, every term in the sum disappears with the exception of the most immediate reward. In this way, we see that the larger you make gamma, the more the agent cares about the distant future, and as gamma gets smaller and smaller, we get increasingly extreme discounting, where in the most extreme case the agent only cares about the most immediate reward.
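To make this concrete, here’s a minimal Python sketch (not part of the original lecture, and using a made-up reward sequence) of how a discounted return could be computed under the convention above, where the first reward is left alone and each later reward is multiplied by a higher power of gamma:

```python
def discounted_return(rewards, gamma):
    """Sum the rewards received after time step t, weighting the k-th
    reward (0-indexed) by gamma**k, so later rewards count for less."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]          # hypothetical reward sequence
print(discounted_return(rewards, 1.0))  # 4.0   -> gamma = 1 recovers the plain, undiscounted sum
print(discounted_return(rewards, 0.0))  # 1.0   -> gamma = 0 keeps only the most immediate reward
print(discounted_return(rewards, 0.9))  # ~3.44 -> an intermediate gamma weights later rewards less
```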
It’s important to note that discounting is particularly relevant to continuing tasks. Remember that a continuing task is one where the agent-environment interaction goes on without end. In that case, maximizing cumulative reward is a pretty difficult task if the future is limitless, so we use discounting to avoid having to look too far into that limitless future (the short sketch at the end of this section shows why this works). But it’s important to note that, with or without discounting, the goal is always the same: it’s always to maximize cumulative reward. The discount rate comes in when the agent chooses actions at an arbitrary time step; it uses the discount rate as part of its program for picking actions, and that program is more interested in securing rewards that come sooner and are more likely than rewards that come later and are less likely. You’ll learn more about how exactly the agent should select actions in the next lesson. For now, we’ll focus on fully specifying the reinforcement learning problem.
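Here is that sketch: a hypothetical continuing task that hands out a constant reward of one on every step (the constant reward is my own assumption, just to keep the arithmetic obvious). With gamma less than one, the discounted return is a geometric series that stays finite even though the task never ends:

```python
# Hypothetical continuing task: reward of 1.0 on every step, forever.
# The discounted return 1 + gamma + gamma^2 + ... is a geometric series
# that converges to 1 / (1 - gamma) whenever gamma < 1.
gamma = 0.9
truncated = sum(gamma ** k for k in range(1000))  # cut off after 1000 steps
print(truncated)             # ~10.0 (the tail beyond 1000 steps is negligible)
print(1.0 / (1.0 - gamma))   # 10.0, the closed-form value of the infinite sum
```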
