4 – The Reward Hypothesis

We’ve discussed the diverse applications of reinforcement learning. Each has a defining agent and environment, and each agent has a goal, ranging from a car learning to drive itself to an agent learning to play Atari games. It’s truly amazing that all of these very different goals can be addressed with the same theoretical framework.

So far, we’ve made sense of the idea of reward from the perspective of a puppy that interacts with its owner. In this case, the state at each time step was the command that the owner communicated to the puppy, the action was the puppy’s response, and the reward was just the number of treats. And like a good reinforcement learning agent, the puppy seeks to maximize that reward. Here, the idea of reward comes naturally, and it lines up well with the way we think about teaching a puppy.

But in fact, the reinforcement learning framework has any and all agents formulate their goals in terms of maximizing expected cumulative reward. So what could reward mean in the context of something like a robot learning to walk? Maybe we could think of the environment as a type of trainer that watches the robot’s movements and rewards it for having good walking form. But then the reward that it gives has the potential to be highly subjective and not scientific at all. I mean, what makes a walk good? And what makes it bad? How do we address this? In general, how do we specify reward to describe any of a number of potential goals that our agents could have?

Well, before we answer this question, let’s take one step back. It’s important to note that the word “reinforcement” in “reinforcement learning” is a term originally from behavioral science. It refers to a stimulus that’s delivered immediately after a behavior to make the behavior more likely to occur in the future. The fact that this name is borrowed is no coincidence.
In fact, it’s an important defining hypothesis in reinforcement learning that we can always formulate an agent’s goal along the lines of maximizing expected cumulative reward. We call this hypothesis the “Reward Hypothesis”. If this still seems weird or uncomfortable to you, you are not alone. But allow me to convince you in the next video.
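
To make “maximizing expected cumulative reward” a little more concrete, here is a minimal Python sketch of the puppy example. It is only an illustration: the treat table, the two policies (`obedient`, `always_sit`), and the helper names are all made up for this sketch and do not come from the lecture. The state is the owner’s command, the action is the puppy’s response, and the reward is the number of treats.

```python
import random

# Hypothetical treat table: (command, response) -> number of treats.
# These values are assumptions chosen purely for illustration.
TREATS = {
    ("sit", "sit"): 1,
    ("sit", "bark"): 0,
    ("roll", "roll"): 2,
    ("roll", "sit"): 0,
}


def run_episode(policy, commands):
    """Return the cumulative reward (total treats) for one episode.

    Each command is a state; the policy maps it to a response (action).
    """
    total = 0
    for command in commands:
        response = policy(command)
        total += TREATS.get((command, response), 0)
    return total


def obedient(command):
    """Hypothetical policy: always do what the owner asks."""
    return command


def always_sit(command):
    """Hypothetical policy: sit no matter what the owner asks."""
    return "sit"


def expected_return(policy, episodes=1000, length=5, seed=0):
    """Estimate expected cumulative reward by averaging over random episodes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        commands = [rng.choice(["sit", "roll"]) for _ in range(length)]
        total += run_episode(policy, commands)
    return total / episodes
```

Under these assumed treat counts, comparing `expected_return(obedient)` with `expected_return(always_sit)` shows which behavior collects more treats on average, and that comparison is the sense in which the agent’s goal is stated as maximizing expected cumulative reward.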
