2 – L602 Gridworld Example

To illustrate the algorithms we’ll discuss in this lesson, it’ll help to work with a small example of a reinforcement learning task. So, say we have an agent in a world with only four possible states, marked here by stone, brick, wood, or grass. Say that at the beginning of an episode, the agent always starts in state one, and its goal is to reach state four, which is a terminal state. But that seems a bit too easy. So, to add a bit of difficulty, we’ll add a big wall separating state one from state four.
At each time step, the agent can move up, down, left, or right. Let’s say those actions mostly do what they’re supposed to do. So, if the agent is in state one and selects action up, it’s highly likely that it moves up and ends up in state two. But let’s say the world is really slippery; maybe the ground is covered in ice or bananas, for instance. So, it’s also possible that the agent tries to go up but ends up slamming into a wall instead. In general, if the agent decides to move in some direction, up for instance, it moves in that direction with 70 percent probability, and moves in each of the other three directions with 10 percent probability. If the agent runs into a wall, then at the next time step it just ends up in the same state where it started.
So let’s see. We have four states and four actions, and we know how the environment decides the next state. So we’ve almost completely specified how the environment should work, but we still need to talk about how the agent receives reward. Let’s say that the agent gets a reward of negative one for most transitions, but if it lands in the terminal state, it gets a reward of 10. This ensures that the goal of the agent will be to get to that terminal state as quickly as possible. For simplicity, let’s say that the discount rate is one. In other words, we won’t discount. Now, the task is completely defined.
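If it helps to see the task written down, here is a minimal Python sketch of the environment just described. The 2x2 layout is an assumption, since the description only says that up from state one leads to state two and that a wall separates states one and four. The sketch puts state one at the bottom left, state two at the top left, state three at the top right, and state four at the bottom right. The names `MOVES`, `transition_probs`, `reward`, and `step` are made up for this illustration.

```python
import random

ACTIONS = ["up", "down", "left", "right"]
TERMINAL = 4   # state four ends the episode
GAMMA = 1.0    # discount rate of one, so no discounting

# Assumed 2x2 layout (not fully specified in the lesson):
#   2 3
#   1 4   with a wall between states 1 and 4.
# Any move not listed here hits a wall, so the agent stays put.
MOVES = {
    1: {"up": 2},
    2: {"down": 1, "right": 3},
    3: {"left": 2, "down": 4},
}

def transition_probs(state, action):
    """Return {next_state: probability} for a non-terminal state."""
    probs = {}
    for direction in ACTIONS:
        # The chosen direction happens with 70 percent probability,
        # and each of the other three with 10 percent probability.
        p = 0.7 if direction == action else 0.1
        next_state = MOVES[state].get(direction, state)
        probs[next_state] = probs.get(next_state, 0.0) + p
    return probs

def reward(next_state):
    """Reward of 10 for landing in the terminal state, else -1."""
    return 10 if next_state == TERMINAL else -1

def step(state, action, rng=random):
    """Sample one transition: (next_state, reward, done)."""
    probs = transition_probs(state, action)
    next_states = list(probs)
    weights = [probs[s] for s in next_states]
    next_state = rng.choices(next_states, weights=weights)[0]
    return next_state, reward(next_state), next_state == TERMINAL
```

Under this assumed layout, `transition_probs(1, "up")` gives state two a probability of about 0.7 and state one about 0.3, because the three slips from state one all run into a wall.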
Soon, we’ll take some first steps towards developing an algorithm that can determine the optimal policy for this small example. Then, later, you’ll be able to apply what you learn to much larger problems that look more like what you’d encounter in a real-world setting.