To understand how to go about searching for the best policy, it will help to have a running example. So consider this very small world and an agent who lives in it. Say the world is primarily composed of nice patches of grass, but two out of the nine locations in the world have large mountains. We'll think of each of these nine locations as states in the environment. At each point in time, let’s say the agent can only move up, down, left, or right, and can only take actions that don't lead it off the grid. Here, the arrows show the movements available to the agent. Let’s also say the goal of the agent is to get to the bottom right corner of the world as quickly as possible. We'll think of this as an episodic task, where an episode finishes when the agent reaches the goal, so we won’t have to worry about transitions away from the goal state.

Furthermore, say that the agent receives a reward of negative one for most transitions. But if an action leads it to encounter a mountain, it receives a reward of negative three. And if it reaches the goal state, it gets a reward of five. So we can think of the reward signal as punishing the agent for every timestep that it spends away from the goal. You can think of the mountains as having a special, larger punishment because they take even longer to cross than the patches of grass. This reward structure encourages the agent to get to the goal as quickly as possible. And when it reaches the goal, it gets a reward of five, and the episode ends. We’ll use this example in our search for the best policy.
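The environment described above can be sketched in a few lines of Python. This is a minimal, assumed implementation: the transcript says two of the nine cells contain mountains but not which two, so the mountain positions here are hypothetical, as are all the function and variable names.

```python
# A minimal sketch of the 3x3 gridworld described above.
GOAL = (2, 2)                 # bottom-right corner, ends the episode
MOUNTAINS = {(0, 1), (1, 2)}  # assumed positions of the two mountains

ACTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}

def available_actions(state):
    """Actions that keep the agent on the 3x3 grid."""
    row, col = state
    moves = {}
    for name, (dr, dc) in ACTIONS.items():
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            moves[name] = (r, c)
    return moves

def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    next_state = available_actions(state)[action]
    if next_state == GOAL:
        return next_state, 5, True    # reaching the goal ends the episode
    if next_state in MOUNTAINS:
        return next_state, -3, False  # mountains take longer to cross
    return next_state, -1, False      # ordinary patch of grass
```

For example, from the top-left corner `(0, 0)` only `down` and `right` are available, and with the assumed layout, `step((0, 0), "right")` walks into a mountain for a reward of negative three, while `step((2, 1), "right")` reaches the goal for a reward of five and ends the episode.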