2 – Another Gridworld Example

Let’s begin with a very small world and an agent who lives in it. The world is mostly nice patches of grass, but one of its four locations has a large mountain. We can think of each of these four locations as states in the environment.

At each time step, the agent can move up, down, left, or right, and can only take actions that don’t lead it off the grid. Here, the arrows show the movements we allow the agent. Let’s also say the goal of the agent is to reach the bottom right-hand corner of the world as quickly as possible. We’ll treat this as an episodic task in which an episode ends when the agent reaches the goal, so we won’t have to worry about transitions away from the goal state.

Furthermore, say the agent receives a reward of negative one for most transitions, but if an action leads it to the mountain, it receives a reward of negative three, and if it reaches the goal state, it receives a reward of five.

In the dynamic programming setting, the agent knows this reward structure and how transitions happen between states. The agent already knows everything about how the environment operates. So now what? How can the agent use this information to find the optimal policy?
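To make the setup concrete, here is a minimal Python sketch of this gridworld as a known model. The text above doesn’t say which of the four cells holds the mountain, so placing it in the top-right corner is an assumption, as are the names used here.

```python
# A minimal sketch of the 2x2 gridworld described above.
# States are (row, col) pairs; the goal is the bottom right-hand corner.
GOAL = (1, 1)
MOUNTAIN = (0, 1)  # assumption: the text doesn't specify the mountain's cell
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic dynamics the agent is assumed to know in full.

    Actions that would move the agent off the grid are not allowed,
    so they return None instead of a (next_state, reward) pair.
    """
    row, col = state
    d_row, d_col = ACTIONS[action]
    next_state = (row + d_row, col + d_col)
    if not (0 <= next_state[0] <= 1 and 0 <= next_state[1] <= 1):
        return None  # this action is not available in this state
    if next_state == GOAL:
        reward = 5       # reaching the goal ends the episode
    elif next_state == MOUNTAIN:
        reward = -3      # encountering the mountain is costly
    else:
        reward = -1      # most transitions cost a little time
    return next_state, reward

# Example: from the top-left corner, moving right runs into the mountain,
# while moving right from the bottom-left corner reaches the goal.
print(step((0, 0), "right"))  # ((0, 1), -3)
print(step((1, 0), "right"))  # ((1, 1), 5)
```

Because `step` encodes both the transition and reward structure, it plays the role of the known model that dynamic programming methods assume.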
