9 – MDPs, Part 1

So far, we’ve just started a conversation to set the stage for what we’d like to accomplish. We’ll use the remainder of this lesson to specify a rigorous definition for the reinforcement learning problem. For context, we’ll work with the example of a recycling robot from the Sutton textbook. So consider a robot that’s designed to pick up empty soda cans. The robot is equipped with arms to grab the cans and runs on a rechargeable battery. There’s a docking station set up in one corner of the room, and the robot has to sit at the station whenever it needs to recharge its battery. Say you’re trying to program this robot to collect empty soda cans without human intervention. In particular, you want the robot to be able to decide for itself when it needs to recharge its battery. And whenever it doesn’t need to recharge, you want it to focus on collecting as many soda cans as possible.

So let’s see if we can frame this as a reinforcement learning problem. We’ll begin with the actions. We’ll say the robot is capable of executing three potential actions: it can search the room for cans, it can head to the docking station to recharge its battery, or it can stay put in the hope that someone brings it a can. We refer to the set of possible actions as the action space, and it’s common to denote it with a script A.

All right, what about the states? Remember, the states are just the context provided to the agent for choosing intelligent actions. So the state, in this case, could be the charge left on the robot’s battery. For simplicity, we’ll assume that the battery is in one of two states: one corresponding to a high amount of charge left, and the other corresponding to a low amount of charge. We refer to the set of possible states as the state space, and it’s common to denote it with a script S.

Intuition tells us that if the robot has a high amount of charge left on its battery, we’d like it to know to actively search the room for cans. Searching the room uses up a lot of energy, but this doesn’t matter so much because the battery has a lot of charge anyway. But if the state is low, searching for cans carries pretty high risk, because the battery could get depleted mid-search. Then the robot would be stranded, and that wouldn’t be so good, because we don’t want to have to come to its rescue. So if the battery is low, maybe we’d like the robot to know to wait for a can or to go recharge its battery. In the next few concepts, we’ll set up the problem with the ultimate goal of having the robot learn this behavior.
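To make the setup concrete, here is a minimal sketch of how the recycling robot’s action space and state space might be encoded in Python. The names (`Action`, `State`, `ACTION_SPACE`, `STATE_SPACE`) are illustrative choices for this example, not part of any particular library.

```python
from enum import Enum


class Action(Enum):
    """The three actions available to the recycling robot."""
    SEARCH = "search"      # actively look around the room for cans
    RECHARGE = "recharge"  # head to the docking station to recharge
    WAIT = "wait"          # stay put and hope someone brings a can


class State(Enum):
    """The battery charge level, the only context the robot observes."""
    HIGH = "high"
    LOW = "low"


# The action space (script A) and state space (script S) as plain sets.
ACTION_SPACE = set(Action)
STATE_SPACE = set(State)

print(ACTION_SPACE)  # {Action.SEARCH, Action.RECHARGE, Action.WAIT}
print(STATE_SPACE)   # {State.HIGH, State.LOW}
```

Using enums keeps both finite sets explicit and easy to iterate over when the rest of the problem gets defined in the next few concepts.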
