10 – MDPs, Part 2

So we’re working with an example of a recycling robot, and we’ve already detailed the states and actions. Remember that the state corresponds to the charge left on the robot’s battery, and there are two potential states: high and low.

As a first step, consider the case where the charge on the battery is high. Then the robot could choose to search, wait, or recharge. But recharging doesn’t make much sense if the battery is already high, so we’ll say that the only options are to search or wait. If the agent chooses to search, then at the next time step the state could be high or low. Let’s say that with 70 percent probability it stays high, so there’s a 30 percent chance the battery switches to low. In both cases, we’ll say that this decision to search led to the robot collecting exactly four cans, and in line with this, the environment gives the agent a reward of four. The other option is to wait. If the robot has a high battery and decides to wait, waiting doesn’t use any battery at all, and we’ll say that it’s then guaranteed that the battery will again be high at the next time step. In this case, we’ll suppose that since the robot wasn’t out actively searching, it’s able to collect fewer cans, and say it delivered just one can. Again in line with this, the environment gives the agent a reward of one.

On to the case where the battery is low. Again, the robot has three options. If the battery is low and it chooses to wait for people to bring cans, that doesn’t use any battery, so the state at the next time step is going to be low. And just like when the robot decided to wait when the battery was high, the agent gets a reward of one. If the robot recharges, then it goes back to the docking station, and the state at the next time step is guaranteed to be high. Say it collects no cans along the way and gets a reward of zero. And if it searches, well, that’s risky. It’s possible that it gets away with this, and then at the next time step the battery is still low but not entirely depleted. But it’s probably more likely that the robot depletes its battery, has to be rescued, and is carried to a docking station to be charged, so the charge on its battery at the next time step is high. We’ll say the robot depletes its battery with 80 percent probability and otherwise gets away with that risky action with 20 percent probability. As for the reward, if the robot needs to be rescued, we want to make sure we’re punishing the robot in this case, so say we don’t look at all at the number of cans it was able to collect and just give the robot a reward of negative three. But if the robot gets away with it, it collects four cans and gets a reward of four.

This picture completely characterizes one method that the environment could use to decide the next state and reward at any point in time.
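If it helps to see it written down, here’s a minimal sketch of those one-step dynamics in Python. The names here (DYNAMICS, 'high', 'low', 'search', and so on) are just our own labels for this lesson, not part of any particular library; each state-action pair maps to a list of (probability, next state, reward) possibilities taken straight from the numbers above.

```python
# A minimal sketch of the recycling robot's one-step dynamics.
# Each (state, action) pair maps to a list of (probability, next_state, reward) triples.
DYNAMICS = {
    ('high', 'search'):   [(0.7, 'high', 4), (0.3, 'low', 4)],
    ('high', 'wait'):     [(1.0, 'high', 1)],
    ('low',  'wait'):     [(1.0, 'low', 1)],
    ('low',  'recharge'): [(1.0, 'high', 0)],
    ('low',  'search'):   [(0.8, 'high', -3), (0.2, 'low', 4)],
}

# Sanity check: the outcome probabilities for each state-action pair sum to one.
for (state, action), outcomes in DYNAMICS.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```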
To see this, let’s look at a concrete example. Say, for instance, the last state was high and the agent decided to search. Then the environment would flip a theoretical coin with 70 percent probability of landing heads. If that coin landed heads, the environment would decide that the next state was high, and the agent would get a reward of four. Otherwise, if it landed tails, the next state would be low and the reward would be four. As another example, if the last state was low and the agent decided to search, the environment would again flip a theoretical coin, now with 80 percent probability of landing heads. If it landed heads, the environment would decide the next state was high, and the agent would get a reward of negative three. Otherwise, if it landed tails, the next state would be low and the reward would be four.

But what’s important to emphasize here is how little information the environment uses to make decisions. It doesn’t care what situation was presented to the agent 10 or 100 or even two steps prior, and it doesn’t look at the actions that the agent took before the last one. How well the agent is doing, or how much reward it has collected, has no effect on how the environment chooses to respond to the agent. Of course, it’s possible to design environments that have much more complex procedures for interacting with the agent, but this is how it’s done in reinforcement learning, and you’ll soon see for yourself in your implementations just how powerful this framework is.
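And here’s one way that coin flip could look in code, again just a sketch: the function name env_step and the dictionary passed into it are illustrative choices of ours (it works with the DYNAMICS table sketched earlier). The point to notice is that only the current state and the action just taken go into deciding the next state and reward, which is exactly the small amount of information we said the environment uses.

```python
import random

def env_step(state, action, dynamics):
    """Sample (next_state, reward) from one-step dynamics.

    Only the current state and the chosen action enter the calculation;
    earlier states, earlier actions, and accumulated reward play no role.
    """
    outcomes = dynamics[(state, action)]          # list of (prob, next_state, reward)
    probs = [p for p, _, _ in outcomes]
    i = random.choices(range(len(outcomes)), weights=probs)[0]
    _, next_state, reward = outcomes[i]
    return next_state, reward

# The coin flip from the example: searching with a high battery keeps the battery
# high with probability 0.7, otherwise it drops to low; the reward is 4 either way.
coin_flip_case = {('high', 'search'): [(0.7, 'high', 4), (0.3, 'low', 4)]}
print(env_step('high', 'search', coin_flip_case))   # e.g. ('high', 4) or ('low', 4)
```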
