5 – Bellman Equations

If you take the time to calculate the value function for this policy yourself, you might notice that you don’t need to start your calculations from scratch every time. That is, you don’t need to look at the first state and add up all the rewards along the way, then look at the second state and add up all of its rewards, then the third, and so on. That turns out to be redundant effort, and there’s something much faster you can do. So, let’s erase most of these values, with the exception of the ones at the bottom, and see how we might work backwards to recalculate them. Along the way, we’ll discover that the value function has a nice recursive property.

To see that, let’s say we’re trying to calculate the value of the state I’ve highlighted here. That value is just the sum of the rewards that follow until we reach the terminal state. In this case, the agent starts in the state, follows the policy, gets a reward of negative one, and lands in the next state, which has a value of two. What’s important to notice is that this two already corresponds to the sum of all the rewards that follow, all the way to the end. So, instead of recalculating that sum, we can just use the value of two to get the value of the state as negative one plus two, or one. For the same reason, the next state we work backwards to has a value of negative one plus one, or zero, then negative three plus zero, or negative three, and so on. In this way, we see that we can express the value of any state as the sum of the immediate reward and the value of the state that follows.

What’s important to note is that, for simplicity, I set the discount rate in this example to one, but in general we want a framework that takes discounting into account. So, we’ll need to use the discounted value of the state that follows. We can express this idea in terms of what’s known as the Bellman expectation equation, where for a general MDP we have to calculate the expected value of this sum. This is because, in general, with more complicated worlds, the immediate reward and next state cannot be known with certainty. This equation is very important, and we’ll use it extensively in future lessons. But for now, all that’s important to remember is the main idea: we can express the value of any state in the MDP in terms of the immediate reward and the discounted value of the state that follows.
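In standard notation, the Bellman expectation equation for the state-value function of a policy $\pi$ can be written as

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,\right],$$

where $R_{t+1}$ is the immediate reward, $\gamma$ is the discount rate, and $S_{t+1}$ is the state that follows.

Below is a minimal sketch, in Python, of the backward calculation described above. It assumes a single deterministic episode whose rewards are already known; the function name `state_values` and its argument names are illustrative, not from the lesson.

```python
def state_values(rewards, gamma=1.0, tail_value=0.0):
    """Value of each visited state, computed backwards along one episode.

    rewards[i]  -- immediate reward received when leaving the i-th state
    gamma       -- discount rate (set to one in the walk-through above)
    tail_value  -- value of the state reached after the last reward
                   (zero for a terminal state)
    """
    values = [0.0] * len(rewards) + [tail_value]
    for i in reversed(range(len(rewards))):
        # value of a state = immediate reward + discounted value of the state that follows
        values[i] = rewards[i] + gamma * values[i + 1]
    return values


# Numbers from the walk-through: rewards of -3, -1, -1 leading into a state
# whose value (2) is already known, with a discount rate of one.
print(state_values([-3, -1, -1], gamma=1.0, tail_value=2.0))
# [-3.0, 0.0, 1.0, 2.0]
```

For a general MDP, where the immediate reward and the next state are random, the same backward step becomes an expectation over the possible outcomes, which is exactly what the Bellman expectation equation above expresses.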
