4 – TD Control Sarsa Part 1

In this video, we’ll discuss an algorithm that doesn’t need us to complete an entire episode before updating the Q-Table. Instead, we’ll update the Q-Table while the episode is unfolding. In particular, we’ll only need a very small time window of information to do an update: the current state and action, the reward that follows, and the next state and action.

So, here’s the setup. The current estimate for the value of selecting action right in state one is pulled from the Q-Table; it’s just six. So, what about the alternative estimate? Well, in the Monte Carlo case, we waited until the end of the episode and added up all the rewards that we got along the way. But if we’re working with just the small time window, we don’t have access to what happens at those later time steps. So, how might we form an alternative estimate with this limited information?

Well, here’s the idea. After we got the reward of negative one, we ended up in state two and selected action right. Our Q-Table actually already has an estimate for the return that’s likely to follow from that point onward. It’s just the estimated action value for state two and action right, which is eight. So, our alternative estimate can just be negative one plus eight, that is, the reward plus the value of the next state-action pair, which gives seven. If you need to pause the video and think about what we’ve just done, please, take your time here.

Then, just like in the Monte Carlo case, we can use this alternative estimate to update the Q-Table by just moving this value of six a little bit closer to seven. So, let’s say that we move this value to 6.2.

Then, at the next time step, we repeat the same process, where we update the entry in the Q-Table for state two and action right by just using the alternative estimate. The alternative estimate is just the reward we received plus the currently estimated value of the next state-action pair. So, in this case, we’ll move the value of eight a little bit closer to nine, which will yield a new value like 8.2.
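To make the arithmetic concrete, here is a minimal sketch of a single update in Python. It is not code from the video: the function name `sarsa_update`, the dictionary layout of the Q-Table, the step size `alpha = 0.2` (chosen because it moves six to 6.2 when the alternative estimate is seven), and the discount `gamma = 1.0` (no discounting, which matches "negative one plus eight") are all assumptions made for illustration.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.2, gamma=1.0):
    """Move Q(state, action) a little closer to the alternative estimate.

    alpha (step size) and gamma (discount) are assumed values; the
    video only says the entry moves "a little bit closer".
    """
    current = q[(state, action)]
    # Alternative estimate: reward plus the value of the next state-action pair.
    target = reward + gamma * q[(next_state, next_action)]
    # Step from the current estimate toward the alternative estimate.
    q[(state, action)] = current + alpha * (target - current)
    return target


# Q-Table entries taken from the example: (state, action) -> estimated value.
q_table = {(1, "right"): 6.0, (2, "right"): 8.0}

# In state one we chose right, received -1, landed in state two, chose right.
target = sarsa_update(q_table, 1, "right", -1.0, 2, "right")
print(target, round(q_table[(1, "right")], 3))
```

With these assumed settings, the alternative estimate comes out as seven and the entry for state one and action right moves from six to 6.2, while the entry for state two and action right is left unchanged until its own update at the next time step.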
This video illustrates the main idea behind the first method we’ll use for temporal difference control, and we’ll soon dive into the details.
