5 – TD Control: Sarsa (Part 2)

We began this lesson by reviewing Monte Carlo control. Recall the corresponding update equation: to use it, we sample a complete episode, then look up the current estimate in the Q-table and compare it to the return that we actually experienced after visiting the state-action pair. We use that return to make our Q-table a little more accurate.

Then, you learned how to change the update equation to use only a very small time window of information. Instead of using the return as an alternative estimate for updating the Q-table, we use the sum of the immediate reward and the discounted value of the next state-action pair. In the small grid world example we assumed gamma was equal to one, but this need not be the case for a general MDP.

This yields a new control method that we can use for both continuing and episodic tasks. With the exception of this new update step, it is identical to what we did in the Monte Carlo case. In particular, we use an epsilon-greedy policy to select actions at every time step. The only real difference is that we update the Q-table at every time step instead of waiting until the end of the episode, and as long as we specify appropriate values for epsilon, the algorithm is guaranteed to converge to the optimal policy.

The name of this algorithm is Sarsa(0), also known as Sarsa for short. The name comes from the fact that each action-value update uses a (state, action, reward, next state, next action) tuple of interaction.
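The loop described above can be sketched in code. The update applied at every step is Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]. The environment below (the `step` function, the corridor layout, and all hyperparameter values) is a made-up illustration, not the grid world from the lesson:

```python
# Minimal Sarsa(0) sketch on a hypothetical corridor MDP:
# states 0..3, actions 0 (left) / 1 (right), reward -1 per step,
# episode terminates on reaching state 3.
import random
from collections import defaultdict

def step(state, action):
    # Assumed deterministic dynamics for this toy example.
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = -1.0
    done = next_state == 3
    return next_state, reward, done

def epsilon_greedy(Q, state, n_actions, epsilon):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def sarsa(n_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, n_actions=2):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state = 0
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            if done:
                # Terminal transition: no bootstrap term.
                Q[(state, action)] += alpha * (reward - Q[(state, action)])
            else:
                next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
                # Sarsa update: bootstrap from the (next state, next action) pair,
                # i.e. the Q-table is updated at every time step.
                target = reward + gamma * Q[(next_state, next_action)]
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
    return Q
```

Note that the action used in the bootstrap term is the one the epsilon-greedy policy will actually take next, which is what makes Sarsa an on-policy method.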
