7 – TD Control: Expected Sarsa

So far, you’ve implemented Sarsa and Sarsamax, and we’ll now discuss one more option. This new option is called Expected Sarsa, and it closely resembles Sarsamax: the only difference is in the update step for the action value. Remember that Sarsamax (or Q-learning) takes the maximum over the action values of all possible next state-action pairs. In other words, it chooses what value to use in the update by plugging in the one action that maximizes the action-value estimate corresponding to the next state. Expected Sarsa does something a bit different: it uses the expected value of the next state-action pair, where the expectation takes into account the probability that the agent selects each possible action from the next state. Over the next couple of concepts, you’ll write your own implementation of Expected Sarsa.
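Written out, the Expected Sarsa update replaces the max in the Q-learning target with an expectation over the policy’s action probabilities:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big( R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big)
$$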
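As a minimal sketch of how that expectation is computed in code, here is one way to write a single update step, assuming a tabular NumPy array `Q` and an epsilon-greedy policy. The function name and parameters (`nA` for the number of actions, etc.) are illustrative, not from the lesson code:

```python
import numpy as np

def expected_sarsa_update(Q, state, action, reward, next_state,
                          alpha, gamma, epsilon, nA):
    """One Expected Sarsa update for a tabular Q of shape
    (n_states, n_actions), under an epsilon-greedy policy.
    All names here are hypothetical sketch choices."""
    # Action probabilities under epsilon-greedy: every action gets
    # epsilon / nA, and the greedy action gets the remaining mass.
    probs = np.full(nA, epsilon / nA)
    probs[np.argmax(Q[next_state])] += 1.0 - epsilon
    # Expected action value of the next state under the current policy.
    expected_q = np.dot(probs, Q[next_state])
    # Standard TD update toward the expected target.
    Q[state][action] += alpha * (reward + gamma * expected_q - Q[state][action])
    return Q
```

Compare this with Sarsamax, which would use `np.max(Q[next_state])` in place of `expected_q`; the expectation averages over all actions the policy might take, which tends to reduce the variance of the update.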
