6 – TD Control: Sarsamax

So far, you already have one algorithm for temporal difference control. Remember that in the Sarsa algorithm, we begin by initializing all action values to zero and constructing the corresponding epsilon-greedy policy. Then the agent begins interacting with the environment and receives the first state. Next, it uses the policy to choose its action. Immediately afterward, it receives a reward and next state. Then the agent again uses the same policy to pick the next action. After choosing that action, it updates the action value corresponding to the previous state-action pair, and improves the policy to be epsilon-greedy with respect to the most recent estimate of the action values.

For the remainder of this video, we’ll build off this algorithm to design another control algorithm that works slightly differently. This algorithm is called Sarsamax, but it’s also known as Q-Learning. We’ll still begin with the same initial values for the action values and the policy. The agent receives the initial state, and the first action is still chosen from the initial policy. But then, after receiving the reward and next state, we’re going to do something else. Namely, we’ll update the action value, and with it the policy, before choosing the next action. And can you guess which action makes sense to plug into that update?

Well, in the Sarsa case, our update step came one step later and plugged in the action that was selected using the epsilon-greedy policy. And at every step of the algorithm, all of the actions we used for updating the action values exactly coincided with those that were experienced by the agent. But in general, this does not have to be the case. In particular, consider using the action from the greedy policy instead of the epsilon-greedy policy. This is in fact what Sarsamax, or Q-Learning, does.
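Written out for the first time step, the two choices of action look roughly like this. This is a sketch: the step size \alpha and the discount rate \gamma are not named in this transcript and are assumed here from the usual form of these updates.

```latex
% Sarsa: A_1 is the action actually selected by the epsilon-greedy policy.
\[
Q(S_0, A_0) \leftarrow Q(S_0, A_0)
  + \alpha \left( R_1 + \gamma \, Q(S_1, A_1) - Q(S_0, A_0) \right)
\]

% Greedy variant: plug in the greedy action for S_1 instead of A_1.
\[
Q(S_0, A_0) \leftarrow Q(S_0, A_0)
  + \alpha \left( R_1 + \gamma \, Q\bigl(S_1, \arg\max_{a} Q(S_1, a)\bigr) - Q(S_0, A_0) \right)
\]
```

The only difference between the two lines is which action is evaluated in the next state.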
And in this case, you can rewrite the update equation in terms of the maximum action value for the next state, where we rely on the fact that the greedy action corresponding to a state is just the one that maximizes the action values for that state. So what happens is that after we update the action value for time step zero using the greedy action, we then select A1 using the epsilon-greedy policy corresponding to the action values we just updated. And this continues when we receive a reward and next state. Then we do the same thing we did before: we update the value corresponding to S1 and A1 using the greedy action, then we select A2 using the corresponding epsilon-greedy policy.

To understand precisely what this update step is doing, we’ll compare it to the corresponding step in the Sarsa algorithm. In Sarsa, the update step pushes the action values closer to evaluating whatever epsilon-greedy policy is currently being followed by the agent. And it’s possible to show that Sarsamax instead directly attempts to approximate the optimal value function at every time step. Soon, you’ll have the chance to implement this yourself and directly examine the difference between these two algorithms.
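As a rough illustration of the two update steps compared above, here is a minimal Python sketch. The function names, the list-of-lists table `Q`, and the parameters `alpha` (step size) and `gamma` (discount rate) are assumptions made for this example; none of them come from the video.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest action value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Sarsa: the target uses the action the agent actually selected next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def sarsamax_update(Q, s, a, r, s_next, alpha, gamma):
    """Sarsamax (Q-Learning): the target uses the greedy action in s_next,
    which is the same as taking the maximum action value for that state."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


if __name__ == "__main__":
    # Two states, two actions; state 1 already has non-zero estimates.
    q_sarsa = [[0.0, 0.0], [1.0, 3.0]]
    q_max = [[0.0, 0.0], [1.0, 3.0]]

    # Same transition (s=0, a=0, r=1.0, s_next=1) for both algorithms.
    # Suppose the epsilon-greedy policy happened to explore and chose a_next=0.
    sarsa_update(q_sarsa, 0, 0, 1.0, 1, 0, alpha=0.5, gamma=1.0)
    sarsamax_update(q_max, 0, 0, 1.0, 1, alpha=0.5, gamma=1.0)

    print(q_sarsa[0][0], q_max[0][0])

    # In Sarsamax the next action is chosen only after the update,
    # from the epsilon-greedy policy for the freshly updated values.
    a_next = epsilon_greedy(q_max[1], epsilon=0.1)
```

In this toy transition the Sarsa target follows the exploratory action, while the Sarsamax target ignores it and uses the best available action value in the next state, so the two tables end up with different estimates for the same experience.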
