14 – DDPG: Deep Deterministic Policy Gradient (Continuous Action Space)

DDPG is a different kind of actor-critic method. In fact, it could be seen as an approximate DQN rather than a true actor-critic, because the critic in DDPG is used to approximate the maximizer over the Q-values of the next state, not as a learned baseline the way we have seen so far. Still, it is a very important algorithm and worth discussing in more detail.

One limitation of the DQN agent is that it is not straightforward to use in continuous action spaces. Imagine a DQN network that takes in a state and outputs the action-value function. For example, with two actions, say up and down, Q(s, "up") gives you the estimated expected value of selecting the up action in state s, say -2.18, and Q(s, "down") gives you the estimated expected value of selecting the down action in state s, say 8.45. To find the maximum action value for this state, you just take the max of these values. Pretty easy. The max operation is easy here because this is a discrete action space. Even if you had more actions, say left, right, jump, and so on, you would still have a discrete action space; even if it were high-dimensional with many, many more actions, it would still be doable.

But what if you need an action with a continuous range? How do you get the value of a continuous action with this architecture? Say you want the jump action to be continuous, a variable between 1 and 100 centimeters. How do you find the value of jumping, say, 50 centimeters? This is one of the problems DDPG solves.

In DDPG, we use two deep neural networks. We can call one the actor and the other the critic. Nothing new to this point. The actor here is used to approximate the optimal policy deterministically; that means we always want it to output the best believed action for any given state. This is unlike stochastic policies, where we want the policy to learn a probability distribution over the actions. In DDPG, we want the best believed action every single time we query the actor network. That is a deterministic policy. The actor is basically learning argmax_a Q(s, a), which is the best action. The critic learns to evaluate the optimal action-value function by using the actor's best believed action. Then we use this actor, which is an approximate maximizer, to calculate a new target value for training the action-value function, much in the way DQN does.
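
To make the two-network setup concrete, here is a minimal sketch in PyTorch. The class names, layer sizes, and the tanh output range are illustrative assumptions rather than anything fixed by the lecture; the point is only that the actor maps a state directly to a single action vector, while the critic scores a given state-action pair.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy: maps a state to one best-believed action."""

    def __init__(self, state_size, action_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_size), nn.Tanh(),  # assumed action range [-1, 1]
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """Q-network: estimates Q(s, a) for the state-action pair it is given."""

    def __init__(self, state_size, action_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size + action_size, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

Because the actor has no sampling step, querying it twice with the same state returns the same action, which is exactly the deterministic behavior described above.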

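To show how the critic's target is formed, here is a hedged sketch of one learning step. It assumes a replay buffer that provides batched tensors (states, actions, rewards, next_states, dones), target copies actor_target and critic_target of the two networks, and optimizer objects; all of those names are illustrative, and the update of the target networks themselves is omitted.

```python
import torch
import torch.nn.functional as F


def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt,
                states, actions, rewards, next_states, dones,
                gamma=0.99):
    """One illustrative DDPG learning step on a batch of replayed experience."""
    # Critic target: the target actor stands in for argmax_a Q(s', a),
    # mirroring the DQN target with the max replaced by the actor's
    # best-believed next action.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_next = critic_target(next_states, next_actions)
        q_targets = rewards + gamma * (1.0 - dones) * q_next

    # Critic update: regress Q(s, a) toward the bootstrapped target.
    q_expected = critic(states, actions)
    critic_loss = F.mse_loss(q_expected, q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: push the deterministic policy toward actions the
    # critic scores highly, i.e. approximate argmax_a Q(s, a).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

The key line is the target: instead of taking a max over discrete Q-values as DQN does, the target actor proposes the action for the next state and the target critic evaluates it.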