15 – M3 L5 15 DDPG Deep Deterministic Policy Gradient Soft Updates V1

Two other interesting aspects of DDPG are, first, the use of a replay buffer and, second, the soft updates to the target networks. You already know how the replay buffer works; I just wanted to mention that DDPG uses one. But the soft updates are a bit different. In DQN, you have two copies of your network weights: the regular network and the target network. In the Atari paper in which DQN was introduced, the target network is updated every 10,000 time steps: you simply copy the weights of your regular network into your target network. That is, the target network is fixed for 10,000 time steps and then it gets a big update. In DDPG, you have two copies of your network weights for each network: a regular for the actor, a regular for the critic, a target for the actor, and a target for the critic. But in DDPG, the target networks are updated using a soft update strategy. A soft update strategy consists of slowly blending your regular network weights into your target network weights. So, every time step, you make your target network be 99.99 percent of your target network weights and only 0.01 percent of your regular network weights. You are slowly mixing your regular network weights into your target network weights. Recall, the regular network is the most up-to-date network because it's the one we're training, while the target network is the one we use for prediction to stabilize training. In practice, you'll get faster convergence by using this update strategy, and in fact, this way of updating the target network weights can be used with other algorithms that use target networks, including DQN.
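To make the blending concrete, here is a minimal sketch of the two update styles in plain Python. It is an illustration, not the implementation from any paper or library. The names `soft_update`, `hard_update`, and `tau` are assumptions introduced here, and weights are stored as dictionaries of float lists purely to keep the example dependency-free. A `tau` of 0.0001 corresponds to the 0.01 percent mentioned above, so each step computes target = (1 - tau) * target + tau * regular.

```python
def soft_update(target, regular, tau=1e-4):
    """Blend the regular weights into the target weights, in place.

    Each target weight becomes (1 - tau) * target + tau * regular.
    With tau = 1e-4 the target keeps 99.99 percent of its old value.
    `tau` and the dict-of-lists layout are assumptions for illustration.
    """
    for name, regular_weights in regular.items():
        target[name] = [
            (1.0 - tau) * t + tau * r
            for t, r in zip(target[name], regular_weights)
        ]
    return target


def hard_update(target, regular):
    """Copy the regular weights into the target, as in the periodic DQN update."""
    for name, regular_weights in regular.items():
        target[name] = list(regular_weights)
    return target


# Hypothetical toy weights: the target starts at 0.0, the regular network at 1.0.
regular_actor = {"layer1": [1.0, 1.0]}
target_actor = {"layer1": [0.0, 0.0]}

# Called once per time step in the soft-update scheme.
soft_update(target_actor, regular_actor, tau=1e-4)
```

In DDPG the same call would be made once per time step for both the actor and the critic target networks. After a single step the toy target weights have moved only a tiny fraction of the way toward the regular weights, which is the slow mixing described above, whereas `hard_update` jumps all the way at once.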
