A3C: Asynchronous Advantage Actor-Critic, Parallel Training

Unlike DQN, A3C does not use a replay buffer. The main reason we needed a replay buffer was so that we could decorrelate experience tuples. Let me explain. In reinforcement learning, an agent collects experience in a sequential manner. The experience collected at time step t+1 will be correlated with the experience collected at time step t, because it is the action taken at time step t that is partially responsible for the reward and the state observed at time step t+1, and these will influence all future decisions. There is no way around that. The replay buffer allows us to collect these experiences sequentially, adding them one at a time for later processing. Then, independently of the data-collection process, we can randomly sample experiences from the replay buffer into mini-batches. The experiences in these mini-batches will not show the same correlation, and this allows us to train our network successfully. Interestingly, A3C replaces the replay buffer with parallel training. By creating multiple instances of the environment and agent and running them all at the same time, your agent receives mini-batches of decorrelated experiences, just as we need. Samples will be decorrelated because the agents will likely be experiencing different states at any given time. Cool, right? On top of that, this way of training allows us to use on-policy learning in our learning algorithm, which is often associated with more stable learning.
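To make the parallel-data-collection idea concrete, here is a minimal sketch of the structure only, not the full A3C algorithm: there is no actor-critic network or gradient update, and ToyEnv, worker, n_steps, and n_rollouts are all hypothetical names introduced for illustration. Several workers, each with their own environment instance, collect short on-policy rollouts at the same time and hand them to a single learner as they arrive.

```python
import random
import threading
import queue

# Toy stand-in for a real environment (hypothetical; a real setup would use
# something like a Gym environment). Each worker owns its own copy.
class ToyEnv:
    def reset(self):
        self.state = random.random()
        return self.state

    def step(self, action):
        reward = 1.0 if action == round(self.state) else 0.0
        self.state = random.random()           # next state
        done = random.random() < 0.1           # episode ends ~10% of the time
        return self.state, reward, done

def worker(worker_id, batch_queue, n_steps=5, n_rollouts=20):
    """Collect short on-policy rollouts and hand them to the learner."""
    env = ToyEnv()
    state = env.reset()
    for _ in range(n_rollouts):
        rollout = []
        for _ in range(n_steps):
            action = random.randint(0, 1)      # placeholder for the policy
            next_state, reward, done = env.step(action)
            rollout.append((state, action, reward, next_state, done))
            state = env.reset() if done else next_state
        batch_queue.put((worker_id, rollout))  # fresh, on-policy experience

if __name__ == "__main__":
    batch_queue = queue.Queue()
    n_workers = 4
    threads = [threading.Thread(target=worker, args=(i, batch_queue))
               for i in range(n_workers)]
    for t in threads:
        t.start()

    # The "learner" consumes rollouts as they arrive. Because the workers are
    # at different points of different episodes, consecutive updates are built
    # from different states, which is what decorrelates the training data.
    for _ in range(n_workers * 20):
        worker_id, rollout = batch_queue.get()
        print(f"update from worker {worker_id}: {len(rollout)} transitions")

    for t in threads:
        t.join()
```

The key design point mirrors the lesson: no transition is stored for later reuse. Each rollout is generated by the current behavior and consumed once, which is what keeps the learning on-policy, while the mixing of rollouts from independently running environments replaces the decorrelation the replay buffer used to provide.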
