A3C stands for Asynchronous Advantage Actor-Critic. As you can probably infer from the name, we’ll be calculating the advantage function, A-pi, and the critic will be learning to estimate V-pi to help with that, just as before. If you’re using images as inputs to your agent, A3C can use a single convolutional neural network with the actor and critic sharing weights, in two separate heads, one for the actor, and one for the critic. Note that A3C does not have to be used exclusively with CNNs and images. But if you were to use it, sharing weights is a more efficient, though more complex, approach, and can be harder to train. It’s a good idea to start with two separate networks, and change it only to improve performance.

Now, one interesting aspect of A3C is that instead of using a TD estimate, it will use another estimate commonly referred to as n-step bootstrapping. N-step bootstrapping is simply an abstraction and a generalization of the TD and Monte-Carlo estimates. TD is one-step bootstrapping. Your agent goes out and experiences one time step of real rewards, and then bootstraps right there. Monte-Carlo goes out all the way, and it does not bootstrap because it doesn’t need to. The Monte-Carlo estimate is, in a sense, infinite-step bootstrapping. But how about going more than one step, but not all the way out? Can we do two time steps of real reward, and then bootstrap from the second next state? Can we do three? How about four or more? We sure can. This is what is called n-step bootstrapping, and A3C uses this return to train the critic.

For example, in our tennis example, n-step bootstrapping means that you will wait a little bit before guessing what the final score will look like. Waiting to experience the environment for a little longer before you calculate the expected return of the original state allows you to have less bias in your prediction, while keeping variance under control. In practice, only a few steps out, say four- or five-step bootstrapping, often works best.
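To make this concrete, here is a minimal sketch of the n-step return calculation: n steps of real discounted rewards, then a bootstrap from the critic's value estimate of the state reached after those n steps. The function name, arguments, and discount value are illustrative, not from any particular library.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Sum n steps of real discounted rewards, then bootstrap
    from the critic's value estimate of the n-th next state.

    rewards: list of the n real rewards observed, in order
    bootstrap_value: critic's estimate V(s_{t+n})
    gamma: discount factor
    """
    g = bootstrap_value
    # Work backwards so each reward is discounted the right number of times
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With n = 1 this reduces to the familiar TD target: r + gamma * V(s')
td_target = n_step_return([1.0], bootstrap_value=0.5, gamma=0.9)  # 1.0 + 0.9 * 0.5 = 1.45

# With n = 4: four real rewards, then bootstrap from V(s_{t+4})
g4 = n_step_return([1.0, 0.0, 0.0, 1.0], bootstrap_value=0.5, gamma=0.9)
```

Setting the bootstrap value to zero and passing the whole episode's rewards recovers the Monte-Carlo return, which is what makes n-step bootstrapping a generalization of both estimates.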
By using n-step bootstrapping, A3C propagates values to the last n states visited, which allows for faster convergence with less experience required, while still keeping variance under control.