You now know that the Monte-Carlo estimate is unbiased but has high variance, and that the TD estimate has low variance but is biased. What are these facts good for? When you studied REINFORCE, you learned that the return G was calculated as the total discounted return. This way of calculating G, which is simply a Monte-Carlo estimate, has high variance. You then used a baseline to reduce the variance of the REINFORCE algorithm. However, this baseline was also calculated using a Monte-Carlo approach.

Let's now assume you use deep learning to learn this baseline. Even if you still use the Monte-Carlo approach, which has high variance, using function approximation gives you an advantage: you gain the power of generalization. That means when you encounter a new state s', whether you have visited it or not, your deep neural network will potentially come up with better estimates, since it's been trained to generalize from similar data.

Note that at this point we're still not using a critic, even though we are using function approximation. This might be confusing, as the literature is often not consistent. Recall that a Monte-Carlo estimate has high variance and no bias, and that a TD estimate has low variance but some bias. The word "critic" implies that bias has been introduced, and the Monte-Carlo estimate is unbiased. If, instead of using Monte-Carlo estimates to train the baseline, we used TD estimates, then we could say we have a critic. Sure, we would be introducing bias, but we would also be reducing variance, thus improving our convergence properties and speeding up learning.

In actor-critic methods, all we are trying to do is continue to reduce the high variance commonly associated with policy-based agents. By using a TD critic instead of a Monte-Carlo baseline, we further reduce the variance of policy-based methods.
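The Monte-Carlo return and baseline discussed above can be sketched in a few lines. This is an illustrative sketch, not the course's exact code: `discounted_returns` is the Monte-Carlo estimate of G, and `values` is a hypothetical stand-in for a learned baseline network's state-value predictions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo estimate: total discounted return G_t from each step."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns

rewards = [1.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)  # → [1.25, 0.5, 1.0]

# Subtracting a baseline: advantage A_t = G_t - V(s_t).
# These V values are made up for illustration; in practice a network predicts them.
values = [1.0, 0.5, 0.8]
advantages = [g - v for g, v in zip(returns, values)]
```

Note that subtracting the baseline changes none of the expectations involved, which is why the gradient estimate stays unbiased; it only shrinks the spread of the per-step terms.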
This leads to faster learning than policy-based agents alone, and we also see better and more consistent convergence than value-based agents alone.
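The TD critic described above can be sketched as follows. This is a minimal, hypothetical sketch: instead of waiting for the full Monte-Carlo return G, the critic bootstraps a one-step target from its own value estimate of the next state, which is what introduces the bias while cutting the variance.

```python
def td_target(reward, value_next, gamma=0.99, done=False):
    """Bootstrapped target r + gamma * V(s'): biased, but low variance."""
    return reward + (0.0 if done else gamma * value_next)

def td_error(reward, value, value_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s).

    In actor-critic methods this TD error commonly replaces the
    Monte-Carlo advantage G_t - V(s_t) in the policy-gradient update.
    """
    return td_target(reward, value_next, gamma, done) - value
```

For example, with gamma = 0.5, a reward of 1.0, V(s) = 0.5 and V(s') = 1.0, the TD error is 1.0 + 0.5 * 1.0 - 0.5 = 1.0. The critic is trained to shrink this error, and the actor uses it as its advantage signal, available at every step instead of only at the end of an episode.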