Let’s talk about bias and variance. In machine learning, we’re often presented with a trade-off between bias and variance. Let me give you some intuition first. Say you’re practicing your soccer shooting skills. What you want to do is put the ball in the top-right corner of the goal, and you want to be able to repeatedly kick the ball there. If, after a day of training, you place the ball most of the time in the middle right, that means you have a bias to shoot the ball lower. It also means you have low variance, because the shots were clumped together. Now, say the average of your shots was centered on the top-right corner, but most of your shots were spread around that spot. Then you have low bias, because you were mostly centered where you were aiming, and high variance, because of the spread. Obviously, you want to avoid both high bias and high variance; ideally, you want both low bias and low variance. The thing is, this is very hard to achieve, but we’ll look at several techniques designed to accomplish it.

We have to consider the bias-variance trade-off in reinforcement learning when an agent tries to estimate value functions or policies from returns. A return is calculated from a single trajectory. However, value functions, which are what we’re trying to estimate, are defined as the expectation of returns. A big part of the effort in reinforcement learning research is the attempt to reduce the variance of algorithms while keeping bias to a minimum. You know by now that a reinforcement learning agent tries to find policies that maximize the total expected reward. But since we’re limited to sampling the environment, we can only estimate these expectations. The question is, what’s the best way to estimate value functions for our actor-critic methods?
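To make the sampling idea concrete, here’s a minimal sketch under toy assumptions (the “environment” below is a hypothetical stand-in that just emits noisy rewards, not anything from the text): a single trajectory’s return is an unbiased but noisy estimate of the value, while averaging the returns of many trajectories keeps the estimate unbiased and shrinks its variance.

```python
import random
import statistics

def sample_return(gamma=0.99, n_steps=10, rng=random):
    # One toy "trajectory": each step's reward is a noisy sample around
    # a true mean of 1.0 (illustrative numbers, not a real environment).
    # The return is the discounted sum of those rewards.
    return sum((gamma ** t) * rng.gauss(1.0, 1.0) for t in range(n_steps))

rng = random.Random(0)

# A single return: its expectation equals the true value, but any one
# sample can land far from it (low bias, high variance).
single = sample_return(rng=rng)

# Averaging many sampled returns is the Monte Carlo way to estimate the
# value function: still unbiased, and the spread of the mean shrinks
# roughly as 1/sqrt(n).
returns = [sample_return(rng=rng) for _ in range(10_000)]
estimate = statistics.mean(returns)   # close to sum(gamma**t * 1.0)
spread = statistics.stdev(returns)    # variance of a single return stays large
```

The point of the sketch is only the contrast: `single` and `estimate` target the same quantity, but one is a lone sample and the other is an average, which is exactly the expectation-versus-sample gap the trade-off techniques try to manage.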