Let’s explore two very distinct and complementary ways of estimating expected returns. On the one hand, you have the Monte-Carlo estimate. The Monte-Carlo estimate consists of rolling out an episode and calculating the discounted total reward from the reward sequence. For example, in episode A, you start in state S_t and take action A_t. The environment then transitions, gives you a reward R_{t+1}, and sends you to a new state S_{t+1}. Then you continue with a new action A_{t+1}, and so on until you reach the end of the episode. The Monte-Carlo estimate just adds all those rewards up, whether discounted or not. When you then have a collection of episodes A, B, C, and D, some of those episodes will have trajectories that go through the same states. Each of these episodes can give you a different Monte-Carlo estimate for the same value function. To calculate the value function, all you need to do is average the estimates. Obviously, the more estimates you have when taking the average, the better your value function will be.

On the other hand, you have the temporal-difference, or TD, estimate. Say we’re estimating a state-value function V. For estimating the value of the current state, it uses a single reward sample and an estimate of the discounted total return the agent will obtain from the next state onwards. So, you’re estimating with an estimate. For example, in episode A, you start in state S_t and take action A_t; the environment then transitions, gives you a reward R_{t+1}, and sends you to a new state S_{t+1}. But then you can actually stop right there. By the magic of dynamic programming, you are allowed to do what is called bootstrapping, which basically means that you can leverage the estimate you currently have for the next state in order to calculate a new estimate for the value function of the current state. Now, the estimate of the next state will probably be off, particularly early on, but that value will become better and better as your agent sees more data, in turn making other values better. Clever, right? After doing this many, many times, you will have estimated the desired value function well.

As you can imagine, Monte-Carlo estimates will have high variance because estimates for a state can vary greatly across episodes. G_t,A here could be -100, while G_t,B could be +100, and G_t,C +1,000. The reason this high variance is likely is that you are compounding lots of random events that happen during the course of a single episode. But Monte-Carlo methods are unbiased: you are not estimating using estimates; you are only using the true rewards you obtained. So, given lots and lots of data, your estimates will be accurate. TD estimates are low variance because you’re only compounding a single time step of randomness instead of a full rollout. Though, because you’re bootstrapping on the next state’s estimate, and that is not a true value, you’re adding bias into your calculations. Your agent will learn faster, but will have more problems converging.
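To make the contrast concrete, here is a minimal Python sketch of both estimators for a state-value function V. It is not code from this lecture; the episode format (a list of (state, reward) pairs, where the reward is the R_{t+1} received after acting in that state), the function names, and the toy numbers are assumptions made purely for illustration.

```python
import numpy as np

def mc_value_estimate(episodes, n_states, gamma=1.0):
    """Monte-Carlo: average the full discounted returns G_t observed for each state."""
    returns_sum = np.zeros(n_states)
    visits = np.zeros(n_states)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return from each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            visits[state] += 1
    return returns_sum / np.maximum(visits, 1)  # average the per-episode estimates

def td_value_estimate(episodes, n_states, gamma=1.0, alpha=0.1):
    """TD(0): move each state's value toward the bootstrapped target R_{t+1} + gamma * V(S_{t+1})."""
    V = np.zeros(n_states)
    for episode in episodes:
        for t, (state, reward) in enumerate(episode):
            # Bootstrap on the current estimate of the next state (0 if this step ends the episode).
            next_value = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
            td_target = reward + gamma * next_value
            V[state] += alpha * (td_target - V[state])
    return V

# Hypothetical toy data: two short episodes through a 4-state problem.
episodes = [
    [(0, 0.0), (1, 0.0), (2, 1.0)],   # episode A ends with reward +1
    [(0, 0.0), (1, 0.0), (3, -1.0)],  # episode B ends with reward -1
]
print(mc_value_estimate(episodes, n_states=4))  # unbiased: built only from observed rewards
print(td_value_estimate(episodes, n_states=4))  # lower variance, but biased by the bootstrapped V
```

Note how the Monte-Carlo function never touches its own estimates, only real rewards, while the TD function reaches into V for the next state, which is exactly where both the bias and the faster learning come from.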