There is another way of estimating expected returns, called the lambda return. The intuition goes this way. Say after you try n-step bootstrapping, you realize that values of n larger than one often perform better. But it's still hard to tell what the number should be. Should it be two, three, six, or something else? To make the decision even more difficult, in some problems small values of n are better, while in others, large values of n are better. How do you get this right?

The idea of the lambda return is to create a mixture of all n-step bootstrapping estimates at once. Lambda is a hyperparameter used for weighting the contribution of each n-step estimate to the lambda return. Say you set lambda to 0.5. The lambda return would then be a combination of all n-step returns, each weighted by an exponentially decaying factor across the different n-step estimates. Notice how the weights depend on the value of lambda you set, and decay exponentially at the rate of that value. So, to calculate the lambda return for state s at time step t, we take all n-step returns, multiply each n-step return by its corresponding weight, and then add them all up. That sum is the lambda return for state s at time step t.

Interestingly, when lambda is set to zero, the weights of the two-step, three-step, and all n-step returns other than the one-step return are equal to zero. So the lambda return with lambda set to zero is equivalent to the TD estimate. If lambda is set to one, the weights of all n-step returns other than the infinite-step return are equal to zero. So the lambda return with lambda set to one is equivalent to the Monte Carlo estimate. And a value between zero and one gives a mixture of all n-step bootstrapping estimates. Isn't it amazing that a single algorithm can do all that? I think it's beautiful.

Generalized advantage estimation (GAE) is a way to train the critic with this lambda return. You fit the advantage function, just as in A3C and A2C, but using a mixture of n-step bootstrapping estimates. It's important to highlight that this type of return can be combined with virtually any policy-based method. In fact, in the paper that introduced GAE, TRPO was the policy-based method used. By using this type of estimation, the combined algorithm, TRPO plus GAE, trains very quickly, because value information is spread across every time step by the lambda-return-style estimate.
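To make the weighting concrete, here is a minimal sketch in Python. The function name and interface are my own for illustration; it assumes the standard truncated lambda return, where the finite n-step returns get weight (1 - lambda) * lambda^(n-1) and the longest available return absorbs all the remaining weight, so the weights still sum to one:

```python
def lambda_return(rewards, values, t, gamma=0.99, lam=0.5):
    """Lambda return for the state at time step t.

    rewards: rewards[k] is the reward received after the action at step k.
    values:  values[k] is the current estimate V(s_k); values has one extra
             entry for the final state (use 0.0 if that state is terminal).
    """
    # n-step return: sum of n discounted rewards, then bootstrap from V
    def n_step_return(n):
        ret = sum(gamma**k * rewards[t + k] for k in range(n))
        return ret + gamma**n * values[t + n]

    max_n = len(rewards) - t           # longest n-step return available
    g = 0.0
    for n in range(1, max_n):          # finite n-step returns
        g += (1 - lam) * lam**(n - 1) * n_step_return(n)
    # the final (Monte Carlo-like) return gets all the remaining weight
    g += lam**(max_n - 1) * n_step_return(max_n)
    return g
```

You can check the two limits discussed above directly: with lam=0 only the one-step term survives, giving the TD estimate, and with lam=1 every finite-n weight vanishes and only the full return remains, giving the Monte Carlo estimate.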
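And here is a similarly hedged sketch of the GAE computation itself (again, the names are mine; the text above only describes the idea). GAE is commonly computed from the one-step TD errors, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated backward with decay gamma * lambda, which yields the exponentially weighted mixture of n-step advantage estimates:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one episode.

    rewards: length-T list of rewards.
    values:  length-(T+1) list of V(s_t) estimates; values[T] is the value
             of the final state (0.0 if terminal).
    Returns a length-T list of advantage estimates A_t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    # Walk backward: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The same limits apply here: lam=0 reduces each advantage to the one-step TD error, while lam=1 reduces it to the full discounted return minus the baseline V(s_t).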