Now that you have some foundational concepts down, let me give you some intuition. Let's say you want to get better at tennis. With the actor, or policy-based, approach, you roughly learn this way: you play a bunch of matches, then go home, lie on the couch, and commit to doing more of what you did in matches you won, and less of what you did in matches you lost. After repeating this process many, many times, you will have increased the probability of actions that lead to a win, and decreased the probability of actions that lead to a loss. But you can see how this approach is rather inefficient, as it needs lots of data to learn a useful policy. See, many of the actions taken in a match that ended in a loss could have been really good actions. Decreasing the probability of good actions only because you lost the match is not the best idea. Sure, if you repeat this process infinitely often, you're likely to end up with a good policy, but at the cost of slow learning. It is clear that policy-based agents have high variance.

With the critic, or value-based, approach, you learn differently. You start playing a match, and even before you get started, you begin guessing what the final score is going to be. You continue to make guesses throughout the match. At first, your guesses will be off, but as you gain more and more experience, you'll be able to make pretty solid guesses. The better your guesses, the better you can tell good from bad situations, or good from bad actions; and the better you make these distinctions, the better you'll perform, given, of course, that you select good actions. This is not a perfect approach either: guesses introduce bias because they'll sometimes be wrong, particularly early on, from a lack of experience. Guesses are prone to under- or overestimation. However, guesses are more consistent through time: if you think you'll win a match five minutes into it, chances are you'll still think so ten minutes into it.
This is what makes the TD estimate have lower variance. As you can see, in a policy-based approach the agent learns to act, and it gets good at that; in a value-based approach, the agent learns to estimate situations and actions, and it gets pretty good at that. Combining the two approaches sounds like a great idea, and it often yields better results. Actor-critic agents learn by playing games and adjusting the probabilities of good and bad actions, just as the actor does alone. But this time, they also use a critic to tell good from bad actions more quickly, and so speed up learning. In the end, actor-critic agents are more stable than value-based agents and need fewer samples than policy-based agents. Let's now look at a basic actor-critic agent.
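To make the combination concrete before diving in, here is a minimal sketch in plain Python. It is an illustrative toy, not the agent the text goes on to build: a single-state problem where the actor keeps softmax action preferences, the critic keeps a running value estimate, and the critic's TD-style error scales the actor's update. The reward means, step sizes, and variable names are all invented for this example.

```python
import math
import random

# Hypothetical toy problem: one state, two actions.
# Action 1 pays more on average, so a good policy should come to prefer it.
REWARD_MEANS = [0.2, 0.8]

def softmax(prefs):
    exps = [math.exp(p - max(prefs)) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
prefs = [0.0, 0.0]   # actor: action preferences (policy parameters)
value = 0.0          # critic: running estimate of the expected reward
alpha_actor, alpha_critic = 0.1, 0.1

for _ in range(2000):
    probs = softmax(prefs)
    action = random.choices([0, 1], weights=probs)[0]
    reward = REWARD_MEANS[action] + random.gauss(0, 0.1)

    # Critic: how much better or worse was this than the guess?
    td_error = reward - value
    value += alpha_critic * td_error

    # Actor: policy-gradient step scaled by the critic's error,
    # instead of waiting for a full match (episode) outcome.
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - probs[a]
        prefs[a] += alpha_actor * td_error * grad

print(softmax(prefs))  # probability of action 1 should end up high
```

The design point to notice is the division of labor: the actor alone would have to adjust its preferences using the raw outcome, while here the critic's error signal tells it, step by step, whether an action was better or worse than expected.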