You know now that an actor-critic agent is an agent that uses function approximation to learn both a policy and a value function. So, we will use two neural networks: one for the actor and one for the critic. The critic will learn to evaluate the state-value function V_pi using the TD estimate. Using the critic, we will calculate the advantage function and train the actor using this value. A very basic online actor-critic agent works as follows. You have two networks. One network, the actor, takes in a state and outputs a distribution over actions. The other network, the critic, takes in a state and outputs the state-value function of policy pi, V_pi. The algorithm goes like this. Input the current state into the actor and get the action to take in that state. Observe the next state and reward to get your experience tuple (s, a, r, s'). Then, using the TD estimate, which is the reward r plus the critic's discounted estimate for s' (that is, r + gamma * V(s')), you train the critic. Next, to calculate the advantage, A_pi(s, a) = r + gamma * V(s') - V(s), we also use the critic. Finally, we train the actor using the calculated advantage, where the critic's V(s) acts as the baseline. Easy, right? Let me show you some of the most popular actor-critic agents to date.
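The loop described above can be sketched in code. This is a minimal, hedged illustration, not a production implementation: it uses tabular actor logits and a tabular critic (a special case of function approximation, standing in for the two neural networks) on a hypothetical two-state chain environment invented here for demonstration, where action 1 from state 0 leads to state 1, and action 1 from state 1 earns reward +1 and ends the episode.

```python
import math
import random

random.seed(0)
GAMMA, LR_ACTOR, LR_CRITIC = 0.99, 0.2, 0.1

# Hypothetical toy environment (not from the lecture): a two-state chain.
def env_step(s, a):
    if s == 0:
        return (1, 0.0, False) if a == 1 else (0, 0.0, False)
    return (0, 1.0, True) if a == 1 else (0, 0.0, False)

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

theta = [[0.0, 0.0], [0.0, 0.0]]  # actor: per-state action preferences
V = [0.0, 0.0]                    # critic: state-value estimates

for _ in range(5000):
    s, done, steps = 0, False, 0
    while not done and steps < 30:
        # 1) Actor: input the state, sample an action from its distribution.
        probs = softmax(theta[s])
        a = random.choices([0, 1], weights=probs)[0]
        # 2) Observe next state and reward: the (s, a, r, s') tuple.
        s2, r, done = env_step(s, a)
        # 3) TD target r + gamma * V(s'); advantage = target - V(s).
        target = r + (0.0 if done else GAMMA * V[s2])
        adv = target - V[s]
        # 4) Train the critic toward the TD target.
        V[s] += LR_CRITIC * adv
        # 5) Train the actor: policy-gradient step scaled by the advantage.
        for b in range(2):
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += LR_ACTOR * adv * grad
        s, steps = s2, steps + 1

# After training, the policy should strongly prefer action 1 in both states.
print(softmax(theta[0])[1], softmax(theta[1])[1])
```

With neural networks, the two tables would be replaced by parameterized function approximators and the updates by gradient descent on the TD error (critic) and the advantage-weighted log-probability (actor), but the flow of the loop is the same.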