In case you’re not clear on what on-policy versus off-policy learning is, let me explain that real quick. On-policy learning is when the policy used for interacting with the environment is also the policy being learned. Off-policy learning is when the policy used for interacting with the environment is different from the policy being learned.

Sarsa is a good example of an on-policy learning agent. A Sarsa agent uses the same policy to interact with the environment as the policy it is learning. On the other hand, Q-learning is a good example of an off-policy learning agent. A Q-learning agent learns about the optimal policy, even though the policy that generates its behavior is an exploratory policy, often epsilon-greedy.

Looking at the update equations for these two methods helps us understand them better. As you can see, in Sarsa the action used for calculating the TD target and the TD error is the action the agent will take in the following time step, A prime. In Q-learning, however, the action used for calculating the target is the action with the highest value. But this action is not guaranteed to be the one the agent uses to interact with the environment in the following time step. In other words, it is not necessarily A prime. The Q-learning agent may choose an exploratory action in the next step; in Sarsa, that action, exploratory or not, has already been chosen. Q-learning learns the deterministic optimal policy even if its behavior policy is totally random; Sarsa learns the best exploratory policy, that is, the best policy that still explores.

DQN is also an off-policy learning method: the agent behaves with some exploratory policy, say epsilon-greedy, while it learns about the optimal policy. When using off-policy learning, agents are able to learn from many different sources, including experiences generated by all past versions of the agent itself, thus the replay buffer. However, off-policy learning is known to be unstable and to often diverge when combined with deep neural networks.
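To make the difference concrete, here is a minimal sketch of the two tabular update rules described above. The function names, the learning rate alpha, and the discount gamma are illustrative choices, not part of any particular library; the only real difference between the two updates is which next action goes into the TD target.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the TD target uses A' (a_next), the action the
    agent will actually take in the next time step."""
    td_target = r + gamma * Q[s_next, a_next]
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the TD target uses the highest-valued action in
    the next state, which the (epsilon-greedy) behavior policy is
    not guaranteed to take."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
```

Notice that `sarsa_update` needs `a_next` as an argument, because the next action must already be chosen before the update, while `q_learning_update` does not.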
A3C, on the other hand, is an on-policy learning method. With on-policy learning, you only use the data generated by the policy currently being learned about, and any time you improve the policy, you toss out the old data and go collect some more. On-policy learning is a bit inefficient in its use of experiences, but it often has more stable and consistent convergence properties.

A simple analogy for on- and off-policy goes this way. On-policy is learning from your own hands-on experience, for example, the projects in this Nanodegree. As you can imagine, that is a pretty good way of learning, but it is somewhat data inefficient; you can only do so many projects before you run out of time. Off-policy, on the other hand, is learning from someone else’s experience, and as such it is more sample efficient because, well, you can learn from many different sources. However, this way of learning is more prone to misunderstandings; I might not be able to explain things in a way that you understand well. The Nanodegree analogy in the off-policy case is learning from watching the lessons, for example. You learn much faster this way, but again, perhaps not as deeply as you would from your own hands-on experience. Usually, a good balance between these two ways of learning allows for the best-performing deep reinforcement learning agents, and perhaps the best-performing humans too. Who knew?

Now, if you’re interested in this topic and you want to learn about an agent that combines on- and off-policy learning, I recommend you read the paper by Gu et al. titled “Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic.” But remember, reading is off-policy learning. So if you really want to learn about this topic, after reading the paper, go ahead and implement a Q-Prop agent yourself for your project. That will be challenging and fun.
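The two data regimes contrasted above, on-policy “collect, update, toss out the old data” versus off-policy reuse of old experience through a replay buffer, can be sketched as follows. The helper callables `collect_rollout` and `update_policy` are hypothetical stand-ins for an agent’s actual collection and training code; the buffer is a minimal sketch, not DQN’s exact implementation.

```python
import random
from collections import deque

def on_policy_loop(collect_rollout, update_policy, iterations=3):
    """On-policy regime: every update consumes only freshly collected
    data; once the policy changes, that data is stale and discarded."""
    for _ in range(iterations):
        batch = collect_rollout()  # data from the CURRENT policy only
        update_policy(batch)       # after this step the batch is stale...
        del batch                  # ...so it is thrown away

class ReplayBuffer:
    """Off-policy regime: experience from all past versions of the
    agent is kept around and reused for many updates."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall out

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```

The deque’s `maxlen` gives the sliding-window behavior of a typical replay buffer: once full, adding a new transition silently drops the oldest one.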