# 1 – L601 Intro

Let’s recall the problem at hand. We have an agent and an environment. Time is broken into discrete time steps, and at every time step, the agent receives a state and reward from the environment and chooses an action to perform in response. In this way, the interaction is a sequence of states, actions, and rewards. In this lesson, we will confine our attention to episodic tasks, where the interaction stops at some final time step T when the agent encounters a terminal state. We refer to this sequence as an episode.

For any episode, the agent’s goal is to find the optimal policy in order to maximize expected cumulative reward. As we’ve seen, the agent can only accomplish this by interacting with the environment. In this lesson, we’ll dig deeply into a class of algorithms that help the agent to understand and leverage this interaction to obtain the optimal policy.