Remember the cute puppy from the previous lesson? He set the stage as an agent who learns from trial and error how to behave in an environment to maximize reward. But what do we mean when we talk about reinforcement learning in general? Well, you might be surprised to hear that not much changes when we trade this puppy for a self-driving car, a robot, or, more generally, any reinforcement learning agent.

In particular, the RL framework is characterized by an agent learning to interact with its environment. We assume that time evolves in discrete timesteps. At the initial timestep, the agent observes the environment. You can think of this observation as a situation that the environment presents to the agent. Then, it must select an appropriate action in response. At the next timestep, in response to the agent's action, the environment presents a new situation to the agent. At the same time, the environment gives the agent a reward, which provides some indication of whether the agent has responded appropriately to the environment. Then the process continues, where at each timestep the environment sends the agent an observation and a reward, and in response, the agent must choose an action.

In general, we don't need to assume that the environment shows the agent everything it needs to make well-informed decisions. But it greatly simplifies the underlying mathematics if we do. So in this course, we'll make the assumption that the agent is able to fully observe whatever state the environment is in. And instead of referring to the agent as receiving an observation, we'll henceforth say that it receives the environment state.

But let's make this description a bit clearer with some added notation, where we again start from the very beginning at timestep zero. The agent first receives the environment state, which we denote by S0, where zero stands for timestep zero, of course.
Then, based on that observation, the agent chooses an action, A0. At the next timestep (in this case, timestep one), as a direct consequence of the agent's choice of action A0 and the environment's previous state S0, the environment transitions to a new state, S1, and gives some reward, R1, to the agent. The agent then chooses an action, A1. At timestep two, the process continues, where the environment passes the reward and state, the agent responds with an action, and so on. As the agent interacts with the environment, this interaction manifests as a sequence of states, actions, and rewards.

That said, the reward will always be the most relevant quantity to the agent. To be specific, any agent has the goal to maximize expected cumulative reward, or the sum of the rewards attained over all timesteps. In other words, it seeks to find a strategy for choosing actions for which the cumulative reward is likely to be quite high. And the agent can only accomplish this by interacting with the environment. This is because, at every timestep, the environment decides how much reward the agent receives. In other words, the agent must play by the rules of the environment. But through interaction, the agent can learn those rules and choose appropriate actions to accomplish its goal. And this is essentially what we'll try to accomplish in this course.

But it's important to emphasize that all of this is just a mathematical model for a real-world problem. So if you have a problem in mind that you think can be solved with reinforcement learning, you will have to specify the states, actions, and rewards, and you'll have to decide the rules of the environment. In this course, you'll see a lot of examples of how to accomplish this.
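The interaction loop described above, where the agent observes a state, chooses an action, and then receives the next state and a reward, can be sketched in a few lines of Python. Everything here is invented purely for illustration: the `ToyEnv` class, its dynamics, and the `RandomAgent` are assumptions, not part of the lesson.

```python
import random

class ToyEnv:
    """A made-up environment with states 0..4; moving right earns reward."""
    def reset(self):
        self.state = 0
        return self.state  # S0: the initial state shown to the agent

    def step(self, action):
        # The environment, not the agent, decides the next state and reward.
        self.state = max(0, min(4, self.state + action))
        reward = 1 if action == 1 else 0
        return self.state, reward  # S_{t+1}, R_{t+1}

class RandomAgent:
    """Chooses actions at random; a learning agent would improve on this."""
    def act(self, state):
        return random.choice([-1, 1])  # move left or right

env, agent = ToyEnv(), RandomAgent()
state = env.reset()                    # agent observes S0
for t in range(5):                     # time evolves in discrete timesteps
    action = agent.act(state)          # agent selects A_t
    state, reward = env.step(action)   # environment returns S_{t+1}, R_{t+1}
    print(f"t={t}: action={action}, next state={state}, reward={reward}")
```

The key design point mirrors the text: the agent only picks actions, while the state transitions and rewards are entirely under the environment's control.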
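The goal of maximizing cumulative reward also suggests how an agent (or we, while experimenting) can compare strategies: run each one against the environment and sum the rewards. The tiny environment and the two fixed policies below are hypothetical, chosen only so the comparison is easy to check by hand.

```python
def run_episode(policy, n_steps=5):
    """Simulate a made-up environment where moving right (+1) pays off.

    Both the dynamics and the policies are illustrative assumptions,
    not something defined in the lesson.
    """
    state, total = 0, 0
    for _ in range(n_steps):
        action = policy(state)
        state = max(0, min(4, state + action))  # environment picks next state
        total += 1 if action == 1 else 0        # environment picks the reward
    return total  # cumulative reward: the sum of rewards over all timesteps

always_right = lambda s: 1
always_left = lambda s: -1

# Cumulative reward reveals which strategy this environment favors.
print(run_episode(always_right))  # → 5
print(run_episode(always_left))   # → 0
```

Neither policy knows the rules in advance; only by interacting and comparing cumulative rewards can an agent discover that "always right" is the better strategy here.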