We’ve seen that we use a Markov decision process, or MDP, as a formal definition of the problem that we’d like to solve with reinforcement learning. In this video, we specify a formal definition for the solution to this problem. We can start to think of the solution as a series of actions that need to be learned by the agent in pursuit of a goal. For instance, in order to walk, the humanoid robot needs to learn the appropriate way to apply forces to its joints. But as we’ve seen, the correct action changes with the situation. If a robot encounters a wall, the best series of actions will be different than if it had nothing blocking its path. An action is always decided in the context of the state in which it was taken, along with the state that follows. With this in mind, as long as the agent learns an appropriate action response to any environment state that it can observe, we have a solution to our problem. This motivates the idea of a policy. The simplest kind of policy is a mapping from the set of environment states to the set of possible actions. In case you’re new to the idea of a mapping, you can think of a policy as a factory that takes any environment state as input and outputs the corresponding action that the agent will take. If the agent wants to keep track of its strategy, all it needs to do is to build this factory, or to specify this mapping. We call this kind of policy a deterministic policy. The input is the state; the output is the action to take. And as you can see, it’s most common to denote the policy with the Greek letter pi. Another type of policy that we’ll examine is a stochastic policy. A stochastic policy allows the agent to choose actions randomly. We define a stochastic policy as a mapping that accepts an environment state S and action A and returns the probability that the agent takes action A while in state S. For clarity, let’s revisit the recycling robot example from the previous lesson.
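The two kinds of policy described above can be sketched in code. This is a minimal illustration, not part of the lesson itself: the state and action names are made up, and the probabilities are arbitrary. A deterministic policy is just a mapping from state to action; a stochastic policy maps each state to a probability distribution over actions.

```python
import random

# Deterministic policy: a plain mapping from state to action.
# State and action names here are purely illustrative.
deterministic_policy = {
    "path_clear": "walk_forward",
    "wall_ahead": "turn_left",
}

def act_deterministic(state):
    """Return the single action the policy assigns to this state."""
    return deterministic_policy[state]

# Stochastic policy: pi(a | s), a probability for each action in a state.
stochastic_policy = {
    "path_clear": {"walk_forward": 0.9, "turn_left": 0.1},
    "wall_ahead": {"walk_forward": 0.0, "turn_left": 1.0},
}

def act_stochastic(state):
    """Sample an action according to the probabilities pi(. | state)."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs)[0]
```

Notice that the deterministic policy is a special case: it is what you get when each state's distribution puts all of its probability on one action.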
The deterministic policy would specify something like: whenever the battery is low, recharge it, and whenever the battery has a high amount of charge, search for cans. The stochastic policy does something more like: whenever the battery is low, recharge it with 50 percent probability, wait where you are with 40 percent probability, and otherwise search for cans. Whenever the battery is high, search for cans with 90 percent probability, and otherwise wait for a can. It’s important to note that any deterministic policy can be expressed using the same notation that we generally reserve for a stochastic policy. For instance, this policy can be expressed as: whenever the battery is low, recharge it with 100 percent probability; whenever the battery has a high amount of charge, search for cans with 100 percent probability. We will explore policies with varying levels of randomness, or stochasticity, in this course. Now, at this point you might be wondering: now that we know how to specify a policy, what steps can we take to make sure that the agent’s policy is the best one? We will work towards answering this question in the next few concepts.
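The recycling robot policies described above can be written down directly, with the deterministic one expressed in the stochastic notation as promised. This is a sketch under the assumption that the states are "low" and "high" battery and the actions are "search", "wait", and "recharge"; the probabilities are the ones stated in the example.

```python
import random

# Stochastic policy for the recycling robot, using the probabilities
# from the example: when the battery is low, recharge with probability
# 0.5, wait with 0.4, and otherwise (0.1) search; when it is high,
# search with 0.9 and otherwise wait.
pi_stochastic = {
    "low":  {"recharge": 0.5, "wait": 0.4, "search": 0.1},
    "high": {"search": 0.9, "wait": 0.1},
}

# The deterministic policy written in the same notation: all of the
# probability mass sits on a single action in each state.
pi_deterministic = {
    "low":  {"recharge": 1.0},
    "high": {"search": 1.0},
}

def sample_action(policy, state):
    """Sample an action from the policy's distribution for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]
```

Because `pi_deterministic` assigns probability 1 to one action per state, sampling from it always returns the same action, which is exactly the point: the stochastic notation subsumes the deterministic case.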