So, in general, when the agent is interacting with the environment, and still trying to figure out what works and what doesn't in its quest to collect as much reward as possible, constructing greedy policies is quite dangerous. To see this, let's look at an example. Say you're an agent, and there are two doors in front of you, and you need to decide which one has more value. At the beginning, you have no reason to favor either door over the other, so let's say you initialize your estimate for the value of each door to zero. In order to figure out which door to open, you flip a coin; it comes up tails, and so you open Door B. When you do that, you receive a reward of zero. Let's say for simplicity that an episode finishes after a single door is opened. In other words, after opening Door B, you received a return of zero. That doesn't change the estimate of the value function, so it makes sense to just pick a door randomly again. You flip a coin, it comes up heads this time, and so you open Door A. When you do this, you get a reward of one, which updates the estimate for the value of Door A to one. Now, if we act greedily with respect to the value function, we open Door A again. This time we get a reward of three, which updates the value of Door A to two, and so at the next point in time, the greedy policy says to pick Door A again. Every time we do that, we get some positive reward, and it's always either one or three. So, for all time, we're opening the same door.

There's a big problem with this, because we never really got a chance to truly explore what's behind the other door. For instance, consider the case where the mechanism behind Door A is what you'd expect: it yields a reward of one or three, where both are equally likely. But the mechanism behind Door B gives a reward of zero or 100. That's information you would have liked to discover, but following the greedy policy has prevented you from discovering it. So the point is that when we got to a situation early in our investigation where Door A seemed more favorable than Door B, we really needed to spend more time making sure of that, because our early perceptions were incorrect. Instead of constructing that greedy policy, a better policy would be a stochastic one that picks Door A with 95 percent probability and Door B with 5 percent probability, let's say. That's pretty close to the greedy policy, so we're still acting pretty optimally, but there's the added value that if we continue to select Door B with some small probability, then at some point we're going to see that return of 100.

This example motivates how we'll need to modify our current approach. Instead of constructing a greedy policy that always selects the greedy action, what we'll do instead is construct a so-called Epsilon-greedy policy that's most likely to pick the greedy action, but with some small but nonzero probability picks one of the other actions instead. In this case, we set some small positive number Epsilon, which must be between zero and one. Then, with probability one minus Epsilon, the agent selects the greedy action, and with probability Epsilon, it selects an action uniformly at random. So the larger Epsilon is, the more likely the agent is to pick one of the non-greedy actions. Then, as long as Epsilon is set to a small number, we have a method for constructing a policy that's really close to the greedy policy, but with the added benefit that it doesn't prevent the agent from continuing to explore the range of possibilities.
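To make this concrete, here is a minimal Python sketch of an Epsilon-greedy agent on the two-door example. The reward mechanisms (Door A pays one or three, Door B pays zero or 100, each equally likely), the running-average value update, and the function names are assumptions made for illustration; they are not prescribed by the lesson.

```python
import random

# Assumed reward mechanisms matching the example:
# Door A yields 1 or 3 with equal probability; Door B yields 0 or 100.
REWARDS = {"A": [1, 3], "B": [0, 100]}

def epsilon_greedy_action(values, epsilon):
    """With probability epsilon pick a door uniformly at random;
    otherwise pick the door with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(values))
    return max(values, key=values.get)

def run(episodes=10_000, epsilon=0.05):
    values = {"A": 0.0, "B": 0.0}   # initial value estimates
    counts = {"A": 0, "B": 0}
    for _ in range(episodes):
        door = epsilon_greedy_action(values, epsilon)
        reward = random.choice(REWARDS[door])
        counts[door] += 1
        # incremental (running-average) update of the value estimate
        values[door] += (reward - values[door]) / counts[door]
    return values

print(run())  # under these assumed mechanisms, Door B's estimate approaches 50, Door A's approaches 2
```

Under those assumed mechanisms, the Epsilon-greedy agent keeps opening Door B occasionally, so its estimate climbs toward 50 and the greedy choice eventually flips to Door B, whereas setting epsilon to zero reproduces the purely greedy agent that can get stuck on Door A forever.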