So, we’re working with a small grid world example, with an agent who would like to make it all the way to the state in the bottom right corner as quickly as possible. So, how should the agent begin if it initially knows nothing about the environment? Well, probably the most sensible thing for the agent to do at the beginning, when it doesn’t know anything, is just to behave randomly and see what happens. So, let’s say that when the agent encounters a new state, it just selects up, down, left, or right with equal probability. When the agent randomly selects an action in this way, where each action has an equal chance of being selected, we say that it’s following the equiprobable random policy.

So, for instance, at the start of the first episode, say the agent is in state one. Then, to select an action, it spins the wheel, which tells it to pick action up. Now, remember the world is slippery, so the agent may not actually move where it intends. But in this case, say the agent does move up; as a result, it receives a reward of negative one and ends up in state two. Then, say it follows the same process of randomly selecting an action. This time it selects action left. Again, the world is slippery, and let’s say that when the agent executes action left, it instead slides up and hits the wall, which bounces it right back into state two, and it receives a reward of negative one. Say the agent continues this process of randomly selecting actions until the end of the episode, and say that the full episode is shown here. Then, say the agent follows the same process one more time to collect a second episode.

Certainly, there must be some valuable information in here. How can the agent consolidate this experience in a way that allows it to improve upon its currently very random strategy? I mean, certainly the agent can do much better than it’s doing now. So, how do we accomplish that? Well, remember what we’re searching for: the optimal policy. It tells us, for each state, which action or actions are most useful towards the goal of maximizing return, or getting as much cumulative reward as we can over all time steps. If the question really is that simple, namely, from each state, which action is best, maybe it makes sense to just pick the action that got the most reward in these episodes that we’ve collected so far.

For instance, we see that in both of these episodes, the agent began in state one. When it selected action up, it ended up with a final score of seven, obtained by just summing up all the rewards it received over the episode. When it selected action right, it ended up with a final score of six. Maybe the agent can use this as an indication that action up might be better than action right when it’s in state one. This is a useful first step for thinking about how we might be able to estimate the optimal policy from interacting with the environment. In the upcoming videos, you’ll learn more about how to formalize this idea and turn it into an algorithm that can reliably obtain the optimal policy. For now, I should mention that two episodes, or even 10 or 20, is definitely not enough to truly have a strong understanding of the environment. Of course, this is partially because the agent hasn’t tried out each action from each state. For instance, the agent has tried going up from state one, and it has also tried going right. But it doesn’t know what would happen if it instead decided to go down or left.
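To make this concrete, here is a minimal sketch in Python of collecting an episode with the equiprobable random policy and summing up the rewards that follow each state-action pair. The `env` object, with `reset()` and `step()` methods that return a next state, a reward, and a done flag, is a hypothetical stand-in for the slippery grid world rather than anything defined in this lesson; the action names are placeholders too.

```python
import random

ACTIONS = ["up", "down", "left", "right"]

def random_policy(state):
    # Equiprobable random policy: every action has the same chance of being selected.
    return random.choice(ACTIONS)

def generate_episode(env, policy):
    # Roll out one full episode as a list of (state, action, reward) tuples.
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        # The slippery dynamics live inside env.step: the agent may not move where it intends.
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state
    return episode

def first_visit_returns(episode):
    # For each (state, action) pair, sum the rewards from its first occurrence to the end
    # of the episode. For a pair taken at the very first time step, this is exactly the
    # "final score" described above: the sum of all rewards received over the episode.
    G = 0.0
    returns = {}
    for state, action, reward in reversed(episode):
        G += reward
        # Walking backwards, the last value we write for a pair corresponds to its
        # earliest (first) visit, so that value is the one that sticks.
        returns[(state, action)] = G
    return returns
```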
We also have to remember that the dynamics are set up so that the agent only moves with 70 percent probability in the direction that it intends. What if by chance we got really unlucky, and the agent happened to experience the highly unlikely event that it always moved in the wrong direction? Well, then it would be pretty bad if the agent inferred the best actions based on data that’s unlikely to repeat in the future. Of course, there’s a quick solution for this: it just involves collecting many more episodes. If you collect hundreds or thousands of episodes, for instance, you should be able to make relatively well-informed decisions. In fact, this is a fundamental idea behind Monte Carlo methods in general. Even though the underlying problem involves a great degree of randomness, we can infer useful information that we can trust, just by collecting a lot of samples. For now, for the purposes of this small example, we’ll assume that two or three episodes is enough. But in real-world situations, you’ll have to interact with the environment for many more episodes. For the next couple of concepts, we will think deeply about what exactly we should do with these episodes that we’ve collected. What kind of information should we extract from them? How can it help us in our search for an optimal policy?
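Continuing the sketch, and reusing the hypothetical `generate_episode` and `first_visit_returns` helpers from above, here is one way the agent might average the observed returns over many episodes so that unlucky slips wash out in the long run. This is just an illustrative sketch of the Monte Carlo idea, not the exact algorithm developed in the upcoming videos.

```python
from collections import defaultdict

def estimate_action_values(env, policy, num_episodes=1000):
    # Average the return observed after each (state, action) pair across many episodes.
    # A single unlucky episode (say, the agent repeatedly slipping in the wrong direction)
    # has little influence once hundreds or thousands of samples have been collected.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode(env, policy)
        for pair, G in first_visit_returns(episode).items():
            totals[pair] += G
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

# With these estimates in hand, the agent could prefer, in each state, the action whose
# average observed return is highest, much like preferring "up" over "right" in state one.
```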