So, at this point you’re about to implement or you just implemented your first R method that can help an agent recover the optimal policy for an environment. You should be proud of all of your hard work and it’s about to pay off. Specifically, we’ll implement Constant-Alpha MC Control, we’ll also make sure that the values of Epsilon that we use when constructing the Epsilon greedy policy satisfy the GLIE conditions. This will ensure that we’ll converge to the optimal policy. Also, remember in the black jack example that first visit and every visit, Monte Carlo Control will yield identical results so you can feel free to implement either. We’ve implemented every visit Monte Carlo Control, so just keep that in mind when you’re looking at the solution. So, we’ve opened the pseudocode next to the solution notebook so that we can see how the implementation lines up. The first thing that we do in the pseudocode is initialize all of the values in the Q table to zero. That’s also the first thing that we do here in the solution notebook. So, Q is initialized to an empty dictionary of arrays where we also need to parse in the total number of actions that are in the environment. So we keep track of that with the variable nA that’ll come in useful actually throughout this implementation. So after initializing the Q table, we loop over episodes and then with each episode we set the value of Epsilon, construct the corresponding Epsilon greedy policy with respect to the most recent estimate of the Q table and then generate an episode using that Epsilon greedy policy. We’ve also done the same thing in the code here where we loop over episodes, this is just a monitor that helps us keep track of how many episodes we’ve completed out of the total, but in any event for each episode we update the value of Epsilon and then generate an episode by following the Epsilon greedy policy. So these two steps with constructing the Epsilon greedy policy and generating an episode are wrapped up in this generate_episode_from_Q function that we’ve implemented above. So before we look at this generate_episode_from_Q function, let’s look a bit closer at how we set the value of Epsilon. So, before entering this loop over episodes, we start off the value of Epsilon at some initialized value Epsilon start. So, we see here that that value is currently set to one. But then for each episode we slightly decay the value of Epsilon by multiplying it by this value Epsilon decay. So that value is nearly one, so it just decays a bit, but we don’t want Epsilon to get too small because we want to constantly ensure at least some small amount of exploration throughout the process. So, we won’t let the value of Epsilon go below this value Epsilon min which is set to 0.05. Then after this value of Epsilon is updated it’s sent to this generate_episode_from_Q function, you can scroll up to see how that works. So, you’ll notice that the function takes the environment, the most recent estimate of the Q table, the value of Epsilon and the total number of actions available to the agent as input. As output, it returns an episode which is just a sequence of state action reward tuples. So, along the way the agent will use the Epsilon greedy policy to select actions. So, we’ve implemented that using the random.choice method from NumPy. So, that takes us input the set of possible actions and we also have to feed it the probabilities corresponding to the Epsilon greedy policy. So specifically, when selecting an action from a state we need to get the probabilities, the action probabilities corresponding to Epsilon greedy action selection from that state. That’s implemented in this get_probs function which takes as input along with Epsilon and the number of actions available to the agent, the row in the Q table corresponding to the state that the agent has to select the action from. So you can take a look at that get_probs function here. So, once we have that episode, we’ll go back to the pseudocode, once we have that episode we just look at each state action pair that appears in the episode for our first visit Monte Carlo Control, if it’s a first visit then we apply this update equation. Remember in the solution we actually implemented every visit Monte Carlo Control, and so we apply this update equation for every state action pair that appears in the episode. So, in the code we’ve implemented this update equation in the update_Q function here, and so to see how it ties into the Monte Carlo Control function you’ll notice that after generating an episode we parse the last estimate of the Q table into this update_Q function which gives us a new value for the Q table. Until as long as you run this algorithm for enough episodes, it’s guaranteed that this Q table will correspond to the optimal action value function. Then to get a corresponding optimal policy, all we need to do is for each state just pick the action that maximizes the row corresponding to that state. So, that’s what we’ve done here. So, we’ll run this code cells. So we’re running the Monte Carlo Controlled method for 500,000 episodes and with the value of Alpha set to 0.02. So, once that finishes running, you’ll have plotted, you can run the next code cell to plot the corresponding estimated optimal state value function. Remember there are two plots corresponding to whether we do or don’t have a usable ace, and if you run the next code cell then you’ll have the estimated optimal policy. You can compare that to the true optimal policy that’s pulled from the textbook. So, you’ll notice here that they don’t exactly line up, but if you want you can go ahead and run the algorithm for longer, play around with tweaking the value of Alpha and how Epsilon is allowed to decay an algorithm to see how close you can get to that optimal policy.