Hello. Once you finish writing your implementation, you can click on Jupyter in the top left corner. That'll bring you to a list of files on the left here. You can open the Monte_Carlo_Solution.ipynb notebook to see one way that we solved the exercise. So, what I did first was I restarted the kernel and cleared all outputs, just so it's clear exactly what the code is doing. We'll begin by importing the necessary packages and starting up the environment. Then, we print some information about the environment. So, there are 704 different states in the environment, corresponding to 32 times 11 times 2, and two possible actions, corresponding to selecting to stick or to hit. Then we've written this for loop here that allows you to see some sample games, where at each time step of each episode the agent selects either to stick or hit with 50% probability each. So, we see in these three sample games that in the first game the agent won, and then it lost the two games after that. You can run this as many times as you want. So, in the next three games it lost all of them, and so on, and here it won the last game. So, our goal in this notebook is to figure out a different strategy that makes the agent win more often. Okay. So, on to part one. In this case, we are considering the policy where the player decides an action based only on her current score. So, if the sum of the cards we have is 18 or less, we figure that it's probably okay to ask for a new card, and we do that with 80 percent probability. If the sum of the cards is bigger than 18, it's probably too dangerous to accept a new card, so we're most likely to decide to stick. In this case, there's a lot of information in the state that we're ignoring, like the dealer's face-up card, or whether or not we have a usable ace. But it's a good place to start for now.
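As a rough sketch of that 80/20 score-based policy (this is an illustration, not the notebook's exact code; it assumes the Blackjack state is a tuple `(player_sum, dealer_card, usable_ace)` and that action 0 means stick and action 1 means hit, as in the Gym Blackjack environment):

```python
import random

def stochastic_action(state):
    """Pick an action under the simple score-based policy.

    Assumes state = (player_sum, dealer_card, usable_ace)
    and action 0 = stick, action 1 = hit.
    """
    player_sum = state[0]
    if player_sum <= 18:
        # Low sum: hit with 80% probability, stick with 20%.
        probs = [0.2, 0.8]
    else:
        # High sum: stick with 80% probability, hit with 20%.
        probs = [0.8, 0.2]
    return random.choices([0, 1], weights=probs)[0]
```

Note that the policy ignores everything in the state except the player's sum, which is exactly the simplification described above.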
So, this generate_episode_from_limit_stochastic function simulates that policy, and it returns a corresponding episode from start to finish. And we can use this code cell to run that function three times. We see again, sometimes we win, sometimes we lose. The reward is received only at the end of the game and is -1 if we lose and +1 if we win. So, here, this episode lasted longer than the next two episodes. We see that at the initial time step no reward was received. No reward was received again for the second action, but then when the game ended, we got a reward of +1 because the agent won the game. So, then your goal in this notebook was to use this function that simulates a very specific policy and estimate its corresponding action-value function using this mc_prediction_q function. So, when you take a look at this mc_prediction_q function, you'll notice that we keep track of three separate dictionaries: return_sum, N, and Q. To understand the role that these dictionaries play in the algorithm, we'll take a look at the pseudocode. So, Q will be exactly what it sounds like. It's the Q-table, which contains the estimates for the action values corresponding to each state-action pair. So, remember, in Monte Carlo prediction we estimate the action value corresponding to each state-action pair as just the average of the returns obtained after visiting that state-action pair. So, one way to get the average is to sum up all the numbers that we want to average and divide by how many there are, and that count is what's contained in this dictionary N. So, N of s, a is just the total number of times that we visited a particular state-action pair over all episodes, and return_sum is just the sum of all of the returns obtained over all episodes after visiting a particular state-action pair.
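The episode-generation routine can be sketched roughly like this. The FakeBlackjackEnv class below is a hypothetical stand-in with a gym-style reset/step interface (its dynamics and rewards are made up, just so the sketch runs end to end); in the notebook the real Blackjack environment plays that role:

```python
import numpy as np

class FakeBlackjackEnv:
    """Hypothetical stand-in for a gym-style Blackjack env (toy dynamics)."""
    def reset(self):
        self.steps = 0
        return (12, 10, False)

    def step(self, action):
        self.steps += 1
        # Episode ends on a stick (action 0) or after three hits.
        done = (action == 0) or (self.steps >= 3)
        reward = 1.0 if done else 0.0   # toy reward, not real Blackjack rules
        return (12 + 2 * self.steps, 10, False), reward, done, {}

def generate_episode_from_limit_stochastic(bj_env):
    """Roll out one episode under the 80/20 score-based policy."""
    episode = []
    state = bj_env.reset()
    while True:
        # Stick-heavy above a sum of 18, hit-heavy otherwise
        # (action 0 = stick, action 1 = hit).
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode
```

The returned episode is a list of (state, action, reward) tuples, which is exactly the shape the prediction code below consumes.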
So, returning to the mc_prediction_q function, we have our three dictionaries return_sum, N, and Q, as we've discussed. Then, we just loop over episodes. So, we use that provided function to generate an episode, which is going to be a list of state-action-reward tuples. Then we use this zip command to separate the states, actions, and rewards into separate quantities. And I've written some code here to tease that out a little bit with some print statements. Here's an episode that lasted three time steps. So, here's the initial state, the initial action, and then the reward that resulted. At the next time step we also got a reward of zero. Then the game ended, the agent won, so the agent got a reward of one, and this zip command is useful for separating the states, actions, and rewards from this original list of tuples. All right. So, what's left? Let's go back to the pseudocode to see. So, after generating an episode, we loop over time steps and look at the corresponding state-action pair for each time step. If that's the first time that we visited that pair in the episode, then we increment N by one and add the return that was experienced by the agent from that point onward to the corresponding entry in return_sum. Now, remember that in this Blackjack example, first-visit and every-visit MC prediction are equivalent. We found it easier to implement every-visit MC prediction, so we'll do this update for every time step in each episode. So, then, once we have the corresponding updated values for return_sum and N, we can use them to also update our action-value estimate, or our Q-table. So, returning to the code, that's exactly what we've done here, where we precompute these discount factors that we'll multiply by the rewards before adding them up. So, then, if you scroll down, we run that mc_prediction_q function for 500,000 episodes.
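Putting those pieces together, the every-visit prediction loop can be sketched like this (a sketch under the assumptions above: two actions, a generate_episode function returning (state, action, reward) tuples, and a discount factor gamma, which defaults to 1 since Blackjack rewards arrive only at the end):

```python
from collections import defaultdict
import numpy as np

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # return_sum[s][a]: sum of returns observed after visiting (s, a)
    # N[s][a]: number of visits to (s, a);  Q = return_sum / N
    nA = 2  # two Blackjack actions: stick or hit
    return_sum = defaultdict(lambda: np.zeros(nA))
    N = defaultdict(lambda: np.zeros(nA))
    Q = defaultdict(lambda: np.zeros(nA))
    for _ in range(num_episodes):
        episode = generate_episode(env)
        # zip(*...) separates the list of tuples into states, actions, rewards.
        states, actions, rewards = zip(*episode)
        # Precompute the discount factors 1, gamma, gamma^2, ...
        discounts = np.array([gamma ** i for i in range(len(rewards) + 1)])
        # Every-visit update: one update per time step of the episode.
        for i, state in enumerate(states):
            return_sum[state][actions[i]] += sum(rewards[i:] * discounts[:-(1 + i)])
            N[state][actions[i]] += 1.0
            Q[state][actions[i]] = return_sum[state][actions[i]] / N[state][actions[i]]
    return Q
```

Because every time step updates its state-action pair, this is every-visit MC prediction; for Blackjack it gives the same answer as first-visit, since no state-action pair can repeat within one hand.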
That yields a corresponding Q-table, or an estimate of the action-value function, and then we show you how to get the corresponding state-value function. So, we've done that calculation for you, and we plot that corresponding state-value function. Okay. So, once the code has finished running, you'll see that there are two plots, corresponding to whether or not we have a usable ace. But in either case, we see that the highest state values correspond to when the player's sum is something like 20 or 21, because in that case we're most likely to win the game. But in any event, I encourage you to take the time now to explore this solution in more detail. Write any print statements that you need to tease out particular parts of the code. Then, in the next part of the lesson and in this notebook, you'll learn how to use what you've done here to obtain the optimal policy.
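For reference, the state-value function under this policy can be recovered from the Q-table by weighting each action value by the policy's probabilities. This sketch assumes each Q entry is a length-2 array indexed [stick, hit], matching the 80/20 policy described earlier:

```python
import numpy as np

def q_to_v(Q):
    """V(s) = sum over a of pi(a|s) * Q(s, a) under the 80/20 policy."""
    V = {}
    for state, action_values in Q.items():
        # Stick-heavy above a player sum of 18, hit-heavy otherwise.
        probs = np.array([0.8, 0.2]) if state[0] > 18 else np.array([0.2, 0.8])
        V[state] = float(np.dot(probs, action_values))
    return V
```

It is this dictionary V, keyed by state, that gets passed to the plotting utility to produce the two surfaces (with and without a usable ace).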