Workspace – Introduction
You will write all of your implementations within the classroom, using an interface identical to the one shown below. Your Workspace contains the following files (among others):
- Monte_Carlo.ipynb – the Jupyter notebook where you will write all of your implementations (this is the only file that you will modify!)
- Monte_Carlo_Solution.ipynb – the corresponding instructor solutions
- plot_utils.py – contains a plotting function for visualizing state-value functions and policies
The Monte_Carlo.ipynb notebook can be found below. To peruse the other files, you need only click on “jupyter” in the top left corner to return to the Notebook dashboard.

The Workspace
Please do not write or execute any code just yet – for now, you’re encouraged to simply explore the file structure. In particular, make sure you know where to find the solution notebook. We’ll get started with coding within the Workspace soon!
Monte Carlo Methods
In this notebook, you will write your own implementations of many Monte Carlo (MC) algorithms.
While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.
Part 0: Explore BlackjackEnv
We begin by importing the necessary packages.
```python
import sys
import gym
import numpy as np
from collections import defaultdict
from plot_utils import plot_blackjack_values, plot_policy
```
Use the code cell below to create an instance of the Blackjack environment.
```python
env = gym.make('Blackjack-v0')

print(env.observation_space)
print(env.action_space)
```
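For reference, the two print statements output Tuple(Discrete(32), Discrete(11), Discrete(2)) and Discrete(2) (the same output appears in the full listing at the end of this page). Each state is a 3-tuple holding the player's current sum, the dealer's face-up card, and whether or not the player has a usable ace, and the two actions are 0 (STICK) and 1 (HIT). The short sketch below is only an illustration, not part of the exercise; it unpacks one sampled state to make that structure concrete.

```python
# Illustration only: unpack a single sampled state to inspect its structure.
state = env.reset()                              # e.g. (14, 10, False)
player_sum, dealer_card, usable_ace = state
print(player_sum, dealer_card, usable_ace)
```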

Execute the code cell below to play Blackjack with a random policy.
(The code currently plays Blackjack three times – feel free to change this number, or to run the cell multiple times. The cell is designed for you to get some experience with the output that is returned as the agent interacts with the environment.)
```python
for i_episode in range(3):
    state = env.reset()
    while True:
        print(state)
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n') if reward > 0 else print('You lost :(\n')
            break
```

Part 1: MC Prediction
In this section, you will write your own implementation of MC prediction (for estimating the action-value function).
We will begin by investigating a policy where the player almost always sticks if the sum of her cards exceeds 18. In particular, she selects action STICK with 80% probability if the sum is greater than 18; and, if the sum is 18 or below, she selects action HIT with 80% probability. The function generate_episode_from_limit_stochastic samples an episode using this policy.

The function accepts as input:
- bj_env: This is an instance of OpenAI Gym's Blackjack environment.

It returns as output:
- episode: This is a list of (state, action, reward) tuples recording the agent's interaction with the environment over one episode.

```python
def generate_episode_from_limit_stochastic(bj_env):
    episode = []
    state = bj_env.reset()
    while True:
        # STICK (action 0) with 80% probability if the sum exceeds 18, otherwise HIT (action 1) with 80% probability
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode
```
Execute the code cell below to play Blackjack with the policy.
(The code currently plays Blackjack three times – feel free to change this number, or to run the cell multiple times. The cell is designed for you to gain some familiarity with the output of the generate_episode_from_limit_stochastic function.)
```python
for i in range(3):
    print(generate_episode_from_limit_stochastic(env))
```
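Each printed episode is a list of (state, action, reward) tuples, one per time step. The exact values change from run to run; the assignment below is purely illustrative of the format and is not output produced by the cell above.

```python
# Purely illustrative of the episode format; your sampled episodes will differ.
example_episode = [((13, 10, False), 1, 0.0), ((19, 10, False), 0, 1.0)]
```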

Now, you are ready to write your own implementation of MC prediction. Feel free to implement either first-visit or every-visit MC prediction; in the case of the Blackjack environment, the techniques are equivalent.
Your algorithm has four arguments:
- env: This is an instance of an OpenAI Gym environment.
- num_episodes: This is the number of episodes that are generated through agent-environment interaction.
- generate_episode: This is a function that returns an episode of interaction.
- gamma: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: 1).

The algorithm returns as output:
- Q: This is a dictionary (of one-dimensional arrays) where Q[s][a] is the estimated action value corresponding to state s and action a.
```python
def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        ## TODO: complete the function
    return Q
```
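If you would like a reference point before opening the solutions notebook, one possible completion of the TODO is sketched below. This is only a sketch of every-visit MC prediction; the name mc_prediction_q_sketch is mine, and this is not necessarily the approach taken in Monte_Carlo_Solution.ipynb.

```python
# A sketch of every-visit MC prediction (one possible way to complete the TODO above).
import sys
import numpy as np
from collections import defaultdict

def mc_prediction_q_sketch(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes + 1):
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        # sample one episode and split it into states, actions, and rewards
        episode = generate_episode(env)
        states, actions, rewards = zip(*episode)
        # precompute the discount factors 1, gamma, gamma^2, ...
        discounts = np.array([gamma**i for i in range(len(rewards) + 1)])
        # every-visit update: average the discounted return following each (state, action) pair
        for i, (state, action) in enumerate(zip(states, actions)):
            returns_sum[state][action] += sum(rewards[i:] * discounts[:-(1 + i)])
            N[state][action] += 1.0
            Q[state][action] = returns_sum[state][action] / N[state][action]
    return Q
```

Because no state-action pair repeats within a Blackjack episode, this every-visit version coincides with the first-visit variant mentioned above.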
Use the cell below to obtain the action-value function estimate $Q$. We have also plotted the corresponding state-value function.
To check the accuracy of your implementation, compare the plot below to the corresponding plot in the solutions notebook Monte_Carlo_Solution.ipynb.
```python
# obtain the action-value function
Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic)

# obtain the corresponding state-value function
V_to_plot = dict((k, (k[0] > 18) * (np.dot([0.8, 0.2], v)) + (k[0] <= 18) * (np.dot([0.2, 0.8], v)))
                 for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V_to_plot)
```
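The dot products in the cell above come from the standard identity relating state values to action values under a stochastic policy. With $\pi(\text{STICK} \mid s) = 0.8$ whenever the player's sum exceeds 18 (and $0.2$ otherwise), the plotted quantity is

$$V_\pi(s) = \sum_{a} \pi(a \mid s)\, Q_\pi(s, a).$$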


Part 2: MC Control
In this section, you will write your own implementation of constant-𝛼 MC control.
Your algorithm has four arguments:
- env: This is an instance of an OpenAI Gym environment.
- num_episodes: This is the number of episodes that are generated through agent-environment interaction.
- alpha: This is the step-size parameter for the update step.
- gamma: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: 1).

The algorithm returns as output:
- Q: This is a dictionary (of one-dimensional arrays) where Q[s][a] is the estimated action value corresponding to state s and action a.
- policy: This is a dictionary where policy[s] returns the action that the agent chooses after observing state s.
(Feel free to define additional functions to help you to organize your code.)
```python
def mc_control(env, num_episodes, alpha, gamma=1.0):
    nA = env.action_space.n
    # initialize empty dictionary of arrays
    Q = defaultdict(lambda: np.zeros(nA))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        ## TODO: complete the function
    return policy, Q
```
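For reference, one way to complete the TODO is sketched below: an $\epsilon$-greedy episode generator combined with a constant-$\alpha$ every-visit update toward the observed return. The names generate_episode_from_Q_sketch and mc_control_sketch, as well as the eps_start, eps_decay, and eps_min parameters, are illustrative choices of mine; this is not necessarily how the solution notebook does it.

```python
# A sketch of constant-alpha MC control with an epsilon-greedy behaviour policy.
import sys
import numpy as np
from collections import defaultdict

def generate_episode_from_Q_sketch(env, Q, epsilon, nA):
    """Sample one episode by acting epsilon-greedily with respect to Q."""
    episode = []
    state = env.reset()
    while True:
        if state in Q:
            # epsilon-greedy action probabilities for a previously visited state
            probs = np.ones(nA) * epsilon / nA
            probs[np.argmax(Q[state])] += 1 - epsilon
            action = np.random.choice(np.arange(nA), p=probs)
        else:
            # unseen state: act uniformly at random
            action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

def mc_control_sketch(env, num_episodes, alpha, gamma=1.0,
                      eps_start=1.0, eps_decay=0.99999, eps_min=0.05):
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    epsilon = eps_start
    for i_episode in range(1, num_episodes + 1):
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        # decay epsilon, keeping a floor so the agent never stops exploring entirely
        epsilon = max(epsilon * eps_decay, eps_min)
        episode = generate_episode_from_Q_sketch(env, Q, epsilon, nA)
        states, actions, rewards = zip(*episode)
        discounts = np.array([gamma**i for i in range(len(rewards) + 1)])
        # constant-alpha every-visit update toward the observed return G
        for i, (state, action) in enumerate(zip(states, actions)):
            G = sum(rewards[i:] * discounts[:-(1 + i)])
            Q[state][action] += alpha * (G - Q[state][action])
    # derive the deterministic greedy policy from the final Q estimate
    policy = dict((s, np.argmax(a)) for s, a in Q.items())
    return policy, Q
```

If you fold an $\epsilon$ schedule like this into your own mc_control with fixed defaults, the four-argument signature described above is preserved.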
Use the cell below to obtain the estimated optimal policy and action-value function. Note that you should fill in your own values for the num_episodes and alpha parameters.
```python
# obtain the estimated optimal policy and action-value function
policy, Q = mc_control(env, ?, ?)
```
Sample output: Episode 500000/500000.
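The progress line above corresponds to a run of 500,000 episodes. If you want a concrete starting point, the call below uses that episode count together with a small constant step size; the value 0.02 is purely an illustrative suggestion, not a setting prescribed by the starter code.

```python
# Illustrative settings only: 500000 episodes (matching the progress output above)
# and a small constant step size chosen as a starting point for experimentation.
policy, Q = mc_control(env, 500000, 0.02)
```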
Next, we plot the corresponding state-value function.
```python
# obtain the corresponding state-value function
V = dict((k, np.max(v)) for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V)
```
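The cell above relies on the fact that, for a policy that acts greedily with respect to $Q$, the state value is the maximum action value: $V(s) = \max_a Q(s, a)$. That is exactly what np.max(v) computes for each state.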


Finally, we visualize the policy that is estimated to be optimal.
```python
# plot the policy
plot_policy(policy)
```

The true optimal policy $\pi_*$ can be found in Figure 5.2 of the textbook (and appears below). Compare your final estimate to the optimal policy – how close are you able to get? If you are not happy with the performance of your algorithm, take the time to tweak the decay rate of $\epsilon$, change the value of $\alpha$, and/or run the algorithm for more episodes to attain better results.

Full Code:
```python
import sys
import gym
import numpy as np
from collections import defaultdict
from plot_utils import plot_blackjack_values, plot_policy

env = gym.make('Blackjack-v0')

print(env.observation_space)
print(env.action_space)
"""
Tuple(Discrete(32), Discrete(11), Discrete(2))
Discrete(2)
"""

for i_episode in range(3):
    state = env.reset()
    while True:
        print(state)
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n') if reward > 0 else print('You lost :(\n')
            break

def generate_episode_from_limit_stochastic(bj_env):
    episode = []
    state = bj_env.reset()
    while True:
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

for i in range(3):
    print(generate_episode_from_limit_stochastic(env))

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        ## TODO: complete the function
    return Q

# obtain the action-value function
Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic)

# obtain the corresponding state-value function
V_to_plot = dict((k, (k[0] > 18) * (np.dot([0.8, 0.2], v)) + (k[0] <= 18) * (np.dot([0.2, 0.8], v)))
                 for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V_to_plot)

def mc_control(env, num_episodes, alpha, gamma=1.0):
    nA = env.action_space.n
    # initialize empty dictionary of arrays
    Q = defaultdict(lambda: np.zeros(nA))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        ## TODO: complete the function
    return policy, Q

# obtain the estimated optimal policy and action-value function
policy, Q = mc_control(env, ?, ?)

# obtain the corresponding state-value function
V = dict((k, np.max(v)) for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V)

# plot the policy
plot_policy(policy)
```