# Workspace – Introduction

You will write all of your implementations within the classroom, using an interface identical to the one shown below. Your Workspace contains the following files (among others):

• Monte_Carlo.ipynb – the Jupyter notebook where you will write all of your implementations (this is the only file that you will modify!)
• Monte_Carlo_Solution.ipynb – the corresponding instructor solutions
• plot_utils.py – contains a plotting function for visualizing state-value functions and policies

The Monte_Carlo.ipynb notebook can be found below. To peruse the other files, you need only click on “jupyter” in the top left corner to return to the Notebook dashboard.

## The Workspace

Please do not write or execute any code just yet – for now, you’re encouraged to simply explore the file structure. In particular, make sure you know where to find the solution notebook. We’ll get started with coding within the Workspace soon!

# Monte Carlo Methods

In this notebook, you will write your own implementations of many Monte Carlo (MC) algorithms.

While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.

### Part 0: Explore BlackjackEnv

We begin by importing the necessary packages.

import sys
import gym
import numpy as np
from collections import defaultdict

from plot_utils import plot_blackjack_values, plot_policy

Use the code cell below to create an instance of the Blackjack environment.

env = gym.make('Blackjack-v0')
print(env.observation_space)
print(env.action_space)

Execute the code cell below to play Blackjack with a random policy.

(The code currently plays Blackjack three times – feel free to change this number, or to run the cell multiple times. The cell is designed for you to get some experience with the output that is returned as the agent interacts with the environment.)

for i_episode in range(3):
    state = env.reset()
    while True:
        print(state)
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n') if reward > 0 else print('You lost :(\n')
            break


### Part 1: MC Prediction

In this section, you will write your own implementation of MC prediction (for estimating the action-value function).

We will begin by investigating a policy where the player almost always sticks if the sum of her cards exceeds 18. In particular, she selects action STICK with 80% probability if the sum is greater than 18; and, if the sum is 18 or below, she selects action HIT with 80% probability. The function generate_episode_from_limit_stochastic samples an episode using this policy.

The function accepts as input:

• bj_env: This is an instance of OpenAI Gym’s Blackjack environment.

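It returns as output:

• episode: This is a list of (state, action, reward) tuples, one per time step, ending when the game is over.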

def generate_episode_from_limit_stochastic(bj_env):
    episode = []
    state = bj_env.reset()
    while True:
        # action 0 is STICK, action 1 is HIT
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

Execute the code cell below to play Blackjack with the policy.

(The code currently plays Blackjack three times – feel free to change this number, or to run the cell multiple times. The cell is designed for you to gain some familiarity with the output of the generate_episode_from_limit_stochastic function.)

for i in range(3):
    print(generate_episode_from_limit_stochastic(env))


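Now, you are ready to write your own implementation of MC prediction. Feel free to implement either first-visit or every-visit MC prediction; in the case of the Blackjack environment, the techniques are equivalent, because the same state never appears twice within a single episode.

Your algorithm accepts four arguments: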

• env: This is an instance of an OpenAI Gym environment.
• num_episodes: This is the number of episodes that are generated through agent-environment interaction.
• generate_episode: This is a function that returns an episode of interaction.
• gamma: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: 1).

The algorithm returns as output:

• Q: This is a dictionary (of one-dimensional arrays) where Q[s][a] is the estimated action value corresponding to state s and action a.

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()

        ## TODO: complete the function

    return Q
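
One possible way to approach the TODO is every-visit MC prediction: for each (state, action) pair in an episode, accumulate the discounted return that follows it, then average over visits. The sketch below is not the instructor solution. It runs on a hand-written toy episode so it needs no Gym environment, and the helper name `update_q_from_episode` and the toy values are assumptions made for illustration.

```python
from collections import defaultdict
import numpy as np

def update_q_from_episode(episode, returns_sum, N, Q, gamma=1.0):
    """Every-visit MC update from one episode of (state, action, reward) tuples.

    Hypothetical helper, not part of the notebook's starter code.
    """
    states, actions, rewards = zip(*episode)
    # discounts[k] = gamma**k
    discounts = np.array([gamma ** k for k in range(len(rewards) + 1)])
    for i, (state, action) in enumerate(zip(states, actions)):
        # Return following time step i: G_i = sum_k gamma^k * R_{i+k+1}
        G = np.sum(np.array(rewards[i:]) * discounts[:len(rewards) - i])
        returns_sum[state][action] += G
        N[state][action] += 1.0
        Q[state][action] = returns_sum[state][action] / N[state][action]

nA = 2  # assumption: two actions, 0 = STICK and 1 = HIT
returns_sum = defaultdict(lambda: np.zeros(nA))
N = defaultdict(lambda: np.zeros(nA))
Q = defaultdict(lambda: np.zeros(nA))

# Made-up episode: HIT on a sum of 13, then STICK on 19 and win.
toy_episode = [((13, 10, False), 1, 0.0), ((19, 10, False), 0, 1.0)]
update_q_from_episode(toy_episode, returns_sum, N, Q, gamma=0.9)
print(Q[(13, 10, False)][1])  # discounted return 0.9 * 1.0
print(Q[(19, 10, False)][0])  # undiscounted final reward
```

Inside `mc_prediction_q`, the same update would be applied once per generated episode, using the `returns_sum`, `N`, and `Q` dictionaries that the starter code already initializes.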

Use the cell below to obtain the action-value function estimate $Q$. We have also plotted the corresponding state-value function.

To check the accuracy of your implementation, compare the plot below to the corresponding plot in the solutions notebook Monte_Carlo_Solution.ipynb.

# obtain the action-value function
Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic)

# obtain the corresponding state-value function
V_to_plot = dict((k,(k[0]>18)*(np.dot([0.8, 0.2],v)) + (k[0]<=18)*(np.dot([0.2, 0.8],v))) \
for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V_to_plot)
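
The dictionary comprehension above applies the identity $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$ with the 80/20 policy from Part 1. As a minimal sketch, here is the same computation for a single state; the function name and the Q values are made up for illustration.

```python
import numpy as np

def state_value_under_limit_policy(state, q_values):
    """V(s) = sum_a pi(a|s) * Q(s, a) for the 80/20 limit policy (hypothetical helper)."""
    # Index 0 is STICK and index 1 is HIT, matching the probs list used to sample actions.
    probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
    return float(np.dot(probs, q_values))

q_example = np.array([0.5, -0.5])  # made-up action values, for illustration only
v_high = state_value_under_limit_policy((20, 10, False), q_example)
v_low = state_value_under_limit_policy((13, 10, False), q_example)
print(v_high, v_low)
```

For a sum above 18 the STICK value dominates the weighted average; for a sum of 18 or below the HIT value does.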


### Part 2: MC Control

In this section, you will write your own implementation of constant-α MC control.

Your algorithm accepts four arguments:

• env: This is an instance of an OpenAI Gym environment.
• num_episodes: This is the number of episodes that are generated through agent-environment interaction.
• alpha: This is the step-size parameter for the update step.
• gamma: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: 1).

The algorithm returns as output:

• Q: This is a dictionary (of one-dimensional arrays) where Q[s][a] is the estimated action value corresponding to state s and action a.
• policy: This is a dictionary where policy[s] returns the action that the agent chooses after observing state s.

def mc_control(env, num_episodes, alpha, gamma=1.0):
    nA = env.action_space.n
    # initialize empty dictionary of arrays
    Q = defaultdict(lambda: np.zeros(nA))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()

        ## TODO: complete the function

    return policy, Q
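
Constant-α MC control typically combines two pieces: an ε-greedy policy derived from the current Q, and an update that moves each visited Q[s][a] a fraction α toward the observed return. The sketch below shows both on a hand-written toy episode. It is not the instructor solution, and the helper names `epsilon_greedy_probs` and `constant_alpha_update` are hypothetical.

```python
from collections import defaultdict
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy over q_values."""
    nA = len(q_values)
    probs = np.ones(nA) * epsilon / nA
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def constant_alpha_update(Q, episode, alpha, gamma=1.0):
    """Move each visited Q[s][a] a fraction alpha toward the observed return."""
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each step.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        Q[state][action] += alpha * (G - Q[state][action])
    return Q

Q = defaultdict(lambda: np.zeros(2))
# Made-up episode: HIT (1) on a sum of 13, then STICK (0) on 19 and win.
toy_episode = [((13, 10, False), 1, 0.0), ((19, 10, False), 0, 1.0)]
constant_alpha_update(Q, toy_episode, alpha=0.1)

probs = epsilon_greedy_probs(Q[(19, 10, False)], epsilon=0.2)
# Greedy policy read off from Q: policy[s] is the action with the highest estimate.
policy = {s: int(np.argmax(q)) for s, q in Q.items()}
print(probs, policy)
```

Within `mc_control`, an episode would be sampled with `np.random.choice(np.arange(nA), p=epsilon_greedy_probs(Q[state], epsilon))` at each step, and the greedy `policy` dictionary would be built once after the final episode.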

Use the cell below to obtain the estimated optimal policy and action-value function. Note that you should fill in your own values for the num_episodes and alpha parameters.

# obtain the estimated optimal policy and action-value function
policy, Q = mc_control(env, ?, ?)

Next, we plot the corresponding state-value function.

# obtain the corresponding state-value function
V = dict((k,np.max(v)) for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V)

Finally, we visualize the policy that is estimated to be optimal.

# plot the policy
plot_policy(policy)

The true optimal policy $\pi_*$ can be found in Figure 5.2 of the textbook. Compare your final estimate to the optimal policy – how close are you able to get? If you are not happy with the performance of your algorithm, take the time to tweak the decay rate of ϵ, change the value of α, and/or run the algorithm for more episodes to attain better results.
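
One common way to decay ϵ is geometrically, with a floor so the agent never stops exploring entirely. The sketch below is only one possible schedule; the parameter names `eps_start`, `eps_decay`, and `eps_min` and their default values are assumptions, not values taken from the notebook.

```python
def epsilon_schedule(num_episodes, eps_start=1.0, eps_decay=0.9999, eps_min=0.05):
    """Yield one epsilon per episode, decaying geometrically down to eps_min."""
    epsilon = eps_start
    for _ in range(num_episodes):
        yield epsilon
        epsilon = max(epsilon * eps_decay, eps_min)

# Exaggerated decay so the floor is visible after a few episodes.
eps = list(epsilon_schedule(5, eps_start=1.0, eps_decay=0.5, eps_min=0.2))
print(eps)
```

A slower decay keeps exploration alive for longer, which usually helps when many episodes are available; a faster decay commits to the current greedy choices sooner.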

Full Code:

import sys
import gym
import numpy as np
from collections import defaultdict

from plot_utils import plot_blackjack_values, plot_policy

env = gym.make('Blackjack-v0')

print(env.observation_space)
print(env.action_space)

"""
Tuple(Discrete(32), Discrete(11), Discrete(2))
Discrete(2)
"""

for i_episode in range(3):
    state = env.reset()
    while True:
        print(state)
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n') if reward > 0 else print('You lost :(\n')
            break

def generate_episode_from_limit_stochastic(bj_env):
    episode = []
    state = bj_env.reset()
    while True:
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

for i in range(3):
    print(generate_episode_from_limit_stochastic(env))

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()

        ## TODO: complete the function

    return Q

# obtain the action-value function
Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic)

# obtain the corresponding state-value function
V_to_plot = dict((k,(k[0]>18)*(np.dot([0.8, 0.2],v)) + (k[0]<=18)*(np.dot([0.2, 0.8],v))) \
for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V_to_plot)

def mc_control(env, num_episodes, alpha, gamma=1.0):
    nA = env.action_space.n
    # initialize empty dictionary of arrays
    Q = defaultdict(lambda: np.zeros(nA))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()

        ## TODO: complete the function

    return policy, Q

# obtain the estimated optimal policy and action-value function
policy, Q = mc_control(env, ?, ?)

# obtain the corresponding state-value function
V = dict((k,np.max(v)) for k, v in Q.items())

# plot the state-value function
plot_blackjack_values(V)

# plot the policy
plot_policy(policy)
