## 1-8-17. Coding Exercise

Please use the next concept to complete the following section of Monte_Carlo.ipynb: Part 2: MC Control. To reference the pseudocode while working on the notebook, you are encouraged to look at this sheet. Download the Exercise: If you would prefer to work on your own machine, you can download the exercise from the DRLND GitHub repository. …

## 1-8-16. Constant-alpha

In the video below, you will learn about another improvement that you can make to your Monte Carlo control algorithm. Here are some guiding principles that will help you to set the value of $\alpha$ when implementing constant-$\alpha$ MC control: you should always set the value for $\alpha$ to a number greater than zero and less than (or equal …
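The constant-$\alpha$ update nudges the current estimate toward the sampled return $G$ by a fixed step size: $Q(s,a) \leftarrow Q(s,a) + \alpha\,(G - Q(s,a))$. A minimal sketch (variable names are illustrative, not the notebook's exact ones):

```python
def update_q(q_sa, G, alpha):
    """One constant-alpha MC update of a single action-value estimate.

    q_sa  -- current estimate Q(s, a)
    G     -- sampled return following (s, a) in the latest episode
    alpha -- fixed step size in (0, 1]
    """
    return q_sa + alpha * (G - q_sa)

# With alpha = 1 the old estimate is discarded entirely in favor of G;
# with alpha = 0 the return G is ignored and the agent never learns,
# which is why alpha must be greater than zero.
new_q = update_q(2.0, 6.0, 0.1)  # 2.0 + 0.1 * (6.0 - 2.0) = 2.4
```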

## 1-8-15. Incremental Mean

In our current algorithm for Monte Carlo control, we collect a large number of episodes to build the Q-table (as an estimate for the action-value function corresponding to the agent’s current policy). Then, after the values in the Q-table have converged, we use the table to come up with an improved policy. Maybe …
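The idea behind the incremental mean is that the average of the first $k$ returns can be updated one return at a time, without storing the full history: $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$. A small illustrative sketch:

```python
def incremental_mean(returns):
    """Running mean computed one value at a time.

    Equivalent to sum(returns) / len(returns), but each new return
    updates the estimate in place -- no history needs to be stored.
    """
    mu, k = 0.0, 0
    for x in returns:
        k += 1
        mu += (x - mu) / k  # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
    return mu

incremental_mean([10.0, 4.0, 7.0])  # 7.0, same as the ordinary mean
```

Replacing the shrinking step size $\frac{1}{k}$ with a fixed $\alpha$ gives the constant-$\alpha$ variant discussed in the next concept.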

## 1-8-14. Exploration vs. Exploitation

(Image: the exploration-exploitation dilemma.) Solving Environments in OpenAI Gym: In many cases, we would like our reinforcement learning (RL) agents to learn to maximize reward as quickly as possible. This can be seen in many OpenAI Gym environments. For instance, the FrozenLake-v0 environment is considered solved once the agent attains an average reward of 0.78 …
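A "solved" criterion like this is typically checked over a trailing window of recent episodes (100 consecutive episodes is the usual Gym convention, assumed here since the snippet is truncated). A hypothetical sketch of the check:

```python
from collections import deque

def is_solved(episode_rewards, threshold=0.78, window=100):
    """Return True once the average reward over the last `window`
    episodes meets the threshold (0.78 for FrozenLake-v0).

    `episode_rewards` is the full list of per-episode rewards so far;
    the window size of 100 is an assumption based on Gym convention.
    """
    recent = deque(episode_rewards, maxlen=window)  # keep only the tail
    return len(recent) == window and sum(recent) / window >= threshold
```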

## 1-8-13. MC Control

The Road Ahead You now have a working algorithm for Monte Carlo control! So, what’s to come? In the next concept (Exploration vs. Exploitation), you will learn more about how to set the value of $\epsilon$ when constructing $\epsilon$-greedy policies in the policy improvement step. Then, you will learn about two improvements that you can make to the …

## 1-8-12. Epsilon-Greedy Policies

Correct! As long as $\epsilon > 0$, the agent has a nonzero probability of selecting any of the available actions.
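An $\epsilon$-greedy policy can be sketched as follows: with probability $\epsilon$ pick uniformly among all actions, otherwise pick the greedy action. This is an illustrative sketch, not the notebook's exact implementation:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Select an action from one row of the Q-table.

    q_values -- list of estimated returns, one per action, for the
                current state
    epsilon  -- exploration probability in [0, 1]
    """
    if random.random() < epsilon:
        # Explore: choose uniformly; every action (including the greedy
        # one) has probability at least epsilon / len(q_values).
        return random.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated return.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0` this collapses to the purely greedy policy; any `epsilon > 0` leaves every action with nonzero selection probability, which is exactly the point of the quiz above.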

## 1-8-11. Greedy Policies

Correct! For state 1, action 2 has the highest estimated return (2>1). For state 2, action 1 has the highest estimated return (4>3).
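Reading a greedy policy off a Q-table just means taking the argmax over actions in each state. A sketch using the quiz's numbers (the dict layout is illustrative, not the notebook's data structure):

```python
# Q-table mirroring the quiz: rows are states, columns are actions.
Q = {
    1: {1: 1.0, 2: 2.0},  # state 1: action 2 has the higher estimate (2 > 1)
    2: {1: 4.0, 2: 3.0},  # state 2: action 1 has the higher estimate (4 > 3)
}

def greedy_policy(Q):
    """Map each state to the action with the highest estimated return."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

greedy_policy(Q)  # {1: 2, 2: 1}
```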

## 1-8-9. Coding Exercise

Please use the next concept to complete the following sections of Monte_Carlo.ipynb: Part 0: Explore BlackjackEnv and Part 1: MC Prediction. To reference the pseudocode while working on the notebook, you are encouraged to look at this sheet. Important Note: Please do not complete the entire notebook in the next concept – you should only complete Part 0 and Part 1. …

## 1-8-8. Workspace – Introduction

You will write all of your implementations within the classroom, using an interface identical to the one shown below. Your Workspace contains the following files (among others):

- Monte_Carlo.ipynb – the Jupyter notebook where you will write all of your implementations (this is the only file that you will modify!)
- Monte_Carlo_Solution.ipynb – the corresponding instructor solutions
- plot_utils.py – …

## 1-8-7. OpenAI Gym: BlackJackEnv

In order to master the algorithms discussed in this lesson, you will write code to teach an agent to play Blackjack. (Image: playing cards.) Please read about the game of Blackjack in Example 5.1 of the textbook. When you have finished, please review the corresponding GitHub file, by reading the commented block in the …

## 1-8-6. MC Prediction – Part 3

So far in this lesson, we have discussed how the agent can take a bad policy, like the equiprobable random policy, use it to collect some episodes, and then consolidate the results to arrive at a better policy. In the video in the previous concept, you saw that estimating the action-value function with …
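The consolidation step described above can be sketched as first-visit MC prediction: for each $(s, a)$ pair, average the return that follows its first occurrence in each episode. This is a hypothetical sketch under assumed data shapes, not the notebook's exact code:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """First-visit MC estimate of the action-value function.

    episodes -- list of trajectories, each a list of (state, action,
                reward) tuples (an assumed layout for illustration)
    gamma    -- discount rate
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Index of the first visit to each (state, action) pair.
        first_visit = {}
        for i, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), i)
        # Discounted return following each time step, computed backwards.
        G = 0.0
        returns_after = [0.0] * len(episode)
        for i in reversed(range(len(episode))):
            G = episode[i][2] + gamma * G
            returns_after[i] = G
        # Accumulate the return following each pair's first visit.
        for (s, a), i in first_visit.items():
            returns_sum[(s, a)] += returns_after[i]
            returns_count[(s, a)] += 1
    # Q-table: average return per (state, action) pair.
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```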

## 1-8-4. MC Prediction – Part 1

Important Note: In this video, we demonstrated a toy example where the agent collected two episodes, consolidated the information in a table, and then used the table to come up with a better policy. However, as discussed in the previous video, in real-world settings (and even for the toy example depicted here!), the agent will …

## 1-8-1. Review

Review Your Notes: In the lesson The RL Framework: The Problem, you learned how to take a real-world problem and specify it in the language of reinforcement learning. In order to rigorously define a reinforcement learning task, we generally use a Markov Decision Process (MDP) to model the environment. The MDP specifies the rules that the environment …

## 9 – L612 Epsilon Greedy Policies

So, in general, when the agent is interacting with the environment, and still trying to figure out what works and what doesn’t in its quest to collect as much reward as possible, greedy policies are quite dangerous. To see this, let’s look at an example. Say you’re an agent, and there are two doors in …

## 8 – L611 Greedy Policies

So far, you’ve learned how an agent can take a policy like the equiprobable random policy, use that to interact with the environment, and then use that experience to populate the corresponding Q-table, and this Q-table is an estimate of that policy’s action-value function. So, now the question is how can we use …

## 7 – MC Prediction – Solution Walkthrough

Hello. Once you finish writing your implementation, you can click on Jupyter in the top left corner. That’ll bring you to a list of files on the left here. You can open the Monte_Carlo_Solution.ipynb notebook to see one way that we solved the exercise. So, what I did first was I restarted the kernel and cleared …