7 – TD Control: Expected Sarsa

So far, you’ve implemented Sarsa and Sarsamax, and we’ll now discuss one more option. This new option is called Expected Sarsa, and it closely resembles Sarsamax; the only difference is in the update step for the action value. Remember that Sarsamax, or Q-learning, took the maximum over all actions of all possible next … Read more
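
For concreteness, here is a minimal sketch of the update described above, assuming a tabular Q stored as a NumPy array (one row per state, one column per action) and an epsilon-greedy policy over the next state; the function name and arguments are illustrative, not taken from the lesson code:

```python
import numpy as np

def expected_sarsa_update(Q, state, action, reward, next_state, alpha, gamma, eps):
    """Expected Sarsa: instead of the max over next-state action values (Sarsamax),
    use the expected value of Q[next_state] under the epsilon-greedy policy."""
    nA = Q.shape[1]
    probs = np.ones(nA) * eps / nA                 # every action gets at least eps / nA
    probs[np.argmax(Q[next_state])] += 1 - eps     # the greedy action gets the rest
    expected_value = np.dot(probs, Q[next_state])
    Q[state][action] += alpha * (reward + gamma * expected_value - Q[state][action])
    return Q
```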

6 – TD Control: Sarsamax

So far, you already have one algorithm for temporal difference control. Remember that in the Sarsa algorithm, we begin by initializing all action values to zero and constructing the corresponding epsilon-greedy policy. Then, the agent begins interacting with the environment and receives the first state. Next, it uses the policy to choose its action. … Read more
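
As a rough sketch of the Sarsamax update step (assuming, as in the other sketches here, a tabular Q indexed by state and action; the helper name is illustrative):

```python
import numpy as np

def sarsamax_update(Q, state, action, reward, next_state, alpha, gamma):
    """Sarsamax (Q-learning): bootstrap from the maximum action value
    in the next state, regardless of which action the policy actually takes."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
    return Q
```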

5 – TD Control Sarsa Part 2

We began this lesson by reviewing Monte Carlo control. Remember that this was the corresponding update equation. In order to use it, we sample a complete episode. Then, we look up the current estimate in the Q-table and compare it to the return that we actually experienced after visiting the state-action pair. We use … Read more

4 – TD Control Sarsa Part 1

In this video, we’ll discuss an algorithm that doesn’t need us to complete an entire episode before updating the Q-table. Instead, we’ll update the Q-table at the same time as the episode is unfolding. In particular, we’ll only need this very small time window of information to do an update, and so here’s the idea. … Read more
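
The idea can be sketched in a few lines, assuming a tabular Q; only the small window (state, action, reward, next state, next action) is needed, which is what lets us update mid-episode:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    """Sarsa update: uses only the (S, A, R, S', A') window,
    so the Q-table can be updated while the episode is still unfolding."""
    target = reward + gamma * Q[next_state][next_action]
    Q[state][action] += alpha * (target - Q[state][action])
    return Q
```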

3 – Quiz: MC Control Methods

In this lesson, we’ll draft several new algorithms to solve the reinforcement learning problem. We’ll begin by reviewing how Monte Carlo control works using our small grid world example. Remember that we keep track of a Q-table; for each state-action pair, it contains the return that we expect to get. To update the Q-table … Read more

2 – L602 Gridworld Example RENDER V2-2

To illustrate the algorithms we’ll discuss in this lesson, it’ll help to work with a small example of a reinforcement learning task. So, say we have an agent in a world with only four possible states, here marked by stone, brick, wood, or grass. Say that at the beginning of an episode, the agent … Read more

1 – Introduction

In this lesson, you will learn about Temporal Difference or TD learning. In order to understand TD learning, it will help to discuss what exactly it would mean to solve this problem of learning from interaction. The solution will come many years into the future, when we’ve developed artificially intelligent agents that interact with the … Read more

9 – L612 Epsilon Greedy Policies RENDER V4

So, in general, when the agent is interacting with the environment, and still trying to figure out what works and what doesn’t in its quest to collect as much reward as possible, greedy policies are quite dangerous. To see this, let’s look at an example. Say you’re an agent, and there are two doors in … Read more
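
A minimal sketch of epsilon-greedy action selection, assuming the action values for the current state are held in a NumPy array Q_s (the names are illustrative, not from the lesson code):

```python
import numpy as np

def epsilon_greedy(Q_s, eps):
    """With probability eps pick any action at random (explore);
    otherwise pick the action with the highest current estimate (exploit)."""
    if np.random.random() < eps:
        return np.random.randint(len(Q_s))
    return int(np.argmax(Q_s))
```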

8 – L611 Greedy Policies RENDER V4

So far, you’ve learned how an agent can take a policy like the equiprobable random policy, use it to interact with the environment, and then use that experience to populate the corresponding Q-table; this Q-table is an estimate of that policy’s action-value function. So, now the question is how can we use … Read more
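
As a small illustration (the numbers below are made up, not from the lesson), constructing the greedy policy from an estimated Q-table is just a per-state argmax:

```python
import numpy as np

# A hypothetical Q-table: one row per state, one column per action.
Q = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.0, 4.0],
              [2.5, 2.4]])

# The greedy policy picks, in each state, the action with the highest estimated value.
greedy_policy = np.argmax(Q, axis=1)
print(greedy_policy)   # [1 0 1 0]
```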

7 – MC Prediction – Solution Walkthrough

Hello. Once you finish writing your implementation, you can click on Jupyter in the top left corner. That’ll bring you to a list of files on the left here. You can open the Monte_Carlo_Solution.ipynb notebook to see one way that we solved the exercise. So, what I did first was I restarted the kernel and cleared … Read more

6 – L606 MC Prediction Part 3 RENDERv1 V4

Before you read the pseudocode, there’s a special case we have to discuss. What if, in the same episode, we select the same action from a state multiple times? For instance, say that at time step two, we select action down from state three, and say we do the same thing at time step … Read more
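
One way to handle this special case, sketched here under the usual first-visit convention (the episode format, a list of (state, action, reward) tuples, is an assumption of this sketch):

```python
from collections import defaultdict

def first_visit_returns(episode, gamma=1.0):
    """Return following the FIRST occurrence of each (state, action) pair;
    later repeat visits within the same episode are ignored."""
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)       # remember only the earliest time step
    returns = defaultdict(float)
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G                       # discounted return from time step t
        if first_visit[(s, a)] == t:            # record it only for the first visit
            returns[(s, a)] = G
    return returns
```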

5 – L605 MC Prediction Part 2 RENDER V3

So far, we’ve informally discussed how we might take a bad policy, like the equiprobable random policy, use it to collect some episodes, and then use those episodes to come up with a better policy. Central to this idea, we build a table that stores the return obtained from visiting each state-action pair and … Read more
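
That table can be sketched roughly like this, assuming episodes arrive as lists of (state, action, reward) tuples (a sketch of the idea, not the notebook’s exact code):

```python
from collections import defaultdict
import numpy as np

# Collect every return observed after visiting each (state, action) pair.
returns_table = defaultdict(list)

def record_returns(episode, gamma=1.0):
    """Walk one episode backwards, accumulating the discounted return,
    and append it for each visited (state, action) pair."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        returns_table[(state, action)].append(G)

def estimate_q():
    """Average the collected returns to estimate the action-value function."""
    return {pair: np.mean(rets) for pair, rets in returns_table.items()}
```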

4 – L604 MC Prediction Part 1 RENDER V2

So far we’ve been working with a simple Grid World example with four states. We assume the agent used the equiprobable random policy to interact with the environment. The agent collected two episodes and now the question is, how exactly should the agent consolidate this information towards its goal of obtaining the optimal policy? Well … Read more

3 – L603 Monte Carlo Methods RENDER V3-2

So, we’re working with a small grid world example, with an agent who would like to make it all the way to the state in the bottom right corner as quickly as possible. So, how should the agent begin if it initially knows nothing about the environment? Well, probably, the most sensible thing for the … Read more
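
That first, sensible strategy of simply trying actions at random can be sketched as follows, assuming an OpenAI-Gym-style environment with the older (observation, reward, done, info) step signature; the function name is illustrative:

```python
import numpy as np

def generate_episode_random(env):
    """Collect one episode with the equiprobable random policy:
    every available action is chosen with the same probability."""
    episode = []
    state = env.reset()
    while True:
        action = np.random.randint(env.action_space.n)
        next_state, reward, done, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            return episode
```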

2 – L602 Gridworld Example RENDER V2-2

To illustrate the algorithms we’ll discuss in this lesson, it’ll help to work with a small example of a reinforcement learning task. So, say we have an agent in a world with only four possible states, here marked by stone, brick, wood, or grass. Say that at the beginning of an episode, the agent … Read more

13 – M1 L6 S2 V1

So, at this point, you’re about to implement, or you just implemented, your first RL method that can help an agent recover the optimal policy for an environment. You should be proud of all of your hard work, and it’s about to pay off. Specifically, we’ll implement Constant-Alpha MC Control, and we’ll also make sure that … Read more

12 – L617 Constant Alpha Edits RENDER V1

To understand how to set the value of alpha, we’ll look closely at the update equation. Note that alpha must be set to a number between zero and one. When alpha is set to one, the new estimate is just the most recent return, where we completely ignore and replace the value in the Q-table … Read more
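
A tiny worked example of that behaviour (the numbers are made up for illustration):

```python
def constant_alpha_update(old_estimate, sampled_return, alpha):
    """Q <- Q + alpha * (G - Q), equivalently (1 - alpha) * Q + alpha * G."""
    return old_estimate + alpha * (sampled_return - old_estimate)

print(constant_alpha_update(2.0, 8.0, alpha=1.0))   # 8.0: old estimate fully replaced
print(constant_alpha_update(2.0, 8.0, alpha=0.0))   # 2.0: new return completely ignored
print(constant_alpha_update(2.0, 8.0, alpha=0.1))   # 2.6: small step toward the return
```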

11 – MC Control: Constant-alpha

Currently your update step for Policy Evaluation looks a bit like this. You generate an episode, then for each state-action pair that was visited, you calculate the corresponding return that follows. Then, you use that return to get an updated estimate. We’re going to look at this update step a bit closer with the aim … Read more
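
That update step might look roughly like this as code, assuming the episode is a list of (state, action, reward) tuples and Q is a dictionary keyed by (state, action); the constant-alpha variant that gives this video its title is included alongside the running-mean version for comparison:

```python
from collections import defaultdict

Q = defaultdict(float)    # action-value estimates
N = defaultdict(int)      # visit counts for the running-mean version

def evaluate_episode(episode, gamma=1.0, alpha=None):
    """One policy-evaluation pass: walk the episode backwards, accumulate the
    return G, and move Q[(state, action)] toward it."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        if alpha is None:                                    # running mean
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
        else:                                                # constant-alpha step
            Q[(state, action)] += alpha * (G - Q[(state, action)])
```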

10 – L615 Incremental Mean RENDER V4

Say the agent interacts with the environment for four episodes. Then say we focus on one state-action pair in particular, which was visited in each episode, and we record the return obtained by visiting that pair in each episode. So in episode one, after the pair was visited, the agent got a return of … Read more
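
As a small numeric sketch of the incremental mean (the four returns below are made up for illustration):

```python
# Hypothetical returns recorded for one state-action pair over four episodes.
returns = [2.0, 8.0, 4.0, 6.0]

Q, N = 0.0, 0
for G in returns:
    N += 1
    Q += (G - Q) / N       # incremental mean: no need to store all earlier returns
    print(N, round(Q, 3))  # after the fourth episode, Q == 5.0, the ordinary average
```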

1 – L601 Intro RENDER V2

Let’s recall the problem at hand. We have an agent and an environment. Time is broken into discrete time steps, and at every time step, the agent receives a reward and state from the environment, and chooses an action to perform in response. In this way, the interaction involves a sequence of states, actions, and … Read more