Hey guys. In this video I want to do a code walkthrough of a very nice implementation of the A2C algorithm by Shangtong Zhang. He's a student at the University of Alberta, a student of Professor Richard Sutton, and he's actually the person who ported all the code for the second edition of the Reinforcement Learning: An Introduction book from Lisp to Python. After that, he started this repository of deep reinforcement learning algorithms: a very, very modular implementation of deep RL algorithms in PyTorch. To begin with, I'm just going to show you how to navigate his repository. On the main page you obviously have the README file, with what to install and things like that, and an overview of the algorithms he's implemented: DQN, C51, A2C, N-Step, DDPG and so on. If you open the examples.py file in another tab, you'll see that he creates examples such as DQN CartPole, that is, the DQN algorithm solving the CartPole environment, DQN solving the PixelAtari environment, DQN solving the RamAtari environment, and so on, many different algorithms. The code is very modular, very nice. Here, for example, you can see the function approximator he's using: a VanillaNet with a fully-connected body. Because this is CartPole, the input state is four variables (cart position, cart velocity, pole angle, and pole angular velocity), so the body of the network does not need to be a convolutional neural network; it's just a regular fully-connected network. I'll show you what those look like in a minute. In DQN PixelAtari, by contrast, he's using a VanillaNet on top of a NatureConvBody.
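To make that body/head modularity concrete, here is a minimal sketch of the split; the class and attribute names only approximate the repo's actual API, so treat this as illustration rather than the real code:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the body/head split: a "body" maps the raw
# state to a feature vector, and a VanillaNet-style "head" maps those
# features to the outputs (e.g. Q-values). Names are illustrative.
class FCBody(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.fc = nn.Linear(state_dim, hidden)
        self.feature_dim = hidden  # heads read this to size themselves

    def forward(self, x):
        return torch.relu(self.fc(x))

class VanillaNet(nn.Module):
    def __init__(self, output_dim, body):
        super().__init__()
        self.body = body
        self.head = nn.Linear(body.feature_dim, output_dim)

    def forward(self, x):
        return self.head(self.body(x))

# CartPole: 4 state variables, 2 actions. Swapping FCBody for a conv
# body is all it would take to handle pixel observations instead.
net = VanillaNet(2, FCBody(4))
q_values = net(torch.zeros(1, 4))  # one Q-value per action
```

The point of the pattern is that the head never needs to know whether its features came from a fully-connected or a convolutional body.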
So the ConvBody is basically the convolutional neural network introduced in the Nature paper by Mnih and colleagues. You can see how nice and modular this is: he can reuse the same VanillaNet with a fully-connected body, a convolutional body, and so on. The one we're going to be looking at is down here: A2C. There are several variants: A2C for CartPole, A2C for PixelAtari, A2C for continuous control, and so on. I'm going to walk through most of this code, not all of it, but at least enough to get you started. Here you can see he's using CartPole. He creates a log directory with the classic-control name and so on; that's just for keeping track of things. If you just watched the A2C lecture, you know that actor-critic methods like A3C and A2C use multiple worker agents, parallel environments and agents gathering experience. So you see here the number of workers is 5; in this other one the number of workers is 16, just to give you an idea. You can also see how he creates a ParallelizedTask, which spawns that whole bunch of workers with this task function, a classic-control game, and passes it the log directory and so on. Very nice, very modular code. The optimizer he's using is Adam, and the network is a CategoricalActorCriticNet with a fully-connected body, because again, this is CartPole so we don't need anything more than that. If we were using Atari, for example, we would use a NatureConvBody and the same CategoricalActorCriticNet. Later, I'm going to talk about Generalized Advantage Estimation, GAE. Here he sets it to False, but you can set it to True, set this tau parameter, and then you'll be using that other way of estimating the return.
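Pulling the hyperparameters mentioned so far into one place, here is an illustrative summary as a plain dict; this is not the repo's actual Config class, and the discount and tau values are assumed typical defaults rather than confirmed from the code:

```python
# Illustrative summary of the A2C CartPole example's configuration.
# 'discount' and 'gae_tau' are assumed typical values, not verified.
a2c_cartpole_config = {
    'num_workers': 5,        # parallel environments (16 in the Atari example)
    'rollout_length': 5,     # n-step bootstrapping horizon
    'discount': 0.99,        # assumed typical default
    'use_gae': False,        # flip to True for Generalized Advantage Estimation
    'gae_tau': 0.95,         # only matters when use_gae is True (assumed value)
    'optimizer': 'Adam',
    'network': 'CategoricalActorCriticNet + fully-connected body',
}
```

The same skeleton covers the Atari variant by swapping in a conv body and raising the worker count.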
However, you probably remember from previous lectures that A2C uses n-step bootstrapping, and here you can see the rollout length is 5. This is basically saying that he's going to go five steps and then back the value up, so it's five-step bootstrapping. That's the configuration. Now let's take a quick look at a couple of things. First, the ParallelizedTask. We go to the task file in the component folder, scroll all the way to the bottom, and there is ParallelizedTask. If it's a single process it does basically nothing, it's just a dummy wrapper, but if you have more processes it creates one task per worker. In that case it calls this class, which we'll look at in a minute, but for now note the state dimension, the action dimension, and so on. This is somewhat of a wrapper around OpenAI Gym, but nevertheless it's interesting code; try to dig into it. ProcessTask is a little further up. You can see that ProcessTask uses the mp library, which is Python's multiprocessing. He creates a Pipe and then has a bunch of workers communicate, sending information back and forth. This is what enables the multi-worker agents. So that's the ParallelizedTask. Now we'll move on to the network. You can see here we have a fully-connected body and then the CategoricalActorCriticNet. In the network folder we have network bodies, network heads, and network utils. The bodies include the ConvBody with its forward function, a DDPG body which we'll talk about later, and so on.
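To give you a rough idea of what that Pipe-based pattern looks like, here is a stripped-down sketch; this is not the repo's actual ProcessTask, and the command protocol here is invented for illustration, with a counter standing in for a real environment:

```python
import multiprocessing as mp

# Illustrative sketch of the Pipe-based worker pattern: each worker
# process owns its own "environment" and answers 'step'/'close'
# commands sent over its end of the pipe.
def worker(conn):
    state = 0  # stand-in for a real environment's state
    while True:
        cmd, arg = conn.recv()
        if cmd == 'step':
            state += arg        # stand-in for env.step(action)
            conn.send(state)
        elif cmd == 'close':
            conn.close()
            return

def run_workers(num_workers):
    pipes, procs = [], []
    for _ in range(num_workers):
        parent_end, child_end = mp.Pipe()
        proc = mp.Process(target=worker, args=(child_end,))
        proc.start()
        pipes.append(parent_end)
        procs.append(proc)
    # Broadcast one action to every worker, then gather all results,
    # mirroring how the parallelized task steps all environments at once.
    for pipe in pipes:
        pipe.send(('step', 1))
    results = [pipe.recv() for pipe in pipes]
    for pipe, proc in zip(pipes, procs):
        pipe.send(('close', None))
        proc.join()
    return results

if __name__ == '__main__':
    print(run_workers(3))
```

The real code has to handle resets, terminal states, and observation passing, but the send/receive-over-a-pipe skeleton is the same.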
The one he's using is the fully-connected body, this one over here, which is pretty straightforward: just linear layers whose sizes depend on the state dimensions you pass in and the hidden units, (64, 64) by default. The gate, the activation function, is also set by default. So these are pretty straightforward feed-forward networks with their forward functions; regular stuff. We go back to the examples, and I'll close this tab. Here we have the fully-connected body, and the head on top of it is the CategoricalActorCriticNet. So we go to the heads file, and here we see a nice implementation; this is why I wanted to show you this one at this point. Hopefully you guys appreciate good code as well, not just simplicity, but also a very nice implementation. All right. You have the ActorCriticNet base class here, which we'll look at in a second. In the forward pass you can see how the observation gets converted into a tensor, passed through the body, and then split into two heads, the actor and the critic. You get the logits and the values from those, build a categorical distribution, take the log probability of the actions, and return the action, the log probability, the entropy of the distribution, and the output of the critic, the value function. We go up to ActorCriticNet, which is here, and you can see the actor body, the critic body, and the shared body that we initialized back in the config; the forward pass is all here. So, back to this: the actor head is basically a linear layer from the actor body's feature dimension to the action dimension, initialized with a particular function here that's initializing that layer.
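A condensed sketch of that forward pass looks like this; it's simplified (one shared body instead of separate actor/critic bodies, and the names only approximate the repo's):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Simplified sketch of the categorical actor-critic head: a shared body
# feeds two linear heads, the actor producing logits over actions and
# the critic a scalar state value.
class CategoricalActorCriticNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.fc_actor = nn.Linear(hidden, action_dim)
        self.fc_critic = nn.Linear(hidden, 1)

    def forward(self, obs):
        phi = self.body(obs)                  # shared features
        dist = Categorical(logits=self.fc_actor(phi))
        value = self.fc_critic(phi)           # critic's state value
        action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), value

net = CategoricalActorCriticNet(4, 2)         # CartPole: 4 state dims, 2 actions
obs = torch.zeros(5, 4)                       # a batch, e.g. one row per worker
action, log_prob, entropy, value = net(obs)
```

Those four return values are exactly what the A2C update needs: the action to execute, and the log probability, entropy, and value to build the losses from.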
Then for the critic, the output dimension is one, and it has that same initialization. So that's the network. The forward pass is here; you can study this a little more, and this is the body. The next thing to look at is the actual agent. The code that actually runs the agent is the A2C agent in the agent folder. There you can see how he grabs things from the config: here's the task function, here's the network function, here's the optimizer function with the network's parameters, and the total steps and everything initialized there. Then, depending on the rollout length, he collects that many steps of information, and the rewards are stored there. There's a special case for when you reach a terminal state before the rollout length is up, which has to be handled in a special way. Then there's some more processing for the rollout, including the pending value. I want to show you one thing here: when he passes the states through the network for the forward pass, here and here, you can see that the states get normalized first. If you want to know what that normalizer is, you can see it comes from the config, so go to the examples and look for a normalizer, if there's anything declared there; in this one, for example, I've seen that the normalizer is the image normalizer. There's probably a default as well, so you have to dig down into the config class to see what's being used. Now, remember that A2C actually uses advantages as the baseline. Here he calculates the values, and the TD errors are right here; this branch is the case where we're using Generalized Advantage Estimation. So it will make more sense after you watch the GAE videos; you'll probably understand what's happening in these few lines, which is pretty cool, so come back after the GAE videos and read this part over here.
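The backward accumulation over the stored rollout can be sketched roughly like this; it's a simplified illustration with plain Python numbers (the real code works on tensors and stores log probabilities and entropies alongside), showing both the plain n-step branch and the GAE branch:

```python
# Illustrative sketch of processing a stored rollout backwards.
# rewards: n per-step rewards; values: n+1 critic estimates, where
# values[n] is the bootstrap ("pending") value at the rollout's end.
def n_step_advantages(rewards, values, discount=0.99, use_gae=False, tau=0.95):
    n = len(rewards)
    returns = values[n]                 # bootstrap from the critic
    gae = 0.0
    rets, advs = [0.0] * n, [0.0] * n
    for i in reversed(range(n)):
        returns = rewards[i] + discount * returns   # n-step return
        rets[i] = returns
        if use_gae:
            # Accumulate TD errors with an extra tau decay.
            td_error = rewards[i] + discount * values[i + 1] - values[i]
            gae = discount * tau * gae + td_error
            advs[i] = gae
        else:
            advs[i] = returns - values[i]           # return minus value
        # (a terminal state inside the rollout would zero the bootstrap)
    return rets, advs
```

With `use_gae=False` this is the plain five-step bootstrapping from the config; flipping it to `True` switches to the GAE-style accumulation the video defers to a later lecture.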
This other branch is the different way of estimating the return, and here's the way we're doing it currently: the return minus the value, detached; that's the advantage. Then here are the policy loss, the value loss, and the entropy term. Here he zeroes the gradients, and here is the actual optimizer step, the learning. All right. After that, this part is basically synchronizing everything, gathering all the steps from the different workers and so on. So take a look at this in more detail; I think it's very nice code, a very nice implementation, and I hope it's useful to you. There are many other implementations out there, and I've found some that are perhaps easier to understand, but nevertheless this is definitely a high-quality implementation that's worth looking at.
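The way those three loss terms combine can be sketched as follows; the weighting coefficients here are assumed typical values rather than the ones in this repo:

```python
import torch

# Illustrative A2C loss assembly. The 0.5 and 0.01 weights are assumed
# typical values, not confirmed from the repo's config.
def a2c_loss(log_probs, values, returns, entropy,
             value_weight=0.5, entropy_weight=0.01):
    advantages = (returns - values).detach()          # return minus value, detached
    policy_loss = -(log_probs * advantages).mean()    # push up high-advantage actions
    value_loss = (returns.detach() - values).pow(2).mean()  # critic regression
    entropy_bonus = entropy.mean()                    # subtracted: encourages exploration
    return policy_loss + value_weight * value_loss - entropy_weight * entropy_bonus

# Toy usage: a stand-in tensor for the critic outputs, then the usual
# zero-grad / backward / step cycle (clipping and optimizer.step() follow
# this in the agent).
values = torch.zeros(5, requires_grad=True)
loss = a2c_loss(log_probs=torch.zeros(5), values=values,
                returns=torch.ones(5), entropy=torch.zeros(5))
loss.backward()
```

Detaching the advantage is what keeps the policy gradient from also pushing on the critic through the baseline term.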