16 – DDPG Export V1

Hey guys, here we are. So, I wanted to show you DDPG now. Remember, it's somewhat questionable whether DDPG counts as an actor-critic method or not, but it's a very important algorithm, and the people who created it call it actor-critic, so we're going to go with that. There is a DDPG implementation that's much easier to read, so I wanted to show that one real quick and then go to Shangtong Zhang's implementation as well. Here in the DDPG pendulum example you can see the model. Pretty straightforward: there's an actor here, there's a critic here, and you can see how the action is passed into the critic's forward pass and concatenated with the state features to get the value of that state-action pair. So this actor-critic is basically learning a Q-function, which is a little bit different, but that's all good. The actor's forward pass is very regular: here's the initialization function and here's the forward pass. You can see we're using a Tanh on the output, and that's because we're doing continuous control here on the pendulum. On the critic side, the first layer takes the state size, and then the second fully connected layer takes the first layer's units plus the action size, because that's where the concatenation happens. In the forward pass you run the state through fc1, concatenate the action, pass that through fc2 and a ReLU, and then feed the result to the third layer, which has a single output: the Q-value. So, that's the model. Then we have the agent. I'm going to the Python notebook real quick; it really doesn't have much other than running the code, but I wanted to show that.
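As a minimal sketch of the critic structure described above (the exact layer widths here are illustrative, not necessarily the ones in the walkthrough): the first layer sees only the state, and the action is concatenated with the first layer's features before fc2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """Q(s, a) network sketch: action joins after the first layer."""
    def __init__(self, state_size, action_size, fc1_units=400, fc2_units=300):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)                # state only
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)   # + action size here
        self.fc3 = nn.Linear(fc2_units, 1)                         # single Q-value out

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = torch.cat((x, action), dim=1)   # concatenate action with state features
        x = F.relu(self.fc2(x))
        return self.fc3(x)

critic = Critic(state_size=3, action_size=1)
q = critic(torch.zeros(5, 3), torch.zeros(5, 1))
print(q.shape)  # torch.Size([5, 1])
```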
So, this is basically a runner: resetting the environment, having the agent act, calling the agent step and the environment step, and so on. Even though the function is called ddpg, it doesn't really contain anything specific to DDPG itself; it just runs the algorithm, which is imported from the DDPG agent file, where we have the agent class with the actor, the critic, and the optimizers. There are a couple of interesting things here that we talked about in the actor-critic lectures. The actor has a local and a target network, and the critic has a local and a target network as well; I call the local one the regular network, and then there's the target network. You also have a replay buffer here, as you can see. And there's some process noise, which is another thing DDPG introduced: noise is added to the actions to make the agent explore. The step function adds experiences to memory; that's just the usual replay memory code. In the act function, you transform the state from NumPy to Torch, do a forward pass of actor_local on that state, move the result to the CPU, get the data into NumPy, set the network back up for training, add noise to the action, and clip it between minus one and one. In the learn function, I wanted to show you one of the things we discussed in the lecture, which is the TAU soft update right there. The soft update function, as you can see, grabs all of the target model and local model parameters, matches them up pairwise, and copies a mixture of the two into the target.
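The act step described above can be sketched like this; the tiny network here is a hypothetical stand-in for the agent's actor_local, and the scalar noise argument stands in for a sample from the noise process.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for the agent's actor_local network.
actor_local = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1), nn.Tanh())

def act(state, noise_sample=0.0):
    """NumPy state -> Torch, forward pass with gradients off, back to NumPy,
    then add exploration noise and clip to the valid action range."""
    state_t = torch.from_numpy(state).float().unsqueeze(0)
    actor_local.eval()
    with torch.no_grad():
        action = actor_local(state_t).cpu().data.numpy()
    actor_local.train()            # set back up for training
    action = action + noise_sample # exploration noise added to the action
    return np.clip(action, -1, 1)  # still clipped between -1 and 1

a = act(np.zeros(3, dtype=np.float32), noise_sample=5.0)
print(a)  # clipped to 1.0 because of the large noise sample
```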
So, the target network is receiving updates that are a combination of the local network, i.e. the most up-to-date network, and itself. TAU there can be a very small number; in this case, I think TAU has a value of 0.001. So it's basically keeping a very large chunk, like 99.9 percent, of itself and mixing in a very, very small chunk of the other network. Here's the noise class; you can look at that as well. Other than that, this algorithm is very much like DQN: you have a replay buffer, you do basically the same interaction loop, and all that. The only question is how we deal with a continuous action space. The way you do that is basically to have another optimizer that tries to learn the optimal action for any given state. Then, using that optimal action, you can pass it through the other network to get the actual value. So, it's pretty nice. Now, I'm just going to go and show you where the implementation of DDPG is in Shangtong's repository. You can look in here: you have deep_rl. Remember, the examples file usually helps you see the different ways you can use this. There's DDPG with a low-dimensional state and DDPG with pixels. If you want to use images, like for Atari, then you'd probably want to look into the pixel one. If you're going to use a low-dimensional state, like we're doing here for the pendulum, then you can see how he has it here, a preset for the pendulum. In the end I guess he selected RoboschoolHopper, but it's basically the same thing; you can use any of these environments and it will work fine. So, the DeterministicActorCriticNet: you can see here the body of the actor, and the body of the critic is a little bit different, a two-layer fully connected body with action. Interesting.
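The soft update mixture described above, target ← tau·local + (1−tau)·target, is only a few lines. A sketch, with a one-weight network so the mixing is easy to see:

```python
import torch
import torch.nn as nn

def soft_update(local_model, target_model, tau=1e-3):
    """Copy a mixture of local and target parameters into the target:
    target <- tau * local + (1 - tau) * target."""
    for target_param, local_param in zip(target_model.parameters(),
                                         local_model.parameters()):
        target_param.data.copy_(tau * local_param.data
                                + (1.0 - tau) * target_param.data)

# Demo: target starts at 0, local at 1; one update moves target by tau.
local = nn.Linear(1, 1, bias=False)
target = nn.Linear(1, 1, bias=False)
nn.init.constant_(local.weight, 1.0)
nn.init.constant_(target.weight, 0.0)
soft_update(local, target, tau=1e-3)
print(target.weight.item())  # ~0.001: 99.9% itself, 0.1% the local network
```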
We probably need to look into that and see what it is; it's probably exactly what we just discussed in the previous one. The random process is the noise that we talked about. He uses the full name (Ornstein-Uhlenbeck), which I'm not even going to try to pronounce. Other than that, you see the target network mix over here is 1e-3, which is basically the same as we had in the other one. So there are many similar things; it's just a different style of repository. That's all. Let's take a look real quick at the deterministic bodies. So, it was this one: two-layer with action. Here you can see a similar bit of code, torch.cat of x and the action, and the hidden size plus the action size that I mentioned. So it's basically the same thing. Then in the heads, we look for the DeterministicActorCriticNet. We get here, and you can see it's built on an ActorCriticNet, which is actually the same base as the A2C that I talked about a few videos ago. You can see the deterministic action here, and then you have the optimization function for the actor and the optimization function for the critic as well, which get passed in. If you go back to the examples, you can see the optimizers somewhere in here: Adam, and the learning rate for one is a little bit different. I think this is actually using similar hyperparameters to the paper; I think in the paper they did use these two learning rates, or at the very least they used different ones for the actor and for the critic. So, maybe take a look at that. So, it's a completely different implementation. When you want to see the agent, this is the A2C; I can go back to the DDPG agent. Here's the DDPG agent, and you can actually look at it.
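As a sketch of those separate optimizers (the tiny networks here are hypothetical stand-ins; the 1e-4 and 1e-3 learning rates are the ones the DDPG paper reports for the actor and the critic, treat them as a starting point):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in networks; only the optimizer setup matters here.
actor = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 16), nn.ReLU(), nn.Linear(16, 1))

# One Adam optimizer per network, with different learning rates
# (1e-4 for the actor, 1e-3 for the critic, following the paper).
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```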
There's a soft update here, there's an eval step, there's evaluation and some preparation for that, and then the regular learning steps. You can see here that you take an action and grab the normalized state. You do have to look, because sometimes the normalizer doesn't really normalize; the reward normalizer, for example, actually clips the reward as well, so look into that. Then the replay buffer: you feed it the experience tuples, state, action and so on, and once the replay buffer is larger than some minimum size, you get into the learning steps. Here you can see how you grab the experiences from the replay buffer and unpack them into the right variables. You pass the next state through the target network, and interestingly the forward pass is being done this way; I think there's a reason for that, but you may need to dig into it a little bit more. So, here's the discount, you calculate q_next, and so on. There's the zero_grad and the optimizer step for the critic, then the optimizer step for the actor, and then you update the target network in a soft-update manner. That is it. Well, I hope this is useful. This is a very nice implementation; I was actually delighted to see very nicely written code, so I wanted to show this one as well. Hope this is useful for you guys, and have fun with your reinforcement learning journey.
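Finally, the learning steps walked through above, i.e. the TD target through the target networks, the critic step, then the actor step, can be sketched like this. Everything here (network shapes, batch construction) is a hypothetical minimal setup, not either repository's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the local and target networks.
state_size, action_size, gamma = 3, 1, 0.99

def make_actor():
    return nn.Sequential(nn.Linear(state_size, 8), nn.ReLU(),
                         nn.Linear(8, action_size), nn.Tanh())

def make_critic():
    return nn.Sequential(nn.Linear(state_size + action_size, 8), nn.ReLU(),
                         nn.Linear(8, 1))

actor_local, actor_target = make_actor(), make_actor()
critic_local, critic_target = make_critic(), make_critic()
actor_opt = torch.optim.Adam(actor_local.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic_local.parameters(), lr=1e-3)

def learn(states, actions, rewards, next_states, dones):
    # Critic update: TD target uses the *target* actor and critic.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_next = critic_target(torch.cat((next_states, next_actions), dim=1))
        q_targets = rewards + gamma * q_next * (1 - dones)
    q_expected = critic_local(torch.cat((states, actions), dim=1))
    critic_loss = nn.functional.mse_loss(q_expected, q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize Q(s, mu(s)), i.e. minimize its negative.
    actions_pred = actor_local(states)
    actor_loss = -critic_local(torch.cat((states, actions_pred), dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()

torch.manual_seed(0)
batch = 4
losses = learn(torch.randn(batch, state_size),
               torch.rand(batch, action_size) * 2 - 1,
               torch.randn(batch, 1),
               torch.randn(batch, state_size),
               torch.zeros(batch, 1))
```

After this, the soft update from earlier would be applied to both target networks.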
