2 – M3L3 C02 V6

By the end of the last video, we had discussed a game that we’d like to teach an agent to play. There were four possible actions, corresponding to moving up, down, left, or right. The output layer of our neural network had a node for each possible action. The weights begin with random values, and we can use the corresponding policy to play the game for an episode. Say the reward is delivered only at the end of the game, and is positive one if the agent wins and negative one if the agent loses. So how can we use this information to improve the network weights to get us closer to the optimal policy?
For now, say that we collected a single episode, and the agent won. Well, the policy gradient method that we’ll discuss in this lesson will look at each state-action pair separately. Beginning with the first one, we’ll recall how the agent ultimately selected action left from the state. It just passed that state through the network, which returned action probabilities. The agent then sampled from those probabilities, which ultimately led to selecting action left. So the idea is this: since the agent won the game, that’s an indication that it was a good decision to select action left when in this game state. So we can change the network weights just a little bit to make it even more likely to select action left from that game state in the future. Then we move on to the next state-action pair and look at the probabilities that led to selecting action up, and we amend the network weights again, just a little, to make it slightly more likely to select action up from the corresponding game state.
Once we’ve done those updates for every state-action pair in the episode, we can collect another episode. Say that in the second episode we lost. We’ll again consider each of the state-action pairs one at a time, and begin with the first one. Say the action probabilities corresponding to the state are given here. Since this choice to select action up was part of an episode where we eventually lost the game, it makes sense to amend the network weights to now put less probability on that action. We’ll do the same for all other state-action pairs in the episode, where we want to amend the network weights to make it less likely to repeat these bad decisions in the future.
We’ll continue with collecting more episodes and making these modifications to the network. And that’s it. In this lesson, we’ll dig more deeply into this process, but it’s useful to keep the big picture in mind. We just collect a lot of episodes, and then for each episode we amend the network weights to make all of the state-action pairs more likely if we won the game, and to make them all less likely if we lost the game. This method isn’t perfect, but it’s a start. Later in this lesson we’ll learn about some ways to improve it.
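
To make the big picture concrete, here is a minimal sketch of the idea, not the implementation used in this course. It assumes a much simpler policy than a real neural network: a single linear layer followed by a softmax, with one row of weights per action. The function names, the three-number state, and the learning rate `lr` are all hypothetical choices made just for this illustration. Each update nudges the weights along the gradient of the log-probability of the chosen action, scaled by the episode outcome (+1 for a win, -1 for a loss).

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]

def action_probs(weights, state):
    """Pass the state through a linear layer plus softmax (one output per action)."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(weights, state, rng):
    """Sample an action index from the policy's action probabilities."""
    probs = action_probs(weights, state)
    return rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]

def update_from_episode(weights, episode, outcome, lr=0.1):
    """Return new weights after visiting each (state, action) pair in turn.

    outcome is +1 if the episode was won and -1 if it was lost.
    For this softmax-linear policy, the gradient of log pi(action | state)
    with respect to the weight row of action a is (1[a == action] - pi(a)) * state.
    """
    new_w = [row[:] for row in weights]
    for state, action in episode:
        probs = action_probs(new_w, state)
        for a in range(len(ACTIONS)):
            indicator = 1.0 if a == action else 0.0
            for i, x in enumerate(state):
                new_w[a][i] += lr * outcome * (indicator - probs[a]) * x
    return new_w

# Usage: start from random weights and a made-up 3-number state.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in ACTIONS]
state = [1.0, 0.5, -0.2]
left = ACTIONS.index("left")

before = action_probs(weights, state)[left]
after_win = action_probs(update_from_episode(weights, [(state, left)], +1), state)[left]
after_loss = action_probs(update_from_episode(weights, [(state, left)], -1), state)[left]
print(before, after_win, after_loss)
```

In this sketch, the probability of selecting action left from that state goes up a little after a win and down a little after a loss, which is exactly the behavior described above. A real agent would use a deeper network and an automatic differentiation library rather than a hand-written gradient.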