5 – M3L3 C05 V2

As you’ve learned, we can express the expected return as a probability-weighted sum, where we take into account the probability of each possible trajectory and the return from each trajectory. Our goal is to find the value of theta that maximizes expected return. One way to do that is gradient ascent, where we iteratively take small steps in the direction of the gradient. This algorithm is closely related to gradient descent. The differences are that gradient descent is designed to find the minimum of a function, whereas gradient ascent finds the maximum, and gradient descent steps in the direction of the negative gradient, whereas gradient ascent steps in the direction of the gradient. Remember that alpha is the step size, and we’ll let it decay over time to avoid overshooting the target.

But to apply this method, we’ll need to be able to calculate the gradient. Now, we won’t be able to calculate the exact value of the gradient, since that is computationally too expensive: to calculate the gradient exactly, we’d have to consider every possible trajectory. Instead of doing this, we’ll just sample a few trajectories using the policy and then use only those trajectories to estimate the gradient. Specifically, we’ll use the policy to collect m trajectories. We’ll denote those by tau^(1), tau^(2), all the way up to tau^(m), where the trajectory number is in the superscript. Remember, a trajectory is just a sequence of states and actions, and we’ll use this notation to refer to the i-th trajectory. Then we’ll use these m trajectories to estimate the gradient; we’ll do that by using the equation here. You’ll learn more about how to use that equation soon. For now, know that it consolidates information from the m trajectories to yield an estimate of the gradient, which we refer to as g-hat. Once we have an estimate of the gradient, we can use it to update the weights of the policy. Then we repeatedly loop over these steps to converge to the weights of the optimal policy. Now, before moving on, make sure that the high-level details are clear: we’re just doing gradient ascent, where we work with an estimate of the gradient.

But let’s look a bit more closely at this equation, to understand how it ties into the big picture that we learned about at the beginning of the lesson. We’ll begin by making some simplifying assumptions that we will remove later. So for now, assume that we only collect a single trajectory, so m is equal to one. Furthermore, assume that the trajectory tau corresponds to a full episode. Then we can simplify the equation for the gradient to look like this. All of the superscripts have been removed, because now we only have one trajectory. So let’s see how this equation accomplishes exactly what’s described in the big picture. Remember, we’re currently assuming that we collect a full episode, which we refer to as tau. Then R(tau) is just the cumulative reward from that trajectory. Remember that the reward signal in the sample game we’re working with gives the agent a reward of positive one if we won the game and a reward of negative one if we lost. Recall that tau is just a sequence of states and actions. This term looks at the probability that the agent selects action a_t from state s_t. Remember, pi with theta in the subscript refers to the policy, which is parameterized by theta. Then this full calculation here takes the gradient of the log of that probability.
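(The equations referenced above appear on screen in the video rather than in the transcript. The following is a reconstruction assuming the standard REINFORCE form, where alpha is the step size, H is the final time step of the episode, and the superscript (i) indexes trajectories.)

Estimate of the gradient from m sampled trajectories:

$$\hat{g} = \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{H} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)$$

Gradient-ascent update of the policy weights:

$$\theta \leftarrow \theta + \alpha\, \hat{g}$$

Simplified form under the assumptions above (a single full episode, m = 1):

$$\hat{g} = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$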
That gradient tells us how we should change the weights of the policy, theta, if we want to increase the log probability of selecting action a_t from state s_t. Specifically, if we nudge the policy weights by taking a small step in the direction of this gradient, we’ll increase the log probability of selecting that action from that state, and if we step in the opposite direction, we’ll decrease the log probability.

So here’s where it gets really cool, because we’re ready to understand exactly how this equation fits into the big picture. This equation comes into play after we have collected a trajectory. It tells us how to use the information in that trajectory to update the weights of the policy network. It specifically takes into account whether we won or lost, and it tells us how to do all of these updates at once, for each state-action pair at each time step in the episode. To see this, assume that the agent won the episode. Then R(tau) is just positive one, which means we can effectively ignore it: we’re just multiplying each term in the sum by one, which doesn’t do anything. So what the sum does is add up all the gradient directions we should step in to increase the log probability of selecting each state-action pair. If we take a step in the direction of g-hat, that’s equivalent to taking H+1 simultaneous steps, where each step corresponds to a state-action pair in the episode and is designed to increase the log probability of selecting that action from that state. In the event that we lost, R(tau) becomes negative one, which ensures that instead of stepping in the direction of steepest increase of the log probabilities, we step in the direction of steepest decrease. I strongly encourage you to take your time here and make sure that this makes sense to you.

But then the question becomes: how does this get more complex if we remove our original assumptions? Well, remember that this was the original equation. You’ll notice that it’s almost identical; we now just need to add contributions from multiple trajectories. You’ll also note that there’s a scaling factor that’s inversely proportional to the number of trajectories. If you like, you can read more about how to derive this equation for yourself in the text that follows the video. But before digging into those details, make sure it’s clear to you how this equation fits into the big picture. Feel free to watch this video multiple times if needed, and when you’re ready, we’ll work with an implementation of this method.
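(As a preview of that implementation, here is a minimal sketch of the single-trajectory update described above, written in PyTorch. Everything concrete in it is an assumption for illustration: the CartPole environment from Gymnasium, the two-layer network, and the learning rate are stand-ins, not the course’s actual code.)

```python
# Minimal sketch of the single-trajectory (m = 1) update described above.
# Environment, network size, and learning rate are illustrative assumptions.
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

env = gym.make("CartPole-v1")  # stand-in environment (assumption)

# pi_theta: maps a state to a probability over actions
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 16),
    nn.ReLU(),
    nn.Linear(16, env.action_space.n),
    nn.Softmax(dim=-1),
)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)  # alpha, the step size

for episode in range(1000):
    # 1. Collect one full trajectory tau with the current policy,
    #    saving log pi_theta(a_t | s_t) at every time step.
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # 2. R(tau): the cumulative reward of the trajectory
    #    (+1 / -1 in the video's win-or-lose game; a plain sum here).
    R = sum(rewards)

    # 3. Gradient ascent on sum_t log pi_theta(a_t | s_t) * R(tau):
    #    minimizing the negative of the objective is the same as ascending it.
    loss = -R * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key line is the loss: minimizing the negative of R(tau) times the sum of log probabilities with a standard optimizer is the same as taking a gradient-ascent step on the objective, and a single backward pass performs all H+1 per-time-step nudges at once, exactly as described above.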
