4 – M3L3 C04 V2

Now that we have the big picture of how the policy gradient method will work, we’re ready to get more specific. We’ll build slowly and carefully, and I strongly encourage you to keep the big picture in mind as the mathematical details unfold over the next several videos.

The first thing we need to define is a trajectory. A trajectory is just a state-action sequence. You can start to think of it as a fancy way of referring to an episode where we don’t keep track of the rewards. But actually, a trajectory is a little more flexible, because there are no restrictions on its length: it can correspond to a full episode or just a small part of an episode. We denote the length with a capital H, where H stands for horizon. We denote a trajectory with the Greek letter Tau. Then, the total reward from that trajectory is written as R of Tau.

Our goal in this lesson is the same as in the previous lesson: we want to find the weights Theta of the neural network that maximize expected return. One way of accomplishing this is by setting the weights of the neural network so that, on average, the agent experiences trajectories that yield high return. We denote the expected return by capital U, and note that U is a function of Theta. We want to find the value of Theta that maximizes U.

U is defined in the expression here. To understand it, we’ll look at each part separately. First, recall that R of Tau is just the return corresponding to an arbitrary trajectory Tau. To use this quantity to calculate the expected return, we need only take into account the probability of each possible trajectory. That probability depends on the weights Theta of the neural network. This is because Theta defines the policy, which is used to select the actions in the trajectory, and those actions in turn play a role in determining the states that the agent sees.
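The on-screen expression is not part of this transcript. Based on the description above (the return of each trajectory, weighted by that trajectory's probability), it can be reconstructed as follows. This is a sketch that assumes a finite set of possible trajectories, so the expectation is a plain sum; with continuous states or actions it would be an integral instead.

```latex
U(\theta) = \sum_{\tau} P(\tau; \theta)\, R(\tau)
```

Here the sum runs over every possible trajectory Tau of horizon H, R of Tau is the return of that trajectory, and P of Tau, semicolon Theta, is the probability of that trajectory when the policy uses the weights Theta.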
We use this notation with a semicolon only to indicate that Theta has this influence on the probability of a trajectory. In the upcoming concepts, we work directly with this formula as we explore the details behind the policy gradient method.
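To make the definition of U concrete, here is a minimal Python sketch that computes the expected return exactly by enumerating every trajectory of a tiny problem. Everything in it is an assumption made for illustration and none of it comes from the lecture: the two-state, two-action environment, the `TRANSITION` and `REWARD` tables, the softmax policy with one weight per state-action pair (standing in for the neural network), and the convention that `horizon` counts the number of actions in a trajectory.

```python
import itertools
import math

# Hypothetical toy problem (not from the lecture): 2 states, 2 actions.
N_STATES, N_ACTIONS = 2, 2
START_STATE = 0

# TRANSITION[s][a][s2] = probability of reaching state s2 after action a in state s.
TRANSITION = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
# REWARD[s][a] = reward received for taking action a in state s.
REWARD = [[0.0, 1.0], [2.0, 0.0]]


def policy(theta, state):
    """Action probabilities in `state`: a softmax over the weights theta[state]."""
    exps = [math.exp(w) for w in theta[state]]
    total = sum(exps)
    return [e / total for e in exps]


def trajectories(horizon):
    """Yield every trajectory as (states, actions), with `horizon` actions."""
    for actions in itertools.product(range(N_ACTIONS), repeat=horizon):
        for next_states in itertools.product(range(N_STATES), repeat=horizon):
            yield (START_STATE,) + next_states, actions


def trajectory_probability(theta, states, actions):
    """P(tau; theta): product of policy and transition probabilities along tau."""
    prob = 1.0
    for t, action in enumerate(actions):
        prob *= policy(theta, states[t])[action]
        prob *= TRANSITION[states[t]][action][states[t + 1]]
    return prob


def trajectory_return(states, actions):
    """R(tau): sum of the rewards collected along the trajectory."""
    return sum(REWARD[states[t]][action] for t, action in enumerate(actions))


def expected_return(theta, horizon):
    """U(theta): return of each trajectory weighted by its probability."""
    return sum(
        trajectory_probability(theta, states, actions)
        * trajectory_return(states, actions)
        for states, actions in trajectories(horizon)
    )


if __name__ == "__main__":
    uniform = [[0.0, 0.0], [0.0, 0.0]]   # both actions equally likely everywhere
    biased = [[0.0, 2.0], [2.0, 0.0]]    # favors the rewarded action in each state
    print(expected_return(uniform, horizon=3))
    print(expected_return(biased, horizon=3))
```

Note that Theta enters only through `policy`, yet it changes the probability of every trajectory, which is the dependence the semicolon notation is meant to signal. Exact enumeration is feasible only because this example is tiny; the number of trajectories grows exponentially with the horizon.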
