4 – M3 L2 C05 V1

Now that we have a mental picture of how the hill climbing algorithm should work, we’re ready to dig into the pseudo-code. So remember, we begin with an initially random set of weights Theta. We’ll collect an episode with the policy that corresponds to those weights, and then record the return, which we’ll denote by capital G. At the start, this value for Theta is our first best guess for the weights, which we’ll record as Theta best. We’ll also record the return as the highest return we’ve gotten so far. Then, we’ll add a little bit of random noise to these weights to give us another set of candidate weights we can try out. We’ll refer to those new weights as Theta new. To see how good those new weights are, we’ll use the policy that they give us to interact with the environment. Then, we’ll calculate the return we got, which we’ll refer to as G new. If the new weights gave us more return than our current best so far, we update the best weights to that new value. Then, we just repeat, or loop over, these steps until we solve the environment. That’s the complete algorithm. In the upcoming video, you’ll learn about some modifications we can make to improve it.
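A minimal sketch of this loop in Python might look like the following. The environment interaction is abstracted into a hypothetical `run_episode(weights)` function that plays one episode with the policy given by `weights` and returns the total reward G; the quadratic toy objective standing in for the environment, the weight dimension, and the `noise_scale` value are assumptions for illustration, not details from the lecture.

```python
import numpy as np

def run_episode(weights):
    """Stand-in for one episode of environment interaction.

    In practice this would roll out the policy parameterized by
    `weights` and sum up the rewards. Here a simple (hypothetical)
    quadratic objective plays that role so the sketch runs on its own.
    """
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((weights - target) ** 2)

rng = np.random.default_rng(0)

# Begin with an initially random set of weights Theta.
theta = rng.standard_normal(3)

# Collect an episode with the corresponding policy and record its return G.
G = run_episode(theta)

# First best guess: record Theta best and the highest return so far.
theta_best, G_best = theta.copy(), G

noise_scale = 0.1  # assumed magnitude of the random perturbation

for step in range(1000):
    # Add a little random noise to get candidate weights Theta new.
    theta_new = theta_best + noise_scale * rng.standard_normal(3)

    # Interact with the environment using the new policy; record G new.
    G_new = run_episode(theta_new)

    # If the new weights gave us more return, they become the new best.
    if G_new > G_best:
        theta_best, G_best = theta_new, G_new

print(theta_best, G_best)
```

In a real setting, the loop would typically stop once the average return clears the environment’s solving threshold rather than running for a fixed number of steps.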
