6 – M3 L2 C07 V3

So far, you’ve learned about a couple of different algorithms that we can use to optimize the weights of the policy network. Hill climbing begins with a best guess for the weights, then adds a little bit of noise to propose one new policy that might perform better. Steepest-ascent hill climbing does a little bit more work by generating several neighboring policies at each iteration. But in both cases, only the best policy prevails, so with steepest-ascent hill climbing there’s a lot of useful information that we’re throwing out. In this video, you’ll learn about some methods that leverage useful information from the weights that aren’t selected as best.

So what if, instead of selecting only the best policy, we selected the top 10 or 20 percent of them and took the average? This is what the Cross-Entropy Method does, and you’ll look at an implementation later in the lesson. Another approach is to look at the return that was collected by each candidate policy. The new policy will be a weighted sum of all of the candidates, where policies that got higher return are given more say, or a higher weight. This technique is called Evolution Strategies. The name originally comes from the idea of biological evolution, where the most successful individuals in the population have the most influence on the next generation, or iteration. That said, it’s best to think of evolution strategies as just another black-box optimization technique. In the upcoming concepts, you’ll have the chance to compare some of these approaches.
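To make the two update rules concrete, here is a minimal sketch of one iteration of each, assuming the policy is just a flat NumPy weight vector and using a hypothetical `evaluate` function as a stand-in for collecting a return with a candidate policy. The population size, noise scale, and elite fraction are illustrative choices, not values from the lesson.

```python
import numpy as np

def evaluate(weights):
    # Hypothetical stand-in for running an episode with a policy
    # parameterized by `weights` and returning the episode return.
    # Here we simply score closeness to an arbitrary target vector.
    target = np.ones_like(weights)
    return -np.sum((weights - target) ** 2)

def cem_update(best_weights, pop_size=50, elite_frac=0.2, sigma=0.5):
    """One Cross-Entropy Method step: perturb the current best weights,
    keep the top elite_frac of candidates, and average them."""
    candidates = [best_weights + sigma * np.random.randn(*best_weights.shape)
                  for _ in range(pop_size)]
    returns = np.array([evaluate(w) for w in candidates])
    n_elite = int(pop_size * elite_frac)
    elite_idx = returns.argsort()[-n_elite:]  # indices of the highest-return candidates
    return np.mean([candidates[i] for i in elite_idx], axis=0)

def evolution_strategies_update(best_weights, pop_size=50, sigma=0.5):
    """One simple Evolution Strategies step: every candidate contributes,
    weighted by its (normalized) return, so better policies get more say."""
    candidates = [best_weights + sigma * np.random.randn(*best_weights.shape)
                  for _ in range(pop_size)]
    returns = np.array([evaluate(w) for w in candidates])
    # Softmax the returns so that higher-return policies receive larger weights.
    contribution = np.exp(returns - returns.max())
    contribution /= contribution.sum()
    return np.sum([w * c for w, c in zip(contribution, candidates)], axis=0)

# Usage: start from a random guess and refine it for a few iterations.
theta = np.random.randn(4)
for _ in range(20):
    theta = cem_update(theta)  # or evolution_strategies_update(theta)
print(theta)
```

The only difference between the two sketches is how the candidates are combined: the Cross-Entropy Method averages just the elite fraction, while Evolution Strategies blends every candidate in proportion to its return.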
