With an objective function in hand, we can now think about finding a policy that maximizes it. An objective function can be quite complex. Think of it as a surface with many peaks and valleys. Here it’s a function of two parameters, with the height indicating the policy’s objective value J(θ), but the same idea extends to more than two parameters. Now, we don’t know anything about this surface, so how do we find the spot where the objective value is at its maximum? Our first approach is to search for the best policy by repeatedly nudging it around. Let’s start with some arbitrary policy π, defined by its parameters θ, and evaluate it by applying that policy in the environment. This gives us an objective value, so you can imagine the policy lying somewhere on the objective function surface. Now we can change the policy parameters slightly so that the objective value also changes. This can be achieved by adding some small Gaussian noise to the parameters. If the new policy’s value is better than our best value so far, we can set this policy to be our new best policy and iterate. This general approach is known as hill climbing: you literally walk the objective function surface until you reach the top of a hill. Simple, isn’t it? The best part is that you can use any policy function; it does not need to be differentiable or even continuous. But because you’re taking random steps, this may not result in the most efficient path up the hill. One small improvement to this approach is to choose a small number of neighboring policies at each iteration and pick the best among them. It’s easier to understand this by looking at a contour plot. Again, starting with an arbitrary policy, evaluate it to find out where it lies. Generate a few candidate policies by perturbing the parameters randomly, and evaluate each policy by interacting with the environment. This gives us an idea of the neighborhood of the current policy.
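The basic hill-climbing loop described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the `evaluate` function here is a hypothetical stand-in for running the policy in the environment (in practice it would return the total reward from one or more episodes), so a toy quadratic surface with a known peak is used in its place.

```python
import numpy as np

def evaluate(theta):
    # Hypothetical stand-in for environment rollouts: a smooth toy
    # objective with its peak at theta = [1.0, -0.5].
    return -np.sum((theta - np.array([1.0, -0.5])) ** 2)

def hill_climb(n_iterations=500, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    best_theta = rng.standard_normal(2)   # arbitrary starting policy
    best_value = evaluate(best_theta)
    for _ in range(n_iterations):
        # Perturb the best parameters so far with small Gaussian noise.
        candidate = best_theta + noise_scale * rng.standard_normal(2)
        value = evaluate(candidate)
        if value > best_value:            # keep the candidate only if it improves
            best_theta, best_value = candidate, value
    return best_theta, best_value
```

Note that nothing here requires gradients of the policy: `evaluate` is treated as a black box, which is exactly why any policy function, differentiable or not, can be plugged in.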
Now, pick the candidate policy that looks most promising and iterate. This variation is known as steepest-ascent hill climbing, and it helps reduce the risk of selecting a next policy that leads to a suboptimal solution. You could still get stuck in local optima, but there are some modifications that can help mitigate that, for example, using random restarts or simulated annealing. Simulated annealing uses a predefined schedule to control how the policy space is explored. Starting with a large noise parameter, that is, a broad neighborhood to explore, we gradually reduce the noise or radius as we get closer and closer to the optimal solution. This is somewhat like annealing iron: by heating it up and then letting it cool down gradually, you allow the iron molecules to settle into an optimal arrangement, resulting in hardened metal. Hence the name. We can also make our approach more adaptive to the changes in policy values being observed. Here’s the intuition: whenever we find a better policy than before, we’re likely getting closer to the optimal policy, so it makes sense to reduce our search radius for generating the next policy. This translates to reducing or decaying the variance of the Gaussian noise we add. So far, this is just like simulated annealing. But if we don’t find a better policy, it’s probably a good idea to increase our search radius and continue exploring from the current best policy. This small tweak to stochastic policy search makes it much less likely to get stuck, especially in domains with a complicated objective function.
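The two refinements above can be combined in one sketch: sample several neighbors per iteration and keep the best (steepest ascent), then shrink the noise scale on improvement and widen it otherwise (the adaptive tweak). As before, `evaluate` is a hypothetical placeholder for environment rollouts, and the scale bounds `min_scale`/`max_scale` are illustrative choices, not prescribed values.

```python
import numpy as np

def evaluate(theta):
    # Hypothetical stand-in for running the policy in the environment;
    # a smooth toy objective peaking at theta = [1.0, -0.5].
    return -np.sum((theta - np.array([1.0, -0.5])) ** 2)

def adaptive_steepest_ascent(n_iterations=200, n_candidates=8,
                             noise_scale=0.5, min_scale=1e-3,
                             max_scale=2.0, seed=0):
    rng = np.random.default_rng(seed)
    best_theta = rng.standard_normal(2)   # arbitrary starting policy
    best_value = evaluate(best_theta)
    for _ in range(n_iterations):
        # Steepest ascent: sample a small neighborhood of candidates
        # and pick the most promising one.
        candidates = [best_theta + noise_scale * rng.standard_normal(2)
                      for _ in range(n_candidates)]
        values = [evaluate(c) for c in candidates]
        top = int(np.argmax(values))
        if values[top] > best_value:
            best_theta, best_value = candidates[top], values[top]
            # Improvement: we are likely closer to the optimum,
            # so shrink the search radius.
            noise_scale = max(min_scale, noise_scale / 2)
        else:
            # No improvement: widen the radius and keep exploring
            # from the current best policy.
            noise_scale = min(max_scale, noise_scale * 2)
    return best_theta, best_value
```

The halving/doubling of `noise_scale` is one common choice of adaptation rule; any decay-on-success, grow-on-failure schedule captures the same idea.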