Here, we will explain why the policy gradient is noisy, and discuss ways to reduce this noise. The way we optimize the policy is by maximizing the expected return, U(θ). To do that, we use gradient ascent. Mathematically, the gradient is given by an average of the terms in the parentheses over all possible trajectories, labeled by τ. Now, the number of trajectories could easily be in the millions, even for simple discrete problems, and infinite for continuous problems. So, for practical purposes, we simply sample one trajectory, compute a gradient, and then update our policy. A lot of the time, the result of a single sampled trajectory simply comes down to chance, and doesn't contain much useful information about our policy. How does learning happen, then? Well, the hope is that after training for a really long time, this tiny signal accumulates. Still, it would be great if we could reduce the random noise in the sampled trajectories.

The easiest option is to simply sample more trajectories. Using distributed computing, we can even collect these trajectories in parallel, so that it won't take too much time. Then we can estimate the policy gradient by simply averaging across all the different trajectories, using the formula g here.

Another bonus of running multiple trajectories is that we can collect all the total rewards and get a sense of how they are distributed. In many cases, the distribution of rewards shifts as learning happens. An episode with a total reward of one might be really good early on in training, but really bad after a thousand training episodes. Learning can be improved if we normalize the rewards like this, where μ is the mean and σ the standard deviation. This batch-normalization technique is also used in many other problems in AI, such as image recognition, where normalizing the input can improve learning.
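As a rough sketch of these two tricks, here is a minimal REINFORCE-style example, assuming each trajectory contributes a gradient term weighted by its (normalized) total reward. The policy parameters, trajectory count, and the randomly faked per-trajectory data below are all hypothetical stand-ins, not part of the lecture's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 8                    # number of trajectories sampled in parallel
theta = np.zeros(4)      # hypothetical policy parameters

# Suppose trajectory i gave us a total reward R_i and a raw REINFORCE
# gradient term (the sum of grad log pi over its steps). We fake both
# with random data here, just to show the bookkeeping.
total_rewards = rng.normal(loc=5.0, scale=2.0, size=N)
grad_log_pi = rng.normal(size=(N, theta.size))

# Trick 2 (normalization): R'_i = (R_i - mu) / sigma, so "good" means
# good relative to the current batch of trajectories.
mu = total_rewards.mean()
sigma = total_rewards.std()
normalized = (total_rewards - mu) / (sigma + 1e-8)  # epsilon avoids /0

# Trick 1 (averaging): estimate the policy gradient g by averaging the
# reward-weighted gradient terms over all N trajectories.
g = (normalized[:, None] * grad_log_pi).mean(axis=0)

# One gradient-ascent step on the policy parameters.
learning_rate = 0.1
theta = theta + learning_rate * g
```

After normalization, the batch of rewards has mean 0 and standard deviation roughly 1, so the scale of the update stays stable even as the reward distribution shifts during training.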