4 – Credit Assignment

In this video, we'll learn how to modify the reward function so that we can better differentiate good versus bad actions within a trajectory. Going back to the gradient estimate, we can take a closer look at the total reward R, which is just the sum of the rewards at each step.

Now, let's think about what happens at time step t. Even before an action a_t is decided, the agent has already received all the rewards up until time step t minus one. So, we can think of that part of the total reward as the reward from the past, labeled R_past. The rest of the reward can be denoted by the future reward, R_future. Because we have a Markov process, the action at time step t can only affect the future rewards, so the past rewards should not really be contributing to the policy gradient here.

So, to properly assign credit to the action at time step t, we should just ignore the past reward, like this, so that a better policy gradient will simply have the future rewards as a coefficient, like this.
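One way to write this decomposition down (assuming the usual REINFORCE-style estimate from the earlier videos, with r_{t'} denoting the reward received at step t'; the exact notation on screen may differ):

$$
\hat{g} = \sum_{t} R \, \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
R = \underbrace{\sum_{t'=0}^{t-1} r_{t'}}_{R^{\text{past}}_t}
  + \underbrace{\sum_{t'=t}^{T-1} r_{t'}}_{R^{\text{future}}_t}
$$

Dropping the past rewards gives the credit-assigned estimate:

$$
\hat{g} = \sum_{t} R^{\text{future}}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

And here is a minimal sketch of how the future-reward coefficients could be computed from a list of per-step rewards (the function name future_rewards is just for illustration, not part of the course code):

```python
import numpy as np

def future_rewards(rewards):
    """Compute R_future at each time step: the sum of rewards from step t onward.

    rewards: sequence of per-step rewards [r_0, r_1, ..., r_{T-1}]
    returns: array whose entry t is r_t + r_{t+1} + ... + r_{T-1}
    """
    rewards = np.asarray(rewards, dtype=float)
    # Reverse, take the cumulative sum, then reverse back.
    return np.cumsum(rewards[::-1])[::-1]

# Example: a 4-step trajectory
print(future_rewards([1.0, 0.0, -1.0, 2.0]))  # -> [2. 1. 1. 2.]
```

Each entry of this array would then replace the single total reward R as the coefficient multiplying the corresponding log-probability gradient term.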
