So far, we’ve been trying to frame the idea of a humanoid learning to walk in the context of reinforcement learning. We’ve detailed the states and actions, and we still need to specify the rewards. The reward structure from the DeepMind paper is surprisingly intuitive. This line is pulled from the appendix of the paper, and it describes how the reward is decided at every time step. Each term communicates to the agent some part of what we’d like it to accomplish. So let’s look at each term individually.

To begin, at every time step, the agent receives a reward proportional to its forward velocity. So if it moves faster, it gets more reward, but only up to a limit, here denoted v_max. But it’s penalized by an amount proportional to the force applied to each joint. So if the agent applies more force to the joints, more reward is taken away as punishment.

Since the researchers also wanted the humanoid to focus on moving forward, the agent is penalized for moving left, right, or vertically. It’s also penalized if the humanoid moves its body away from the center of the track, so the agent will try to keep the humanoid as close to the center as possible. Finally, at every time step, the agent receives some small positive reward if the humanoid has not yet fallen.

They frame the problem as an episodic task: if the humanoid falls, the episode is terminated, and whatever cumulative reward the agent has collected up to that point is all it’s ever going to get. In this way, the reward signal is designed so that if the robot focuses entirely on maximizing this reward, it will also, coincidentally, learn to walk. To see this, first note that if the robot falls, the episode terminates, and that’s a missed opportunity to collect more of this positive reward. In general, if the robot walks for ten time steps, that’s only ten opportunities to collect reward. If it stays walking for 100 time steps, that’s a lot more time to collect more reward.
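To make the structure of this reward concrete, here is a minimal sketch of a per-time-step reward with the terms described above. The coefficients and the names (v_max, w_force, w_drift, w_center, alive_bonus) are illustrative placeholders, not the exact values from the DeepMind paper:

```python
import numpy as np

def reward(forward_velocity, joint_forces, lateral_velocity,
           vertical_velocity, distance_from_center,
           v_max=2.0, w_force=0.005, w_drift=0.05, w_center=0.05,
           alive_bonus=0.02):
    """Illustrative per-step reward; coefficients are assumed, not from the paper."""
    r = min(forward_velocity, v_max)                  # reward forward progress, capped at v_max
    r -= w_force * np.sum(np.square(joint_forces))    # penalize large forces on the joints
    r -= w_drift * (lateral_velocity**2 + vertical_velocity**2)  # penalize sideways/vertical motion
    r -= w_center * distance_from_center**2           # penalize straying from the track center
    r += alive_bonus                                  # small reward for not having fallen yet
    return r
```

Each term pulls the agent toward one of the behaviors above: the cap keeps the velocity bonus bounded, the squared penalties grow quickly for erratic movements, and the alive bonus only keeps accumulating while the humanoid stays upright.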
So if we design the reward in this way, the agent will try to keep from falling for as long as possible. Next, since the reward is proportional to the forward velocity, the robot will also feel pressure to walk as quickly as possible in the direction of the walking track. But it also makes sense to penalize the agent for applying too much force to the joints; otherwise, we could end up with a situation where the humanoid walks too erratically. By penalizing large forces, we can try to keep the movements smooth and elegant. Likewise, we want to keep the agent on the track and moving forward; otherwise, who knows where it could end up walking off to.

Of course, the robot can’t focus just on walking fast, or just on moving forward, or only on walking smoothly, or just on walking for as long as possible. These are four somewhat competing requirements that the agent has to balance at every time step in pursuit of its goal of maximizing expected cumulative reward. And Google DeepMind demonstrated that from this very simple reward function, the agent is able to learn how to walk in a very human-like fashion.

In fact, this reward function is so simple that it may seem that deciding on a reward is quite straightforward, but in general, this is not the case. Of course, there are some cases where it is. For instance, if you’re teaching an agent to play a video game, the reward is just the score on the screen. And if you’re teaching an agent to play backgammon, the reward is delivered only at the end of the game: you could construct it to be positive if the agent wins, and negative if it loses. The fact that the reward can be this simple is precisely what makes this research from DeepMind so fascinating.
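The backgammon case above is an example of a sparse, terminal-only reward: every intermediate step yields zero, and the only signal arrives when the game ends. A minimal sketch of that idea (the function name and the ±1 values are illustrative choices, not from any particular implementation):

```python
def terminal_reward(game_over, agent_won):
    """Sparse reward: zero until the episode ends, then +1 for a win, -1 for a loss."""
    if not game_over:
        return 0.0                       # no feedback during the game
    return 1.0 if agent_won else -1.0    # single signal at the end of the episode
```

Contrast this with the humanoid’s reward, which gives dense feedback at every single time step; sparse rewards like this one are simple to specify but typically make the learning problem harder.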