In order to talk about AlphaZero, we first need to formalize the kind of game that AlphaZero specializes in: zero-sum games. We start with a board game environment, a grid for example, and two competing agents take turns performing actions to try to win the game. In the end, one agent's win is the other agent's loss. We also assume that the game contains no hidden information, so there is no element of luck, and winning or losing is determined entirely by skill. This applies to games as simple as tic-tac-toe and as complicated as chess and Go. Let's go back to our tic-tac-toe example. The goal is to get three in a row, so the first player might make a move, say at the center, then the second player makes a move. There are usually multiple possibilities at each step; in this case, there are eight. Focusing on one of these possibilities, we might have a game sequence like this, and in this case, the first player wins. We can represent all of this information mathematically. The board can be represented by a matrix where zero indicates an empty space and plus one or minus one indicates the pieces of player one and player two. Given that there are only circles and crosses in tic-tac-toe, each entry can only be zero, plus one, or minus one. We can also encode the final outcome by a score z, where plus one indicates a win by the first player, negative one indicates a win by the second player, and zero indicates a draw. This way of representing the board is convenient because the board can easily be fed into a neural network. Also, if we want to swap the two players' pieces, we can just multiply the matrix by negative one, and we can flip the score the same way. This property will come in handy when we build an agent to play the game. Now that we've encoded the game mathematically, we can rephrase everything in the language of reinforcement learning.
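The board encoding and the sign-flip trick above can be sketched in a few lines of NumPy. This is a minimal illustration; the particular finished board below is made up for the example:

```python
import numpy as np

# A finished tic-tac-toe position, from player +1's perspective.
# 0 = empty, +1 = player one's pieces, -1 = player two's pieces.
board = np.array([
    [+1, -1, -1],
    [ 0, +1,  0],
    [-1, +1, +1],
])

# Player +1 has three in a row on the main diagonal, so the score is +1.
z = +1

# Swapping the two players' pieces is just multiplication by -1,
# and the score flips the same way.
flipped_board = -board
flipped_z = -z
```

This encoding is what makes the matrix easy to feed into a neural network: the same network can evaluate a position from either player's perspective after a single sign flip.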
We have a sequence of states for the board game, denoted by s sub t, and we have two players, denoted by plus one and negative one. We can write the current player compactly as negative one to the power of t, assuming we start with t equals zero. Player plus one performs actions at all the even timesteps and tries to maximize the final score, plus z, while player negative one performs actions at all the odd timesteps and tries to maximize negative one times the final score. Now, imagine we've constructed an intelligent agent that can play a perfect game as player plus one. Then it should be able to play a perfect game as player negative one as well, as long as we flip the state s sub t at all the odd timesteps. This means we can have the agent play against itself with a common policy. Besides having a common policy, we can also have a common critic that evaluates the expected outcome from the perspective of the current player. This essentially estimates the expected value of negative one to the power of t times the score z. We will see later that this is the basic idea behind AlphaZero, where we have one agent playing against itself, along with one critic that self-improves as more and more games are played.