We know that tree searches can become intractable very quickly, even when we use Monte Carlo methods. Take the game of Go, for example. The board is a 19-by-19 grid, which means there are 361 possible first moves. The number of possible second moves is a tiny bit smaller, 360, and for the third move, 359. Already, there are around 47 million possibilities for just the first three moves. Now, there are some redundancies, since we can rotate and flip the board without changing the game, but even so, a Monte Carlo search would never accumulate enough statistics to be reliable.

So we need to find a better way to improve Monte Carlo tree search, and, not surprisingly, this can be done using deep neural networks. Here, we introduce an expert policy, pi theta, that roughly tells us the probability of the next move by an expert player, and an expert critic, v theta, that tells us the expected score from the perspective of the current player. The idea is that using the policy, we can focus on exploring moves that an expert is likely to play, while the critic lets us estimate the outcome of a game without running a simulation. In the AlphaZero paper, both the policy and the critic come from a single neural network, so they share the same weights, theta.

Now, we could choose a move using only the policy, but we don't just want to use the policy; we want to find moves that are better than the current policy, so that the policy can improve. To do this, we perform a Monte Carlo tree search that uses both the policy and the critic, so that the resulting moves are stronger than those from the policy alone, and then we use those moves to update the policy.

Let's see how this works in a game of Tic-tac-toe. Given a state, we want to find the best move. Just like in ordinary Monte Carlo tree search, we consider all the possible actions and assign three numbers, U, N, and V, to each of them.
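To make the bookkeeping concrete, here is one way the per-action records might look in code. This is a minimal sketch; the names (`Node`, `prior`, `expanded`) are illustrative choices, not from the AlphaZero paper:

```python
class Node:
    """One node of the search tree, reached via a single move.

    N is the visit count, V the expected score from the current
    player's perspective, and prior the probability pi_theta
    assigns to the move that led here.
    """
    def __init__(self, prior):
        self.prior = prior     # pi_theta(a | s), from the policy network
        self.N = 0             # total visit count for this action
        self.V = 0.0           # expected score, updated during backup
        self.children = {}     # action -> Node, filled in on expansion

    def expanded(self):
        return len(self.children) > 0
```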
Just like before, N is the total visit count for the action, V is the expected score, and U is a value that guides the Monte Carlo tree search. The U function is a bit different from the one in ordinary Monte Carlo tree search. We have the same first term, V, again, and it controls exploitation; we'll see that this V largely comes from the critic. The second term is the exploration term, which encourages visiting nodes with small visit counts. It is proportional to the policy, so that moves an expert is more likely to play are favored. Notice that we also introduce a hyperparameter, C, that controls the level of exploration.

Let's see the search in action. In the beginning, all the U's are zero, so we randomly choose an action. Then we compute an estimated outcome using the critic, v theta, and update all the values in that node. The total number of visits is now one, so the exploration terms for all the other branches need to be updated, like this. To iterate this process, we compare all the U's and visit the branch with the largest U again. The search may lead us to a different branch. Again, we update the expected outcome using the critic, and the U function for the other branches.

As this process is repeated, we may revisit the same branch. What do we do then? To get more information, we can expand this node into all the possible subsequent states, each with its own values U, N, and V, all initialized to zero. Then we choose the action with the maximum U and compute the expected outcome from the critic. Afterward, we go back to the original node and update its expected outcome: V is computed as the average of the initial estimate from the critic together with the results of exploring all the child nodes. Notice that when computing this average, there's a negative sign in front of the contributions from the child nodes.
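The U rule described above can be sketched as follows, assuming the PUCT form used in AlphaZero-style implementations; the exact shape of the exploration term and the default value of `c` here are assumptions on my part:

```python
import math

def u_score(V, N, prior, parent_N, c=1.5):
    """U = V + c * prior * sqrt(parent_N) / (1 + N).

    The first term (V) rewards moves that have scored well so far
    (exploitation); the second rewards moves the expert policy likes
    but that have few visits (exploration). c sets the balance.
    """
    return V + c * prior * math.sqrt(parent_N) / (1 + N)

def select_action(stats, parent_N, c=1.5):
    """stats maps action -> (V, N, prior); return the action with the
    largest U, which is the branch the search visits next."""
    return max(stats, key=lambda a: u_score(*stats[a], parent_N, c))
```

Note that before any visits (parent_N is zero and every V is zero), every U is zero, matching the "all the U's are zero" starting point described above.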
This negative sign is necessary because, by convention, V is an estimate from the perspective of the current player, and moving to a child node changes the player. After many iterations, as we go deeper into the game tree, we might reach a terminal state where the game ends. Then, instead of using the critic to estimate the expected outcome, we simply use the actual outcome. We repeat this process a fixed number of times in total. Just like in Monte Carlo tree search, to choose the best move we can simply pick the action with the highest visit count, or, if we want to encourage more exploration, we can choose an action stochastically with probability proportional to the number of visits for each action, like the equation here.
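The backup with the sign flip, and the final visit-count sampling, might look like this. It's a minimal sketch using a running mean, with plain dictionaries standing in for a real tree structure; the exact averaging bookkeeping is a common implementation choice, not prescribed by the walkthrough above:

```python
import random

def backup(path, leaf_value):
    """Propagate a leaf evaluation (critic estimate, or the actual
    outcome at a terminal state) back up the search path, flipping
    the sign at each level because V is always from the perspective
    of the player to move."""
    value = leaf_value
    for node in reversed(path):      # each node: {'N': int, 'V': float}
        node['N'] += 1
        node['V'] += (value - node['V']) / node['N']   # running mean
        value = -value               # switch to the parent's perspective

def sample_move(visit_counts):
    """Pick a move with probability proportional to its visit count."""
    actions = list(visit_counts)
    return random.choices(actions,
                          weights=[visit_counts[a] for a in actions])[0]
```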