# 6 – Alpha Zero 2_ Self-Play Training

Now that we have an improved Monte-Carlo Tree Search guided by an expert policy and critic, how do we update them? Well, start with an empty board of Tic-Tac-Toe, we perform Monte-Carlo Tree Search using the current policy and critic. The end result is a list of visit counts for each actions N sub a, which can be converted into a list of probabilities for each action at every time step. After choosing the first move, we can perform Monte-Carlo Tree Search again. Now, we don’t have to start from scratch because this current state is likely to have a very high visit counts and many of the subsequent states should have been explored already. So, we can iterate the Monte-Carlo Tree Search algorithm fairly efficiently. Eventually, we repeat this process and arrive at the end gain. In this case, the final score is z equals plus one. The final outcome can be compared to the expected outcome computed at each time step using the critic. We also computed the probability of performing an action through the Monte-Carlo Tree Search at each time step, and that can be compared to the expert policy as well. Together, we can compute a loss function, L Theta. This L Theta contains two terms. The first term is the square of the difference between the prediction from the critic and the actual outcome. Notice that I’ve included a factor of negative one to the t to make sure that the comparison is always from the perspective of the current player. The second term involving the logarithm of the policy can be interpreted as an entropy loss between the current policy and the probabilities computed through Monte-Carlo Tree Search. Minimizing this term, we forced the policy to be closer to the results of Monte-Carlo Tree Search, and thus, strengthening the moves predicted by the policy. Using the loss function, we can perform gradient descent to update both the policy and the critic. Finally, we’re able to summarize the AlphaZero algorithm. First, we initialize a neural network that includes both the critic and the policy. Then we play again using Monte-Carlo Tree Search for choosing moves. Afterward, we compute the loss function and perform gradient descent. Repeating step two to three, the agent is able to play better and better. You may ask, how does learning happen intuitively? Starting from random critic and policy, the Monte-Carlo Tree Search in AlphaZero should be no better than a standard Monte-Carlo Tree Search. How does it learn then? Well, in the beginning, the critic is able to improve because whenever we reach end game during the Tree Search, the end game outcome is propagated in the tree and the critic will be able to predict this outcome better and better. After the critic becomes less random, the policy will start to improve as well. As training goes on, the AlphaZero agent will first learn how to play the end game very well. As the end game improves, the mid game will improve as well. Eventually, the algorithm will be able to anticipate long sequences of expert moves, leading to an expert level of gameplay