Starting with a state, we learned previously how to search systematically through one layer of a game tree using the variables U, N, and V. Can we generalize this to go deeper into the tree, so that we can better anticipate a long sequence of moves? This is possible through what's called expansion and back-propagation. Let's see how it works in action.

First, we choose an action that maximizes U. U is zero everywhere, so we randomly choose an action. We play a random game, and this time the current player won. So, we update V to one and the visit count N to one. Afterward, the total number of games played, N total, needs to be updated as well. Then the U values in all the branches need to be updated accordingly.

For the next iteration, we choose the branch with the maximum U, 1.5 in this case. This time, the node has already been visited previously, and we would like to get more information by exploring deeper into the tree structure. So, instead of just playing another random game, we expand the node by considering all the possible next moves. Each of the expanded nodes comes with its own variables U, N, and V.

Again, we choose the action with the maximum U; they're all zero in this case, so we randomly choose an action and play a random game. This time, the blue player won, so we update V and N. Now that the total number of visits for this cluster of nodes has changed, we need to update its N total to one. This changes U for all the branches, so we need to update them as well.

After all the second-level nodes are up to date, we need to update the statistics on the previous node. This procedure is called back-propagation, where we go back up the tree to update all the variables. First, the total number of visits, N, needs to be increased. Second, the expected outcome, V, needs to be updated: we replace it by the average outcome of all the games played from this node, keeping in mind that the outcome is from the perspective of the current player, orange in this case.
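The per-node bookkeeping described above can be sketched in a few lines of Python. The exact formula for U isn't spelled out here; the sketch below assumes an AlphaZero-style bonus, U = V + c·√(N total)/(1 + N), which reproduces the U = 1.5 seen above for a branch with V = 1, N = 1, and N total = 1 when c = 1.

```python
import math

class Node:
    """Statistics tracked for each node in the search tree (a sketch)."""
    def __init__(self):
        self.N = 0          # visit count of this node
        self.V = 0.0        # average game outcome seen through this node
        self.children = {}  # action -> Node, filled in on expansion

    def U(self, N_total, c=1.0):
        # Assumed form of U: the exploration bonus grows with the total
        # number of games in this cluster but shrinks as this node is
        # visited more often, so under-explored branches get picked.
        return self.V + c * math.sqrt(N_total) / (1 + self.N)

def backpropagate(path, outcome):
    """Go back up the tree, updating N and the running average V.
    `outcome` is scored from the perspective of the player at the leaf,
    and its sign flips at every level, since the players alternate."""
    for node in reversed(path):
        node.N += 1
        node.V += (outcome - node.V) / node.N  # incremental average
        outcome = -outcome                     # switch perspective
```

After back-propagating, each touched node holds the average outcome of every game played through it, exactly as in the walkthrough above.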
This gives N equals two and V equals zero, because in the previous playout the blue player won. With the increased visit counts, N total needs to be updated as well, and then all the Us also need to be updated.

Now, we can repeat this process again and again until a fixed number of total games have been played. We might get something like this, for example. Then, to choose the next move, we pick the node with the highest visit count, 21 in this case. One advantage of expansion and back-propagation is that if we want to use Monte Carlo tree search again after making a move, we don't need to start from scratch. We can reuse previous results by replacing the top node with the node chosen for the next move, like this, and continue our Monte Carlo tree search.

So, that's it. We can finally summarize the Monte Carlo tree search algorithm. First, we initialize the top node. Then we loop N total times over the following search procedure. Starting from the top node, we traverse the game tree as far as possible by repeatedly going into the child node with the largest U. Next, if the node hasn't been visited yet, we play a random game; otherwise, we expand the node and play a random game from one of its child nodes, chosen at random. Next, we update the statistics on that node based on the outcome of the gameplay. We also back-propagate the results to the parent nodes, updating the visit counts and the variables U and V where necessary. After the iterations are done, we simply pick the move with the highest visit count.

So, that's it. This summarizes how we can choose a move using Monte Carlo tree search in a zero-sum game. Generally speaking, the more random games we play during the search, the deeper we search the tree, and the better the resulting moves will be.
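The full loop summarized above can be sketched end to end. To keep it self-contained and runnable, the sketch below applies it to a toy zero-sum game invented for this example: players alternately take 1 or 2 stones, and whoever takes the last stone wins. The game, the helper names, and the exact form of U (the same AlphaZero-style bonus, U = V + c·√(N total)/(1 + N)) are all assumptions for illustration, not part of the original walkthrough.

```python
import math
import random

def legal_moves(stones):
    # Toy game assumed for this sketch: take 1 or 2 stones per turn.
    return [m for m in (1, 2) if m <= stones]

class Node:
    def __init__(self):
        self.N = 0          # visit count
        self.V = 0.0        # average outcome, from the perspective of the player who moved into this node
        self.children = {}  # move -> Node, filled in on expansion

def rollout(stones):
    """Random playout; returns +1 if the player to move ends up winning."""
    sign = 1
    while True:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return sign
        sign = -sign

def mcts(stones, n_games=3000, c=1.0):
    root = Node()
    for _ in range(n_games):
        node, s, path = root, stones, [root]
        # 1. Selection: traverse the tree via the child with the largest U.
        while node.children:
            n_total = sum(ch.N for ch in node.children.values())
            move, node = max(
                node.children.items(),
                key=lambda kv: kv[1].V + c * math.sqrt(n_total) / (1 + kv[1].N))
            s -= move
            path.append(node)
        # 2. Expansion: a node visited before gets one child per legal move,
        #    and we continue from a randomly chosen child.
        if node.N > 0 and s > 0:
            node.children = {m: Node() for m in legal_moves(s)}
            m = random.choice(list(node.children))
            s -= m
            node = node.children[m]
            path.append(node)
        # 3. Simulation: play a random game from here, scored from the
        #    perspective of the player who moved into the leaf (if no
        #    stones remain, that player just took the last stone and won).
        outcome = 1 if s == 0 else -rollout(s)
        # 4. Back-propagation: update N and the running average V up the
        #    tree, flipping the sign at each level (zero-sum game).
        for nd in reversed(path):
            nd.N += 1
            nd.V += (outcome - nd.V) / nd.N
            outcome = -outcome
    # Finally, pick the move with the highest visit count.
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```

For example, with 4 stones on the table the only winning move is to take 1 (leaving the opponent a multiple of 3), and the root's visit counts concentrate on that move as more random games are played.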