Given a state in a zero sum game, how do we find an optimal policy? In theory, this is simple, because we could just perform a brute force search, and going through all the possible moves and all the possible games that can be played, and then we can choose the ones with the best possible outcomes. The possibilities are numerous, so I’m only showing a few examples here. We already see that in practice, this brute force method can become infeasible. Even for tic-tac-toe, things are complicated, as there are nine possible first moves, eight possible second moves, seven possible third moves and so forth. Can we do better? One possibility is that we can randomly sample a subset of all the possible games. Then for each action, we can compute the expected outcome by taking the average of all the subsequent play outs like this. For convenience we also add an extra negative sign so that the expected outcome is from the perspective of the current player. Then we would choose the action with the largest expected score, in this case who plays across at the bottom corner. This procedure sounds a little bit familiar. Trying to compute the expected outcome is analogous to estimating the Q function, and we know that a completely random search isn’t ideal. It would be better to have some guidance. So, we can better balance random explorations with exploitations. Let’s get systematic then, and look at all the possible moves player two can take given the current state. There are six possibilities. We can think of all the possibilities as part of a tree structure. The goal is we want to look at each branch of this tree and focus most of the random play outs on the actions that are more promising as opposed to just sample completely randomly. To do this we define three numbers for each branch of this tree: U, N, and V. All the values are initialized to zero. N is the number of times I’ve played a random game starting from that branch, and V is an estimate of the expected outcome starting from that branch and from the perspective of the current player. The variable U is computed from V and N given by the formula on the right. In each iteration I need to choose to play a random game starting from a selective branch. This is where the variable U comes in. The branch with the highest value of U will be chosen. This U function contains two pieces. The first one is V. So when the current expected outcome is large, U will be large, and this helps exploitation. Meanwhile the second piece is inversely proportional to one plus N. So, it encourages visits to branches with low visit counts, and this helps with exploration. Now that we have defined all the variables, let’s see how a search may play out in action. Initially, all the U’s are zero, so we just randomly choose one of the six possible actions. Now, we play a random game, and this time player two lost. So, V is updated to be negative one, and the number of visit increased to one. Before we start the next gameplay, notice that the total number of gameplay is now one. So, the exploration part of the U function needs to be updated for all the other branches, like this. The largest U is now one, so going to the next iteration of gameplay, we need to choose one of the other five actions. We can then repeat this procedure until a fixed number of total gameplays, say 100 games is reached, and we might end up with something like this. How do we decide what the next move is then? We could choose the action with the highest U or V, but sometimes they might not be that reliable, especially if the visit counts are low. So instead, we choose an action with the highest visit counts, which is usually associated with good expected outcome anyway, and it’s a little bit more reliable. In this case the highest visit count is 21, and this is the move that we will choose according to the tree search.