7 – TicTacToe using AlphaZero – notebook walkthrough

Hi, welcome to the screencast. Today, I’ll share with you how to train an alphazero agent to play a game of TicTacToe. Before I go into the Jupiter notebook, let’s go back up that directory to check out all the files available to you in the workspace. So, you should see something like this, it will be working with the alphazero TicTacToe Jupiter notebook. I’ll cover the more advanced version later. The important files are these Python files. ConnectN.py implements a gameplay environment. It’s a class that contains the game state, the game score, the game player. MCTS implements a Monte-Carlo Tree Search algorithm for the alphazero. Inside this class, there are functions such as explore that explores the tree, you can pick a move from the tree, and so forth. The play.py class implements an interactive environment for you to play the game. So, you can play a game against yourself, just to see, get a feeling of the structure of the game. You can play a game against a given AI or you can watch two AI, play against each other to see who wins. Now, let’s go back to our Jupiter notebook here. The first step is we can initialize a game. To do that, we simply initialize a ConnectN object. The inputs are the board size and the winning condition. Now, the ConnectN can accommodate a much larger board size with much more winning conditions, but for the case of TicTacToe, we’ll go for the simplest non trivial game, a three-by-three board with a winning condition win three three in a row, so we can do that. Initialize the game. Now, what does the game object contain? It contains members such as the state, which is a three-by-three matrix depending on the board size, and the player, the player alternates as you perform moves. So, the first player is player plus one, and the second player is player negative one, and so forth. Also, it keeps track of the game score. Every time you make a move, it tries the computer game score. When the game is not over yet, the game score is simply none. If it’s zero, that means a draw has been reached. If it’s plus one, the first player wins. If it’s negative one, the second player wins. So, let’s see what happens when we make a move. Now, notice that the game that the player was player one, so when I make a move zero one, player one plays a piece at the zero, one location. The player is automatically advanced to negative one. The game has not ended yet still, so the score is still none and the state has been updated. So, what happens when I make a sequence of move like this? You can see that, now, player one has successfully created three in a row, so the game has been automatically updated to game score equals one and the game has ended. Now, the next step is, let’s try to play a game interactively. We can do that by first, supplying this matplotlib notebook magic command in Jupiter. Then, you can use the play cat class from the play.py file. To initialize it, you simply feed in a game environment like this, you can initialize a new one, and you can supply the player. When the players are none, basically means that they are from mouse click, their moves are registered by mouse clicks, although, you could supply different AI agent, and I’ll cover that later. So, let’s play a game against ourselves to see how it looks. So, you have a game board, and I can click, and play a piece in the middle, where I always starts first. Then, I can place another piece. If I get three in a row, I win and the game is terminated. You can play with this just got a feel. Now, the next step is we need to initialize an AI to play the game. To do that, I create a neural network using PyTorch and I call this the policy. Although in alphazero, the neural network can actually contains both the policy and the critic. This is what we’ll follow here as well. So, I’ve already prepared a neural network for you, it contains a convolutional layer, fully linear layer, and the important parts are they are separate layers for the policy and the critic after these convolutional and linear layers. I encourage you to play with this to try to come up with even simpler policy that can learn to play TicTacToe. So, next step is we look at the forward function. The way this works is it takes in first, you use the convolutional layer, the fully connected layer, then they’re separate policy head and critic head that needs to be computed separately. Now, you can easily use a Softmax on a nine-by-nine matrix to get the probability, but this doesn’t work quite that well. Given that not all the moves are available because say, player one plays a piece in a position, then player negative one could not place another piece in that same location. To fix it, I compute this thing called the availability matrix, which basically gives zero when the move is unavailable in that location, and it gives one when a move is available. So, during the Softmax operation, when I compute the exponential, I can multiply that by the availability matrix and then compute the probability. Doing it this way will ensure that when a move is not legal, the probability will be exactly zero. So, it will make coding up the rest of the functions a little easier. Now, the critic head is fairly straightforward, I just feed it through some fully connected layers and output a value using the tanh functions, so that the value is in between negative one and one. Now, for training to work, I have to initialize a policy using our favorite optimizer. A learning ray and also include a regularization weight decay and feel free to play around with the numbers here. The next step is I need to code up a player that can use Monte Carlo tree search to play a game. So, this is defined here. Basically, each time I have a game I make a copy and I initialize a Monte Carlo tree search class. You can explore this tree by simply invoking this explore function by supplying a policy. This function will compute all the U’s, pick the branch with maximal U, search, expand, back-propagate and then increase the VisiCalc and after that you can use the next function and tell the tree to choose a next move. We’ll cover what these are later, these are useful for training but this function is only used for visualization so we don’t care about these. Then, what it does is it returns the move that was played after this increment, incrementing the tree. Now you might wonder what is this temperature parameter doing here. Well remember that in our lessons we mentioned that in Monte Carlo tree search especially during training we choose the next action based on the VisiCalc, and to encourage exploration we choose a VisiCalc based on a probability that’s proportional to the VisiCalc. But when we play a game we want to find the strongest move. So we don’t really want to make it too random. To do that we include this temperature parameter. Basically when temperatures approaches zero we’re essentially choosing the node with dot largest VisiCalc. So 0.1 is that it’s more or less good enough so that’s why we put this temperature equals 0.1 here. I’ve also supply you a random player that just makes random moves. So let’s see what this function does, so given that you initialize a game, the game board is all zero, the policy player will spit out a move that gives you the corner on where to place the next piece, so 1, 2 in this case. So, let’s see how well this policy does by play a game against it, so player one is none so that’s me, so policy player will play against me so let’s see. So clearly if you know tic-tac-toe you know that this is not an optimal move. So this is more or less random and I can easily beat, you see? So it does know how to block given that there is a tree search component, so it’s able to find moves that will stop it from losing. So notice that it’s able to block me, but it doesn’t know slightly more advanced strategy namely that you shouldn’t put a piece in the edge middle here when I put a piece in the center. Now let’s try to improve this agent, to do that you can initialize your policy and your optimizer here, and your optimizer here, I think that’s redundant if given that I already done it here, but you can do it again and then you can train your agent. Let’s go to the training loop, here I’ve included a progress bar just to tell you the estimated time remaining. Now let’s go through the training loop, for each episode I have to initialize a top node for the Monte Carlo tree search and whenever the game is not over yet, meaning the outcome is still none, I explore the tree 50 steps, after I explore the tree, I keep track of the player and then increment to the next tree. So in this step the tree was to choose an action based on the VisiCalc and these are the outputs that will be used to update my policy. V is the expected outcome computed from the tree search but it’s not going to be useful here, nn v is the critic value evaluating the current board. P is a list of probability of taking each action computed from Monte Carlo tree search, nn p is the same thing but with dot Monte Carlo tree search basically coming straight from the policy. So we need to compute a loss function by comparing p with nn p and also comparing the critic value to the actual outcome of the game which will be recorded when the game is over in the dot outcome. Now let’s go through how to calculate the loss function. First we compute the probability times log policy. This will contribute to part of the loss function, I’ve also included a constant term here, this is not necessary at all, given that this does not contain any of the policy information and will not change the back propagation. But the reason why I include this term which is essentially log p times p is such that when the policy probability is exactly equal to the Monte Carlo tree search probability, this will give zero, and that’s not the case when you just compute probability times log policy. Doing it this way I can keep track of the loss function as training happens given that it’s not so easy to keep track of whether learning has occurred in alpha zero, since we’re performing self play, there’s no metric to really see that learning has happened. So the loss function is a useful proxy after adding in this constant. So that we know that the smaller this whole term is, the more we sort of roughly speaking has learned. I’ve also included the critic value in the list but I have to multiply that by the current player. The reason I do this is because whenever a critic evaluates a board, the outcome is with respect to the current player. But at the end of the day the game outcome is with respect to the first player only. So in order to compare apples to apples I have to do this multiplication. Now the next step is the game has ended, I’ve saved the outcome then I compute the loss function and it contains two terms; The first term is the difference squared of the prediction from the critic minus the actual outcome. Pretty straightforward, and the second term is the cross entropy loss that’s computed here and I’m just adding them up, then I can perform back propagation and update my policy. Now, I’ve already done the training, it shouldn’t take long at all, roughly a minute or two, you see that I’ve kept track of mean loss, the loss is sort of decreasing and outcomes even from the get-go is already mostly draws and player one wins a majority of the time. It’s not obvious that training was done, but given that during training the policy will still choose move somewhat randomly. We just have to play a game and see it if that’s learned. So let’s plot the losses and you see that it’s sort of decreasing so something has happened, now let’s play a game using the policy that we’ve just trained like this, initialize the game and put player one and player two both as the policy player MCTS, let’s see what happens. You see that the first player puts the piece in the center, which is the strongest position possible, the second player knows how to stop it by placing a piece at the corner. So it’s all sensible. Play a few more games. Now you see that it more or less leads to a draw. Let’s try something more interesting, let’s see how I do against this policy. If I place a piece at the center it blocks, and if I miss this it wins, obviously. Let’s try again, okay, so it’s pretty good. Now to make things even more interesting let it be the first player and see what happens if I place a suboptimal move like this. You see that it knows how to take advantage of the suboptimal move and it’s now guaranteed to win. Let’s try again, okay it consistently knows how to exploit my suboptimal move to win and there we go, we’ve trained an alpha zero agent to play a game of tic-tac-toe.