9 – AlphaZero advanced tic-tac-toe walkthrough

Hi. Today, I’m going to show you how to use AlphaZero to train an agent to play a more advanced version of tic-tac-toe. Hopefully, by now you’ve had the chance to play with the basic version and successfully train an AlphaZero tic-tac-toe agent. This time, we’re going to initialize a slightly more complicated game: it’s played on a six-by-six board, you need four pieces in a row to win, and I’ve also included an extra rule called the pie rule, which I’ll explain later.

Now, let’s play an interactive game here. The reason a plain six-by-six, four-in-a-row game is not that interesting is that the first player has a strong advantage. What I mean is this: if I place the first piece here, the second player probably needs to start defending right away, say by putting a piece here. Now, if red puts another piece here, blue has to defend already, because if red gets another free move, the line will be open on both ends, either end can be extended to four in a row, and red will win. So blue has to defend by placing a move here or there; this side is more sensible, because now blue can attack as well. Red can keep doing this, and blue has to keep defending. Now red has to block this, then blue has to block this, and so on. You can already see where this goes: blue has to block this move, but red has another move here that makes three in a row, so blue will lose.

The pie rule is what makes this interesting. It works like this: if the first player plays a very strong opening move, the second player can choose to swap roles with player one, or, implemented more simply, player two can just place a piece on top of player one’s piece, like this. Now player two has the advantage. To avoid this, player one has to play an opening move that’s quite a bit weaker, so that player two won’t want to take it over. This is called the pie rule. It turns the tables and basically gives a slight advantage to player two, since player two can always choose the better of two options: either swap sides with player one, or place a new piece and let player one continue.

Now, the next step is to define a random policy. This is very similar to the tic-tac-toe one, except there are more convolutional layers and linear layers. There are also the policy head and the critic head. I’ve also played with the leaky rectified linear unit; you can try the plain rectified linear unit instead. I haven’t really tested which one is better, but you can give it a shot. There’s also the availability matrix, just like before, and so you can initialize a policy like this (a rough sketch of such a network follows at the end of this section).

Now, we define our Monte Carlo tree search player, which can be fed into the play class to play an interactive game. I recommend choosing somewhere around 1,000 or 2,000 explorations for the tree search, and even that is definitely not going to be enough for a random policy. So, let’s initialize it and play a game against the random policy to see how we do. The first move is more or less random, so I choose to play a strong move here. Let’s see what happens. This response is more or less useless; let’s see if red knows how to defend. Okay, it doesn’t know how to defend yet, and now I’m able to win. Clearly this random policy hasn’t learned anything yet: it doesn’t even know how to defend against a three-in-a-row attack. It definitely needs to be trained, and training does take quite a while.
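The network itself is only described, not shown, in this walkthrough, so here is a minimal sketch of what a policy along these lines might look like in PyTorch. The layer sizes, the input shape, and the exact way the availability matrix masks out occupied squares are assumptions for illustration, not the notebook’s actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """Sketch of a policy/critic network for the 6x6, four-in-a-row game."""

    def __init__(self):
        super().__init__()
        # shared convolutional trunk (sizes are illustrative)
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc = nn.Linear(32 * 6 * 6, 128)
        # policy head: one logit per board square
        self.policy_head = nn.Linear(128, 6 * 6)
        # critic (value) head: a single score in [-1, 1]
        self.value_head = nn.Linear(128, 1)

    def forward(self, x, available):
        # x: (batch, 1, 6, 6) board tensor; available: (batch, 36) mask of legal moves
        h = F.leaky_relu(self.conv1(x))   # leaky ReLU here; plain ReLU is worth trying too
        h = F.leaky_relu(self.conv2(h))
        h = F.leaky_relu(self.fc(h.view(h.size(0), -1)))

        logits = self.policy_head(h)
        # the "availability matrix": rule out occupied squares before the softmax
        logits = logits.masked_fill(available == 0, float('-inf'))
        prob = F.softmax(logits, dim=1)

        value = torch.tanh(self.value_head(h))
        return prob, value
```

The tree-search player queries this network at every one of its 1,000 to 2,000 explorations per move, so keeping the network fairly small keeps interactive play reasonably fast.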
You can go through the training loop here: set up your optimizer; the weight decay and learning rate are the same as for the basic tic-tac-toe. You can try different values of N_max, the total number of tree-search explorations per move. I’ve also included an extra step that saves a little bit of time: whenever I move on to the next tree, I only explore enough so that the total visit count for that branch reaches N_max. This saves time because, each time you go to the subsequent branch, some of its moves have already been explored, so you don’t have to run the entire 2,000 explorations again. I’ve also added an extra condition that saves your policy periodically, so you don’t lose your progress; feel free to make this more or less frequent, or to change the name of the saved policy. A sketch of this loop follows at the end of this section.

Training takes quite a while. It took me about a day to get my policy; I played around with training for a few different epochs, and in the end I came up with a very strong policy that you can try to beat. I’ve included it in your Jupyter Notebook: you can see the 6-6-4 pie-rule policy there. You can load it and play a game, and see if you can beat it, given that the second player has an advantage.

So let’s give it a shot. I’m going to try to play this game carefully and see if I can defeat the agent. Since it takes a bit of time for the agent to think about its next move, I’m going to speed up this part of the video. The first player has played its opening move, and I think I’ll take it over. As you can see, I was really trying to initiate an attack, defend, and then initiate another attack, but I missed the fact that red already had some winning moves. So you can see that this agent is quite strong and training seems to have succeeded; it’s pretty good at challenging a human player. I encourage you to play around with this, and maybe you’ll come up with an even better policy.
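For orientation, here is a rough sketch of what that training loop could look like. The ConnectN constructor, the MCTS.Node interface (explore, next, outcome, and the visit count N), and the compute_loss helper are hypothetical stand-ins for the notebook’s own code, and the optimizer values are placeholders; only the overall shape (top up each branch’s visit count to N_max, update the policy from the finished game, checkpoint periodically) mirrors what’s described above.

```python
import torch
import torch.optim as optim

# hypothetical stand-ins for the notebook's own helpers: ConnectN, MCTS, compute_loss
game_setting = {'board_size': 6, 'win': 4, 'pie_rule': True}

policy = Policy()   # the network sketched earlier
optimizer = optim.Adam(policy.parameters(), lr=0.01, weight_decay=1.0e-4)  # placeholder values

N_max = 2000        # target total visit count per move
episodes = 2000
save_every = 50     # checkpoint frequency; make this more or less frequent as you like

for episode in range(episodes):
    tree = MCTS.Node(ConnectN(**game_setting))   # fresh self-play game (assumed API)
    memory = []

    while tree.outcome is None:
        # only top up the visit count to N_max: visits inherited from the previous
        # move's subtree are reused, which is the time-saving trick mentioned above
        while tree.N < N_max:
            tree.explore(policy)
        tree, step_data = tree.next()
        memory.append(step_data)

    # turn the finished game into a combined policy/critic loss and update the network
    loss = compute_loss(memory, tree.outcome)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # periodic save so a day-long run doesn't lose its progress
    if (episode + 1) % save_every == 0:
        torch.save(policy.state_dict(), 'policy_6-6-4-pie.pth')
```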

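Finally, if you want to reproduce the challenge game at the end, the setup might look roughly like this. Again, ConnectN, MCTS.Node, Play, the last_move attribute, and the checkpoint filename are assumptions standing in for whatever your notebook actually defines; the point is just the shape: load the saved weights, wrap them in a tree-search player, and hand that player to the interactive play class.

```python
import torch
from copy import copy

# hypothetical names standing in for the notebook's helpers; the checkpoint
# filename is also an assumption, so use whatever your notebook provides
game = ConnectN(board_size=6, win=4, pie_rule=True)

challenge_policy = Policy()
challenge_policy.load_state_dict(torch.load('6-6-4-pie-rule-policy.pth'))
challenge_policy.eval()

def agent_player(current_game):
    """Pick a move by running ~1,000 tree-search explorations with the trained policy."""
    tree = MCTS.Node(copy(current_game))
    for _ in range(1000):
        tree.explore(challenge_policy)
    next_tree, _ = tree.next()
    return next_tree.game.last_move   # assumed attribute: the move the search chose

# interactive game: the trained agent opens, you respond (and can invoke the pie rule)
gameplay = Play(game, player1=agent_player, player2=None)
```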