8 – Alphazero python classes walkthrough

Hello, welcome. In this screencast, I want to walk you through how I implement some of the gaming environment and the tree search environment so that in case you want to edit the files, you can get an understanding of how they are implemented. So let’s go to the ConnectN.py to look at how the game class is implemented. So, the important pieces this class, the ConnectN class. When it initialize, it keeps track of the size of the game board, the winning conditions, the state is a bunch of stuff with a bunch of zeros, it also saves the last move whether there’s a pie_rule, whether the pie_rule switch side was applied. So, I also supply a copy function that allows the class to be copied more quickly. This is very useful when using the Monte Carlo tree search to explore. Now, whenever a move is placed, the code will try to compute a score. To speed things up, it only computes scores from the last move. Given that if you play the game sequentially, you can’t win a game without involving the last move. So, what it does is it takes the last move and compute the horizontal, vertical, diagonal lines around that last move to see how many of the same pieces are in a row and compute a score here. So, this is very fast, all this function. Now, there’s also a function that’s useful. It’s mostly for rendering that when the game has been finished, a player has won, it will return precisely the location of the pieces that won the game. So this is useful for displaying the graphics. Now, you can see the moves here is the function that advanced the game by one step. Now, it’s fairly straightforward it checks whether the locations are in range and whether the current location is zero. Although there is one subtlety here is the pie_rule. So, when the first player makes a move, the second player can overtake that piece. The way I’ve implemented is that, in the first move, if pie_rule is enabled, the first player will place instead of one, it will place 0.5, half, on the location and this code checks for that. If the second player decide to overtake that place, the 0.5 will become a negative one. Otherwise, the original move, 0.5, will turn into a legitimate piece plus one. There are also some useful function, available_moves just tells you which pieces are available. So that’s it for the ConnectN file. Let’s go back. Now, let me go to the Play.py file that does the rendering. There’s quite a bit of bookkeeping, and here there are a lot of functions that utilize matplotlib plotting functions. Now, the interesting part is when you initialize a class, it automatically calls play which tackles the rendering. Now, if one of the players none or one of them is human, I have to add in an interactive element when a button is clicked such that a move is registered. So, I have to do this when there’s a human player in in the play class, and you can check out the click function that will display a circle in a move. There’s also a draw_winner. So, every time you make a move, the game checks whether someone won, and then the draw winner will put an extra marker on the winning pieces to indicate that the game ends in a win and loss. Now, when both players are AIs, the function is slightly different. Instead of having an interactive element, I use a function animation that basically draws each move as each move is played. So you can check that out. You can also modify the coloring scheme, the sizes of things if you would like when you edit this function. Now, let’s go back and walk through the most important part of the code, the Monte Carlo tree search implementation. Now, the first subtlety is that I’ve included a bunch of transformation T0 to T7. Basically, when you have a square board and the board is symmetric under rotations and flips, the board shouldn’t change when I do different combinations of rotations and flip, and also the inversion operation. So, every time I put the game state to the neural network, I randomly choose one transformation together with one inverse transformation to apply to this state. This speed sub training by a little bit and it’s helpful. So, and this policy, this process_policy is a bridge between turning the policy in a game and compute the output that’s properly fed into the Monte Carlo tree search class. Now, let’s look at the implementation of the Monte Carlo tree. I’ve defined a node class. The node class contains children nodes, and U value, and the V value, and the N value. It also contains the V value evaluated from the neural network for bookkeeping. Also, probability evaluated from the neural network, and this is torch so such that later on when we compute the loss function for backpropagation. You can retrieve it from here. Now, a lot of this is bookkeeping wanted to know is that there’s this create_child function that will expand the current node into children and store it in the self.child member, and sometimes a children might be a winning move. So, in the create_child, it acknowledges that by telling the tree to always pick the winning move. So, all winning moves have U values equal to plus infinity, and that speed things up. Such that if there’s a winning move, I don’t have to explore all the other non-winning moves. It’s not necessary but it’s useful for training, especially given that training takes quite a while. Now, let’s look through the explore function. So, each time you explore, you have to play a new game simulated by the critic. So, you start from the top of the node. So, we start from the current node, it’s the self, and you then you keep going through the children, and then you pick the ones with maximum U. Sometimes there are multiple nodes with the same maximum U. In that case, I randomly choose an action, so random.choice. Now, as I mentioned before, there are some nice speedups we can obtain. For example, if all the moves are losing, then I’ve bookept it to be negative infinity so it knows that, and then it will convert the current U value to be infinite. You might wonder why if the maximum U is negative infinity, why that U is positive infinity? Well, because every time we go to the next level, the player gets changed. So, if right now, the maximum U for all the children are negative infinity, meaning they’re all losing, then it means previously the previous player made a really good move, and that’s a winning move for the previous player. So, I have this condition here and also update the V to be 1.0. Now, the same thing is true for when the maximum U is in infinite. It’s a winning move for the current player, but from the perspective of the previous player, that’s a losing game, given that the current player can win. So then, the current.U has to be negative infinity and also current.V is negative one. So this is not strictly necessary, but it saves time. Whenever we encounter winning move, we don’t have to explore the other possibilities. That’s useful. Now, as you go through this code, it’s essentially whenever I also keep these conditions when checking for winning conditions, and then you increment the V’s count, and you go backpropagation, you go back to the mother, increment the V’s count for the mother, and you also update the average V for the mother. Notice that there’s an extra negative sign here. This is because every time when I go to the next or previous layer, the player is switched and current.V is from the perspective of the current player. So, that’s why when I update the average, there is a negative current.V. Then after that, I can update all the U values in all the sibling branches. Now, onward to the next step, this next function basically choose the next action for the tree. You see that I find the maximum U. If the maximum U is infinite, that means there’s a winning move. I speed things up by telling the agent to always choose one the winning move, there’s no reason to not to. Now, when that doesn’t happen, then I have to look at the V’s count like this, and there’s a temperature parameter, it’s basically fixed to one by default so that the probabilities is simply proportional to the V’s counts. After that, it increments to the next step. You can see here the next day, I went on the choose the children based on the probability, the random.choices give you a list that has one element, so I just choose the zero element, and it outputs other useful values for the updating the policy. So, that’s it. That’s all the classes that are implemented for the AlphaZero agent.

%d 블로거가 이것을 좋아합니다: