
I'm trying to implement AlphaZero on a new game using this repository. I'm not sure if they are handling the MCTS search tree correctly.

The logic of their MCTS implementation is as follows:

  1. Get a "canonical form" of the current game state. This swaps the player colors, because the neural net always needs its input from the perspective of the player with ID = 1. If the current player is 1, nothing changes; if the current player is -1, the board is inverted.
  2. Call MCTS search. Source code
  3. In the expand-step of the algorithm, a new node is generated like this:
    next_s, next_player = self.game.getNextState(canonicalBoard, 1, a)
    next_s = self.game.getCanonicalForm(next_s, next_player)

Here "1" is the current player and "a" is the selected action. Since the current player passed in is always 1, next_player is always -1 and the board always gets inverted.
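To make that flow concrete, here is a minimal sketch of the expand step on a toy board. The functions `get_canonical_form` and `get_next_state` are hypothetical stand-ins written for this illustration, not the repository's code, and the flat three-cell board is an assumption as well.

```python
# Hypothetical stand-ins for the repository's Game methods.
# Boards are flat lists holding 1 (side to move), -1 (opponent), or 0 (empty).

def get_canonical_form(board, player):
    # Multiply by the player ID so the side to move always appears as 1.
    return [cell * player for cell in board]

def get_next_state(board, player, action):
    # Place the player's piece and hand the turn to the opponent.
    next_board = list(board)
    next_board[action] = player
    return next_board, -player

canonical_board = [0, -1, 0]  # the side to move is shown as 1
next_s, next_player = get_next_state(canonical_board, 1, 0)
next_s = get_canonical_form(next_s, next_player)

print(next_player)  # -1
print(next_s)       # [-1, 1, 0]
```

The piece that was just placed shows up as -1 in the child node, because the child is always stored from the point of view of the player who moves next.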

The problem occurs once we hit a terminal state:

  • Assume that action a ends the game
  • A next state (next_s) is returned by the getNextState method, and next_player is set to -1. The board gets inverted one last time (1 becomes -1, -1 becomes 1), so we now view the board from the perspective of the losing player. That means a call to getGameEnded(canonicalBoard, 1) will always return -1 (or 0.0001 if it's a draw), so we can never observe a win for the player with ID 1.
  • The getGameEnded function is implemented from the perspective of player with ID = 1. So it returns +1 if player with ID 1 wins, -1 if player with ID 1 loses.

My current understanding of MCTS is that we need to observe all possible end states of a two-player zero-sum game. I tried to use the framework on my game, and it didn't learn or improve. I then changed the game logic to explicitly keep track of the current player ID so that I can return all three possible outcomes. Now it at least seems to learn a bit, but I still think that something is wrong.

Questions:

  • Could this implementation theoretically work? Is it a correct implementation of the MCTS algorithm?
  • Does MCTS need to observe all possible outcomes of a two-player zero-sum game?
  • Are there any obvious quick fixes of the code? Am I missing something about the implementation?
desertnaut
House92

1 Answer


Conceptually, the implementation in the linked repo is correct. The terminal state is not evaluated until the search recurses one more time, into the perspective of the losing player. As soon as that value backs up one level, it is negated and counts as a win for the player who made the last move. From there it backs up the tree, flipping sign at every level, until it reaches the current state of the real game, which therefore receives the correct value.
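A minimal sketch of that sign flip, using plain exhaustive negamax instead of the repository's MCTS. The game (take 1 or 2 stones from a pile, whoever takes the last stone wins) and the names `game_ended` and `value` are assumptions made up for this illustration.

```python
def game_ended(pile):
    # Result from the perspective of the player to move.
    # 0 means the game is still running; -1 means the opponent just took
    # the last stone, so the player to move has lost.
    return -1 if pile == 0 else 0

def value(pile):
    """Value of the position for the player to move (negamax, no neural net)."""
    ended = game_ended(pile)
    if ended != 0:
        return ended
    best = -1
    for take in (1, 2):
        if take <= pile:
            # The child's value belongs to the opponent, so negate it on the way up.
            best = max(best, -value(pile - take))
    return best

print(value(1))  # 1: take the last stone and win
print(value(3))  # -1: every move leaves the opponent a winning position
```

The terminal check only ever returns -1 here, yet wins are still observed one level up, because each return value is negated when it is handed back to the parent.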

This does represent all possible outcomes. The outcomes are player 1 wins, player 2 wins or draw. In the case of a draw, it just returns something close to zero. In the case that player 1 wins, player 1 made the last move and then we recurse to player 2 who is in a losing evaluation. In the case that player 2 wins, player 2 would have made the last move and then we recurse once more to where player 1's evaluation is a loss.

Note that in some games it is possible to move last and still lose. The same logic remains correct in that case!


If you wrote your own game rules and are trying to get them to work with this code, it's best to make sure your game adheres to the assumptions made by this implementation, i.e. that the evaluation is always from the position of the active player and that your evaluation function is actually zero-sum.
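One way to test those two assumptions is a brute-force check over small boards. The sketch below is only an illustration: the 1x3 game, the `DRAW` constant, and the functions `get_canonical_form`, `get_game_ended`, and `check_assumptions` are all hypothetical stand-ins to be replaced with your own game logic.

```python
from itertools import product

DRAW = 1e-4  # small non-zero draw value, as described in the question

def get_canonical_form(board, player):
    # Flip the pieces so the given player appears as 1.
    return [cell * player for cell in board]

def get_game_ended(board, player):
    # Hypothetical 1x3 game: owning all three cells wins.
    if board[0] != 0 and board[0] == board[1] == board[2]:
        return 1 if board[0] == player else -1
    if all(cell != 0 for cell in board):
        return DRAW
    return 0  # game still running

def check_assumptions(boards, tol=1e-3):
    for board in boards:
        for player in (1, -1):
            direct = get_game_ended(board, player)
            opponent = get_game_ended(board, -player)
            canonical = get_game_ended(get_canonical_form(board, player), 1)
            # Zero-sum: one side's result is the negative of the other's (draws are ~0).
            assert abs(direct + opponent) <= tol, board
            # Canonical form: asking as player 1 on the flipped board must agree.
            assert canonical == direct, (board, player)
    return True

all_boards = [list(b) for b in product((-1, 0, 1), repeat=3)]
print(check_assumptions(all_boards))  # True
```

If either assertion fails for your own game, the sign flipping during backup will propagate wrong values.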

Nick Larsen
  • Thanks! In my game it is possible that a player has to make two consecutive moves (it's Nine Men's Morris: if a player closes a mill, he has to make another move to take an opponent's stone). This means that I cannot naively swap the game value. I guess it should be possible though. What exactly do you mean by "evaluation is always from the position of the active player"? I implemented it that way, but the method always gets called from the perspective of player 1 and the board is always from the perspective of player 1. I could maybe keep track of the current player. – House92 Aug 02 '22 at 18:26