I have tried to make my own implementation of the Monte Carlo Tree Search (MCTS) algorithm for a simple board game, and it seems to work reasonably well while learning. However, when I switch from playing to arena mode for evaluation, the MCTS gets stuck in an infinite loop.

The reason for this is that while learning it pseudo-randomly picks actions based on their probabilities, but during arena mode this is switched to picking the action most likely to win. Unfortunately, in one of the arena games this means the game ends up in a loop: a certain board state is reached, and after n actions that same board state is reached again, and again after every n actions...

I feel like I'm missing a component in the MCTS algorithm that should prevent this from happening. Or is this intended behaviour of MCTS, and is it instead a fault of the board game, which should then have a draw mechanism built in to detect such repetitions?

Tue

1 Answer

This can indeed happen in reinforcement learning. Another symptom is agents not really trying to end the game/episode even when they're easily able to do so and win.

Some possible solutions:

  • Modify the reward to give some small penalty to all agents (or only the winning agent) for longer games
  • Modify the environment to terminate after a fixed number of moves with some fixed reward, for example a draw with reward zero.

Combining both works too, with the latter acting as a failsafe and the former as a slight encouragement during the episode to try to make progress.
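
For concreteness, here is a minimal sketch of both ideas combined, assuming a gym-style `step()`/`reset()` interface; the wrapper class and the `max_moves`/`step_penalty` parameters are illustrative names, not part of any particular library:

```python
class LengthLimitWrapper:
    """Adds a small per-move penalty and a hard move cap to an environment."""

    def __init__(self, env, max_moves=200, step_penalty=0.01):
        self.env = env
        self.max_moves = max_moves        # failsafe: no game can run forever
        self.step_penalty = step_penalty  # gentle pressure to make progress
        self.moves = 0

    def reset(self):
        self.moves = 0
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        self.moves += 1
        reward -= self.step_penalty       # small penalty on every move
        if not done and self.moves >= self.max_moves:
            done = True                   # cut the game off...
            reward = 0.0                  # ...and score it as a draw
        return state, reward, done, info
```

The penalty should be small relative to the win reward, so it nudges agents toward finishing games without changing which outcomes they prefer.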

Todd Sewell
  • I think one of the issues I also see is that it seems to get stuck during the search itself. When evaluating, it enters the recursive search for a leaf node, which also manages to get stuck in a loop where the states repeat themselves and a leaf node is never found. – Tue Jan 12 '23 at 17:42
  • 1
    Just to clarify, this question is also about AlphaZero, right? You never do a recursive rollout there, you stop as soon as you find a new node and then you ask the NN about it. There is still an edge case where the search repeatedly visits a terminal node, just counting those visits towards the limit works to break that loop. In vanilla MCTS with random rollouts you can also limit the length of a rollout in some way to break potential infinite loops. – Todd Sewell Jan 12 '23 at 22:04
  • Yes, it is still about AlphaZero. You state that you never do recursive rollouts when using an NN, but isn't it still a recursive rollout when you start at some state and begin looking for a new node? I believe this is the part where the search algorithm gets stuck: when looking for a new unexplored node for the neural network to evaluate, it ends up in a loop and never finds a new node. – Tue Jan 12 '23 at 22:27
  • 1
    I should have said you can never do a "potentially infinite" rollout. Either you end up at a new node, or you end up at a terminal node. You can never get stuck, since the nodes form a tree which can't have any cycles. – Todd Sewell Jan 13 '23 at 00:18
  • Ahh, this is definitely something I have done wrong then. In my game each node represents a game state, and each child of a node represents a game state that can be reached by a single action. Each node is identified by a canonical representation of the board which does not include the turn, so that if the board state at turn 10 and turn 100 are identical, they are the same node. In my mind this makes sense, since I want the MCTS to realize that this is actually the same state it has seen before, but I guess that isn't how an MCTS actually works then... – Tue Jan 13 '23 at 00:35
  • 1
    You can use graphs like in [MCGS](https://arxiv.org/abs/2012.11045), but then there are some additional complications around Q value propagation. To break cycles they only combine nodes at the same depth, but I don't think that's optimal yet. – Todd Sewell Jan 13 '23 at 07:55
  • That was very illuminating. So rather than a tree search I made an MCGS, but since I didn't add depth information to the representation my graph was not acyclic, which is how I ended up with all these loop problems. I definitely like the idea of the graph search, but I can see the issues they are talking about, so for now I will try to make it into a simpler MCTS, and maybe revisit the MCGS approach at some point in the future. – Tue Jan 13 '23 at 17:05
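
To make the node-identity fix discussed in this thread concrete, here is a hedged sketch of a depth-keyed transposition table in the spirit of the MCGS paper's "combine nodes only at the same depth" rule; `canonical` stands in for the asker's canonical board representation, and all other names are illustrative:

```python
class Node:
    def __init__(self, board):
        self.board = board
        self.children = {}    # action -> child Node
        self.visits = 0
        self.value_sum = 0.0

# Transposition table: (canonical board, depth) -> Node
nodes = {}

def get_node(board, depth, canonical):
    # Keying by the canonical board alone merges positions reached at
    # different turns, which is exactly what created the cycles above.
    # Adding the depth still merges transpositions within a layer, but a
    # repeated position deeper in the game becomes a distinct node, so the
    # selection phase, which always increases depth, can never loop.
    key = (canonical(board), depth)
    if key not in nodes:
        nodes[key] = Node(board)
    return nodes[key]
```

A plain tree search avoids the problem entirely by never merging nodes at all; the (board, depth) key is a middle ground that keeps some of the statistics-sharing benefit of a graph while staying acyclic.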