I'm creating an MCTS (Monte Carlo Tree Search) program for a 2-player game.

For this I create nodes in the tree from alternating perspectives (the first node is from the perspective of player 1, its child nodes are from the perspective of player 2, and so on).

When determining the final move (after simulating many nodes), I pick the move that has the highest win chance. This win chance depends on the win chances in deeper nodes. For example, assume I have 2 legal moves to make. For the first (call the associated node C1, for Child 1) I have done 100 simulations and won 25, while for the second (C2) I did 100 simulations and won 50. The first node then has a win chance of 25% versus 50% for the second, so I should prefer the second node.
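
To make this concrete, here is a minimal sketch of that final move selection; the Node class and its field names are illustrative assumptions, not from any particular library:

```python
# Minimal sketch of the final move selection described above.
# 'Node' and its fields are illustrative, not from a real library.
class Node:
    def __init__(self, move, wins, visits):
        self.move = move        # the move leading to this child
        self.wins = wins        # simulations won from my perspective
        self.visits = visits    # total simulations through this child

def best_move(children):
    # Pick the child with the highest empirical win rate.
    return max(children, key=lambda c: c.wins / c.visits).move

# The example above: C1 = 25/100, C2 = 50/100.
children = [Node("C1", 25, 100), Node("C2", 50, 100)]
print(best_move(children))  # prints "C2"
```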

However, this does not take into account the "likely" moves that my opponent will make. Assume that from C2 there are two possible legal moves (for my opponent); let's call these C21 and C22. I did 50 simulations for both, and in C21 my opponent won 50 games out of 50 (a 100% win chance), while in C22 they won 0 out of 50 (a 0% win chance). Having done these simulations, I can see that it is much more likely that my opponent will take move C21 and not C22. That means that if I take move C2, my -statistical- win chance is 50%, but my -expected- win chance is close to 0%.
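
To make the arithmetic explicit (with p denoting the probability, assumed close to 1, that the opponent plays C21):

$$P(\text{I win} \mid C2) \approx p \cdot 0\% + (1-p) \cdot 100\% = (1-p) \cdot 100\% \approx 0\%$$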

Taking this information into account, I would select move C1 and not C2, even though the -statistical- chance of winning is lower. And I could program my algorithm to do exactly this, improving its performance.

This seems like a very obvious improvement for the MCTS algorithm, but I have not seen any reference to it, which makes me suspect that I'm missing something essential.

Can anybody point out the flaw in my reasoning or point me to any articles that deal with this?

1 Answer

Assume that from C2 there are two possible legal moves (for my opponent); let's call these C21 and C22. I did 50 simulations for both, and in C21 my opponent won 50 games out of 50 (a 100% win chance), while in C22 they won 0 out of 50 (a 0% win chance).

This situation should never be possible, due to the "Selection" step in the MCTS algorithm. The Selection step is the part of the algorithm that traverses the tree using a bandit algorithm (most commonly the UCB1 algorithm). Only afterwards, once you've reached a point in the tree that has not been fully expanded, will you start the (semi-)random Play-Out phase.
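
For reference, the standard UCB1 score for a child j is

$$UCB1(j) = \bar{X}_j + c \sqrt{\frac{\ln N}{n_j}}$$

where $\bar{X}_j$ is the child's average score (from the perspective of the player to move), $n_j$ is the child's visit count, $N$ is the parent's visit count, and $c$ is an exploration constant (often $\sqrt{2}$).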

The Selection step (which, in nodes where your opponent is to move, should use statistics from "their perspective") traverses to children based on a balance between exploration and exploitation: it assigns high scores to children with good scores (good from the opponent's perspective if they're to move), but also assigns high scores to children that have relatively few visits.

A proper implementation of the Selection step should make the situation you're sketching impossible. The Selection step would not evenly distribute its simulations between C21 and C22. It would rather assign, for example, 95 simulations to C21 and 5 simulations to C22.

It starts out with 1 simulation for each: C21 would win, C22 would lose. Then it would assign, for example, the next 2 simulations to C21, because that one has a better average score (from the opponent's perspective). Then it might assign the fourth simulation to C22 again for the sake of exploration. Simulations 5, 6, 7, and 8 (for example) would go to C21 again to exploit the good score, maybe simulation 9 to C22 again for exploration, and so on.
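
Here is a minimal sketch of such a Selection step; the Node class, its field names, and the score-storing convention are illustrative assumptions, not from any particular MCTS library:

```python
import math

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []      # expanded child nodes
        self.visits = 0
        # Convention assumed here: total_score accumulates simulation
        # results from the perspective of the player to move at the
        # PARENT (i.e. the player choosing among the children), so a
        # plain maximization exploits good moves for whoever moves.
        self.total_score = 0.0

def ucb1(child, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average score plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")     # unvisited children are tried first
    exploit = child.total_score / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select(node):
    """Descend the tree, always taking the child with the best UCB1 score.

    Simplified: assumes every node with children is fully expanded.
    """
    while node.children:
        node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
    return node
```

With this selection rule, C21 (which looks strong for the opponent) accumulates far more visits than C22, which is exactly the behavior walked through above.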

With an algorithm like UCB1 in the Selection step, it can be proven that, in the limit (given an infinite number of simulations), exploitation outweighs exploration so heavily that the average score converges to the true minimax score.

Dennis Soemers