I'm implementing an AI that plays 2048 using Monte Carlo tree search (MCTS). According to Wikipedia (https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) and every other source I have checked, the selection step should use the UCB formula w_i/n_i + c*sqrt(ln(N)/n_i) to determine which node to visit. This formula works well when the final score is either 0 or 1 (lose or win). However, it doesn't work as-is for 2048, because the score is a value between 0 and n that we want to maximize.

Does anyone know which UCB formula is appropriate for MCTS when the score is a value between 0 and n, so that I can use it for 2048?

Thank you.

manlio
joan capell
  • This question is probably off-topic here. https://ai.stackexchange.com/ would be a better fit, since it's not so much about a specific programming issue as it is about the concepts behind the algorithm. – Dennis Soemers Aug 16 '19 at 19:51

1 Answer

The highest possible score for 2048 seems to be somewhere near 4000000 points.

So you just have to scale the maximum possible score to 1:

game_score / 3932156

Squashing to the [0, 1] range is quite common.
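
As a minimal sketch of how the scaling above could feed into UCB1: the function names (`normalize`, `ucb1`) and the default exploration constant `c = sqrt(2)` are illustrative assumptions, not something prescribed by the question or by any particular MCTS library. The maximum score is the value quoted above.

```python
import math

# Assumed upper bound on a 2048 score, taken from the answer above.
MAX_SCORE = 3932156


def normalize(game_score, max_score=MAX_SCORE):
    """Scale a raw game score into [0, 1], clamping out-of-range values."""
    return min(max(game_score / max_score, 0.0), 1.0)


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 value of a child node.

    total_reward is the sum of normalized rewards backed up through the
    child, visits is the child's visit count (n_i) and parent_visits is
    the parent's visit count (N).
    """
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration
```

During backpropagation you would add `normalize(final_score)` to each node's running total instead of a 0/1 win indicator; the selection rule itself stays unchanged.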

A possible issue is the gap between the maximum possible score and the scores that actually occur. In 2048, typical scores may be far lower than the maximum, so simple scaling would squeeze most scores into a tight range near 0 (leaving the rest of the range up to 1 rarely used).

This may have unintended consequences in the UCT calculation, as nodes would look more similar than they should due to this squashing (under an unrealistically high maximum possible score).
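
To make the effect concrete, the sketch below compares the difference in exploitation terms between two sibling nodes with the size of the exploration term. The mean scores (20000 and 30000), the visit counts and `c = sqrt(2)` are made-up illustrative values, not measurements from real 2048 play.

```python
import math

# Assumed upper bound on a 2048 score, as used in the scaling above.
MAX_SCORE = 3932156


def exploitation_gap(mean_a, mean_b, max_score=MAX_SCORE):
    """Difference between two children's average rewards after scaling."""
    return abs(mean_a - mean_b) / max_score


def exploration_term(visits, parent_visits, c=math.sqrt(2)):
    """Exploration bonus c * sqrt(ln(N) / n_i) from the UCB formula."""
    return c * math.sqrt(math.log(parent_visits) / visits)


# Hypothetical siblings: one averages 20000 points, the other 30000.
gap = exploitation_gap(20000, 30000)
bonus = exploration_term(visits=50, parent_visits=100)
print(f"exploitation gap: {gap:.5f}, exploration bonus: {bonus:.5f}")
```

With these assumed numbers the exploration bonus is far larger than the gap between the two averages, so the search behaves almost as if both moves were equally good, even though one scores 50% more than the other.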

You have to experiment: it may also turn out that the accuracy of the squashing has minimal impact (for further details, take a look at "Using Domain Knowledge to Improve Monte-Carlo Tree Search Performance in Parameterized Poker Squares" by Robert Arrington, Clay Langley and Steven Bogaerts).

manlio
  • Using a log scale could avoid the issue where most normalised scores are close to 0. – myrtlecat Sep 04 '19 at 16:39
  • Using that "full" normalisation by the maximum theoretical score can be bad in practice (if the rewards realistically tend to be on a much smaller scale). It may be better to normalise dynamically based on what you observe, and this can also be done locally inside the search tree (normalising based on different bounds in different subtrees). See, for example, the top of page 21 in [this paper](https://www.jair.org/index.php/jair/article/view/11099/26289) (and also further discussion of normalisation later in the paper) – Dennis Soemers Sep 04 '19 at 16:55
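
The two suggestions in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: `log_normalize` and `RunningBounds` are hypothetical names, the use of `log1p` is one possible choice of log scale, and the min/max tracking is a simple reading of "normalise based on what you observe" rather than the exact scheme of the linked paper.

```python
import math


def log_normalize(game_score, max_score=3932156):
    """Map a score into [0, 1] on a log scale.

    log1p is used so that a score of 0 maps to 0; max_score is the
    assumed theoretical maximum quoted in the answer.
    """
    return math.log1p(max(game_score, 0)) / math.log1p(max_score)


class RunningBounds:
    """Tracks the lowest and highest scores seen so far and rescales to [0, 1]."""

    def __init__(self):
        self.low = math.inf
        self.high = -math.inf

    def update(self, score):
        """Record a newly observed raw score."""
        self.low = min(self.low, score)
        self.high = max(self.high, score)

    def normalize(self, score):
        """Rescale a raw score using the bounds observed so far."""
        if self.high <= self.low:
            return 0.5  # no spread observed yet, so return a neutral value
        return (score - self.low) / (self.high - self.low)
```

One `RunningBounds` instance could be kept for the whole tree, or one per node to normalise locally within a subtree. Because the bounds move over time, it is probably simpler to store raw score sums in the nodes and normalise the average only when the UCB value is computed.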