6

I've been reading through the Monte Carlo Tree Search survey paper by Browne et. al:

http://ccg.doc.gold.ac.uk/papers/browne_tciaig12_1.pdf

"A Survey of Monte Carlo Tree Search Methods"

I'm wrestling with just one piece of the pseudocode on p. 9. My question occurs in a similar form in both the Backup and BackupNegamax functions.

Suppose that I'm player 1 in a 2-player zero-sum game. (So, using the BackupNegamax function.) It's my turn to move, and I'm using MCTS to choose my move. In BackupNegamax, why is the delta value negated as you back up the tree? I understand that in a two-player zero-sum game, if the reward is delta for player 1 (me), then it's -delta for player 2. But shouldn't the entire tree be from player 1's perspective? (This would then be similar to how nodes are rated in a minimax tree, if I'm not mistaken.)

If the perspective of the Q value switches back and forth depending on what level of the tree you're on, wouldn't that mess up the calculation shown in the BestChild function? Specifically, suppose some node v has a very high Q value, because it has often led to high rewards for player 1. The given pseudocode seems to suggest that v's parent, which I'll call u, would likely have a very low (very negative) Q value (of course u's Q value would also account for its other children's Q values.)

So it doesn't make sense to me that u (the parent) would have a very low Q value while v (the child) has a very high one. I know v is from player 1's perspective in the pseudocode, and u is from player 2's perspective, but my question is why. Why aren't both node's Q values stored from player 1's perspective? That way both u and v would have high Q values, and thus high exploitation ratings, and they'd both be considered valuable for further exploitation according to the BestChild function.

(I'm coming at MCTS from experience with minimax, and in minimax the entire tree is from Max's perspective, so that's why I'm struggling with the different idea here.)

My question also applies to Backup - why is each Q value updated according to the perspective of the player at that level of the tree, instead of everything being updated from "my" perspective?

I hope I've been clear in my question. Thank you very much for your help!

Bob Smith
  • 143
  • 2
  • 7

4 Answers4

6

There are two ways to describe this mechanism:

  1. Globally: From the perspective of the root player, in which case the playout values at every second ply are negated, as the opponent is acting against the root player.

  2. Locally: From the perspective of the player who has just moved at each ply, in which case the playout value is not negated, as each player tries to maximise their own reward.

The standard formulation uses option 1, as it's easier to describe, and has its basis in two-player combinatorial games. However, I tend to use the second formulation in my actual implementations, as it is more flexible; it handles games with more than two players, less than two players, variable move order, multi-part moves, cooperative goals, etc.

This just confirms what is said in the other answers.

Adam Stelmaszczyk
  • 19,665
  • 4
  • 70
  • 110
2

There are two ways to look at MCTS algorithm:

  1. From the perspective of the root player.
  2. From the perspective of the player that has just moved.

I found way 1 to be more popular. For example Wikipedia explanation uses it.

Reference MCTS implementations using way 1: C++, Java.

Adam Stelmaszczyk
  • 19,665
  • 4
  • 70
  • 110
  • That makes sense and is how I understood things to work. My question, then, is how to make sense of the BackupNegamax pseudocode function as indicated in the paper by Browne et. al. This is a classic paper, so I don't imagine it's wrong - just a different formulation, perhaps? Browne's class notes at http://ccg.doc.gold.ac.uk/teaching/ludic_computing/ludic16.pdf , p. 12 about Backpropagation, also suggest negating values at each ply. – Bob Smith May 28 '15 at 15:09
  • @BobSmith Exactly, it's not wrong, it's just a different formulation. – Adam Stelmaszczyk May 28 '15 at 16:03
  • the java example link is gone – alexzzp Oct 16 '16 at 08:45
0

I have been confused with the MCTS for a while, especifially for backpropagation part. If each node's win value (called Q) is used to indicate current node's player's winner times. In each non-expandable node we choose the largest UCT node. how can it be a good choice? Consider following two player game, the full tree is like this:

A / | \ B1 B2 B3 | A1

In the tree B1,B3 is a B win terminal node, while B2 has only one choice that leads to a A win terminal node A1.

If we caculate the game in MCTS methods, the result will be like following graph:

enter image description here

So the best choice will be B1 or B3 for A, which is ridiculous, how to explain it?

ref: MCTS caculation process reference

alexzzp
  • 419
  • 2
  • 6
  • 15
0

for a loss or a win terminal situation, you should use int.max score or int.lowest score so when you backpropogate a loss will have the lowest possible score no matter how low in the tree you are, and a win will be the highest score

dhkts1
  • 1