Reading this article was very helpful in getting a good understanding of the principles behind AlphaZero. Still, there is something I am not completely sure about.
Below is the author's UCT_search method, as it appears in his code on GitHub: https://github.com/plkmo/AlphaZero_Connect4/tree/master/src
Here, UCTNode.backup() adds the net's value_estimate to all traversed nodes (see also this 'cheat sheet'); a rough sketch of what I understand such a backup to do follows the code below.
    def UCT_search(game_state, num_reads, net, temp):
        root = UCTNode(game_state, move=None, parent=DummyNode())
        for i in range(num_reads):
            leaf = root.select_leaf()
            encoded_s = ed.encode_board(leaf.game); encoded_s = encoded_s.transpose(2, 0, 1)
            encoded_s = torch.from_numpy(encoded_s).float().cuda()
            child_priors, value_estimate = net(encoded_s)
            child_priors = child_priors.detach().cpu().numpy().reshape(-1); value_estimate = value_estimate.item()
            if leaf.game.check_winner() == True or leaf.game.actions() == []:  # if somebody won or draw
                leaf.backup(value_estimate); continue
            leaf.expand(child_priors)  # need to make sure valid moves
            leaf.backup(value_estimate)
        return root
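For context, my understanding of backup is that it walks back up the tree via parent pointers, crediting every traversed node with the evaluation. A minimal sketch of that idea, with hypothetical field names (parent, number_visits, total_value) rather than the author's exact implementation:

    class Node:
        def __init__(self, parent=None):
            self.parent = parent
            self.number_visits = 0
            self.total_value = 0.0

        def backup(self, value_estimate):
            # Walk from the evaluated leaf back up to the root,
            # updating the statistics of every node on the path.
            current = self
            while current is not None:
                current.number_visits += 1
                current.total_value += value_estimate
                # In a two-player zero-sum game the sign flips at each
                # level: a position good for one player is bad for the other.
                value_estimate = -value_estimate
                current = current.parent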
This method seems to visit only the nodes directly connected to the root node.
Yet the original DeepMind paper (on AlphaGo Zero) says:
Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)), until a leaf node s' is encountered.
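If I read that correctly, every step of the descent picks the child maximising the PUCT score, not just the first step below the root. Here is a rough sketch of that selection rule as I understand it (the prior, total_value, number_visits and children fields, and the best_child helper, are my own placeholder names, not the author's API):

    import math

    def best_child(node, c_puct=1.0):
        # Pick the child maximising Q(s, a) + U(s, a), where
        # U(s, a) = c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a)).
        def score(child):
            q = child.total_value / (1 + child.number_visits)  # exploitation term
            u = c_puct * child.prior * math.sqrt(node.number_visits) / (1 + child.number_visits)  # exploration term
            return q + u
        return max(node.children.values(), key=score)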
So instead, I would expect something like:

    def UCT_search():
        for i in range(num_reads):
            current_node = root
            while current_node.is_expanded:
                …
                current_node = current_node.select_leaf()
            current_node.backup(value_estimate)
(UCTNode.is_expanded is False if the node has not been visited yet, or if it is a terminal state, i.e. the end of the game.)
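To make my expectation concrete, here is a fuller (purely hypothetical) version of the loop I had in mind, descending one level at a time until an unexpanded node is reached; evaluate and is_terminal are my own placeholders, and best_child is the selection sketch from above:

    def UCT_search(root, num_reads, net):
        for _ in range(num_reads):
            # 1. Selection: from the root, repeatedly pick the child with
            #    the highest Q(s, a) + U(s, a) until an unexpanded node.
            current_node = root
            while current_node.is_expanded:
                current_node = best_child(current_node)
            # 2. Expansion and evaluation: query the network once at the leaf.
            child_priors, value_estimate = evaluate(net, current_node)
            if not current_node.is_terminal():
                current_node.expand(child_priors)
            # 3. Backup: propagate the evaluation to all traversed nodes.
            current_node.backup(value_estimate)
        return root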
Can you please explain why this is the case, or am I overlooking something?
Thanks in advance!