
Reading this article gave me a good understanding of the principles behind AlphaZero. Still, there is one thing I am not completely sure about.

Below is the author's UCT_search method, taken from his code on GitHub: https://github.com/plkmo/AlphaZero_Connect4/tree/master/src
Here, UCTNode.backup() adds the net's value_estimate to all traversed nodes (see also this 'cheat sheet').

def UCT_search(game_state, num_reads,net,temp):
    root = UCTNode(game_state, move=None, parent=DummyNode())
    for i in range(num_reads):
        leaf = root.select_leaf()
        encoded_s = ed.encode_board(leaf.game); encoded_s = encoded_s.transpose(2,0,1)
        encoded_s = torch.from_numpy(encoded_s).float().cuda()
        child_priors, value_estimate = net(encoded_s)
        child_priors = child_priors.detach().cpu().numpy().reshape(-1); value_estimate = value_estimate.item()
        if leaf.game.check_winner() == True or leaf.game.actions() == []: # if somebody won or draw
            leaf.backup(value_estimate); continue
        leaf.expand(child_priors) # need to make sure valid moves
        leaf.backup(value_estimate)
    return root
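The code above relies on UCTNode.backup() to push value_estimate into every node on the path from the leaf back to the root. The sketch below illustrates that idea with a hypothetical Node class (not the repository's UCTNode): it follows parent pointers upward, incrementing the visit count and adding the value at each node. Many two-player implementations also negate the value at alternate levels so that it is seen from the correct player's side; this sketch deliberately leaves that out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical minimal tree node; not the repository's UCTNode."""
    parent: Optional[Node] = None
    visits: int = 0
    total_value: float = 0.0

    def backup(self, value_estimate: float) -> None:
        # Walk from this node up to the root, crediting every node on the path.
        # (No sign flipping between players in this simplified sketch.)
        node: Optional[Node] = self
        while node is not None:
            node.visits += 1
            node.total_value += value_estimate
            node = node.parent


root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
leaf.backup(0.5)
print(root.visits, root.total_value)  # 1 0.5
```

One call on the leaf updates all three nodes, which is why a single backup per simulation is enough even when the selected leaf is deep in the tree.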


This method seems to visit only the nodes directly connected to the root node.
Yet the original DeepMind paper (on AlphaGo Zero) says:

Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)), until a leaf node s' is encountered.
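For concreteness, the selection rule in the quote can be written as a small function. This is a sketch assuming the full form of U(s, a) given in the paper's methods section, U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)); the default c_puct = 1.0 is an arbitrary illustrative value, and the function names are hypothetical.

```python
from __future__ import annotations

import math
from typing import Sequence


def puct_scores(q: Sequence[float], p: Sequence[float], n: Sequence[int],
                c_puct: float = 1.0) -> list[float]:
    """Return Q(s,a) + U(s,a) for every action a of a single node.

    q: mean action values, p: prior probabilities, n: visit counts.
    """
    total_visits = sum(n)  # sum_b N(s, b)
    return [qa + c_puct * pa * math.sqrt(total_visits) / (1 + na)
            for qa, pa, na in zip(q, p, n)]


def best_action(q: Sequence[float], p: Sequence[float], n: Sequence[int],
                c_puct: float = 1.0) -> int:
    """Index of the action that maximises Q + U."""
    scores = puct_scores(q, p, n, c_puct)
    return max(range(len(scores)), key=scores.__getitem__)


print(best_action(q=[0.0, 0.2, 0.1], p=[0.6, 0.3, 0.1], n=[1, 2, 0]))  # 0
```

With these example statistics, action 0 wins: its high prior outweighs the higher Q of action 1, and the 1 + N(s, a) denominator shrinks the bonus of actions that have already been visited often.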

So instead, I would expect something like:

def UCT_search():
    for i in range(num_reads):
        current_node = root
        while current_node.is_expanded:
            …
            current_node = current_node.select_leaf()
        current_node.backup(value_estimate)

(UCTNode.is_expanded is False if the node has not been visited yet or is an end state, i.e. the end of the game.)


Can you please explain why this is the case? Or am I overlooking something?
Thanks in advance

Jonas De Schouwer
Comment from instinct71 (Jan 13 '20 at 05:51): I think that as the game state is updated, the root node gets updated automatically, so the new position becomes the root from which the function starts exploring.

1 Answer


The logic you mention is inside the select_leaf() method: starting from the root, it repeatedly moves to the best child until it reaches a leaf, so it does not just select among the nodes directly connected to the root.
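To illustrate, here is a minimal sketch of such a descent. It uses a hypothetical TreeNode class rather than the repository's UCTNode, and a PUCT-style score with an arbitrary c_puct = 1.0. The while loop in select_leaf() keeps choosing the best child as long as the current node has children, so a single simulation can reach nodes several levels below the root.

```python
from __future__ import annotations

import math
from typing import Dict, Optional


class TreeNode:
    """Hypothetical node illustrating the idea of select_leaf; not the repo's code."""

    def __init__(self, prior: float = 1.0, parent: Optional[TreeNode] = None):
        self.prior = prior
        self.parent = parent
        self.children: Dict[int, TreeNode] = {}
        self.visits = 0
        self.total_value = 0.0

    @property
    def is_expanded(self) -> bool:
        return bool(self.children)

    def q(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0

    def score(self, c_puct: float = 1.0) -> float:
        # Q + U with U = c_puct * P * sqrt(N_parent) / (1 + N_child).
        # Only called on children, so self.parent is never None here.
        u = c_puct * self.prior * math.sqrt(self.parent.visits) / (1 + self.visits)
        return self.q() + u

    def select_leaf(self) -> TreeNode:
        # "Iteratively select moves ... until a leaf node is encountered":
        # descend while the current node already has children.
        node = self
        while node.is_expanded:
            node = max(node.children.values(), key=lambda c: c.score())
        return node

    def expand(self, priors: Dict[int, float]) -> None:
        for move, p in priors.items():
            self.children[move] = TreeNode(prior=p, parent=self)


root = TreeNode()
root.expand({0: 0.7, 1: 0.3})
root.children[0].expand({0: 0.2, 1: 0.8})
root.visits, root.children[0].visits = 2, 1  # pretend statistics from earlier simulations
leaf = root.select_leaf()
print(leaf is root.children[0].children[1])  # True
```

Here the returned leaf sits two levels below the root: child 0 wins at the first level (about 0.49 vs 0.42), and its child 1 wins at the second level because of its larger prior.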

Cash Lo