Reading this article was very helpful in getting a good understanding of the principles behind AlphaZero. Still, there is something I am not completely sure about.
Below is the author's UCT_search method, as it appears in his code on GitHub: https://github.com/plkmo/AlphaZero_Connect4/tree/master/src
Here, UCTNode.backup() adds the net's value_estimate to all traversed nodes (see also this 'cheat sheet'); a rough sketch of what I understand such a backup to do follows the code below.
    def UCT_search(game_state, num_reads, net, temp):
        root = UCTNode(game_state, move=None, parent=DummyNode())
        for i in range(num_reads):
            leaf = root.select_leaf()
            encoded_s = ed.encode_board(leaf.game); encoded_s = encoded_s.transpose(2, 0, 1)
            encoded_s = torch.from_numpy(encoded_s).float().cuda()
            child_priors, value_estimate = net(encoded_s)
            child_priors = child_priors.detach().cpu().numpy().reshape(-1); value_estimate = value_estimate.item()
            if leaf.game.check_winner() == True or leaf.game.actions() == []:  # if somebody won or draw
                leaf.backup(value_estimate); continue
            leaf.expand(child_priors)  # need to make sure valid moves
            leaf.backup(value_estimate)
        return root
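For context, my understanding of backup is that it walks back up the tree via parent pointers, crediting every traversed node with the evaluation. A minimal sketch of that idea, with hypothetical field names (parent, number_visits, total_value) rather than the author's exact implementation:

    class Node:
        def __init__(self, parent=None):
            self.parent = parent
            self.number_visits = 0
            self.total_value = 0.0

        def backup(self, value_estimate):
            # Walk from the evaluated leaf back up to the root,
            # updating the statistics of every node on the path.
            current = self
            while current is not None:
                current.number_visits += 1
                current.total_value += value_estimate
                # In a two-player zero-sum game the sign flips at each
                # level: a position good for one player is bad for the other.
                value_estimate = -value_estimate
                current = current.parent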
This method seems to visit only the nodes directly connected to the root node.
Yet the original DeepMind paper (on AlphaGo Zero) says:
Each simulation starts from the root state and iteratively selects moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 + N(s, a)), until a leaf node s' is encountered.
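If I read that correctly, every step of the descent picks the child maximising the PUCT score, not just the first step below the root. Here is a rough sketch of that selection rule as I understand it (the prior, total_value, number_visits and children fields, and the best_child helper, are my own placeholder names, not the author's API):

    import math

    def best_child(node, c_puct=1.0):
        # Pick the child maximising Q(s, a) + U(s, a), where
        # U(s, a) = c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a)).
        def score(child):
            q = child.total_value / (1 + child.number_visits)  # exploitation term
            u = c_puct * child.prior * math.sqrt(node.number_visits) / (1 + child.number_visits)  # exploration term
            return q + u
        return max(node.children.values(), key=score)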
So instead, I would expect something like:

    def UCT_search():
        for i in range(num_reads):
            current_node = root
            while current_node.is_expanded:
                …
                current_node = current_node.select_leaf()
            current_node.backup(value_estimate)
(UCTNode.is_expanded is False if the node has not been visited yet, or if it is a terminal state, i.e. the end of the game.)
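To make my expectation concrete, here is a fuller (purely hypothetical) version of the loop I had in mind, descending one level at a time until an unexpanded node is reached; evaluate and is_terminal are my own placeholders, and best_child is the selection sketch from above:

    def UCT_search(root, num_reads, net):
        for _ in range(num_reads):
            # 1. Selection: from the root, repeatedly pick the child with
            #    the highest Q(s, a) + U(s, a) until an unexpanded node.
            current_node = root
            while current_node.is_expanded:
                current_node = best_child(current_node)
            # 2. Expansion and evaluation: query the network once at the leaf.
            child_priors, value_estimate = evaluate(net, current_node)
            if not current_node.is_terminal():
                current_node.expand(child_priors)
            # 3. Backup: propagate the evaluation to all traversed nodes.
            current_node.backup(value_estimate)
        return root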
Can you please explain why this is the case, or am I overlooking something?
Thanks in advance!