I am implementing a Go-playing program roughly following the architecture of the earlier versions of AlphaGo (AlphaGo Fan or AlphaGo Lee), i.e. using a policy network, a value network, and Monte Carlo tree search (MCTS). Currently I have trained a decent policy network and an insensitive value network, and I do not have a fast rollout policy. By "insensitive" I mean that the value network cannot judge complicated situations: it only outputs a win rate around 50% unless the position is quiet. The value network judges quiet positions (no big fight going on) correctly.
Using this policy network and value network, I also implemented the MCTS algorithm (the evaluation of a tree node is done only by the value network). Since the value network is not accurate, I am afraid MCTS is prone to get trapped in bad moves before its time budget runs out. To better tune the hyperparameters of MCTS and remedy the bad influence of the inaccurate value network, I have two questions:
- Node selection is done by `arg max (p_value + lambda * p_policy / visit_cnt)`. Does fine-tuning the parameter `lambda` help?
- Intuitively I want MCTS to explore as deep as possible. In the node expansion stage, does setting the expansion condition to "expand a leaf once it has been visited a very small number of times, like 3" help? What expansion method should I use?
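To make the first question concrete, here is a minimal sketch of the selection rule I described, in Python. The field names (`value`, `prior`, `visits`) are illustrative, not from any particular library, and I add 1 to the visit count only to avoid dividing by zero for unvisited children:

```python
def select_child(node, lam):
    """Pick the child maximizing value + lambda * prior / visit_count.

    Each child is assumed to carry:
      value  - mean value-network evaluation backed up so far
      prior  - policy-network probability of the move
      visits - how many times the child has been visited
    """
    def score(child):
        # +1 avoids division by zero for unvisited children
        return child.value + lam * child.prior / (child.visits + 1)
    return max(node.children, key=score)
```

With a large `lam`, the policy prior dominates and unvisited high-prior moves are tried first; with a small `lam`, the (possibly unreliable) value estimates dominate.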
EDIT: The second question is about the 'expand' stage of the typical 'selection, expansion, evaluation, backup' MCTS loop. I reckon that by expanding as quickly as possible, MCTS can explore deeper and give more accurate value approximations. I set a parameter `n` as the number of times a leaf node must be visited before it is expanded. I want to know, intuitively, how a large `n` and a small `n` would influence the performance of MCTS.
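Here is a sketch of what I mean by the threshold `n`, assuming a `policy_fn(state)` that stands in for the policy network and returns `(move, prior)` pairs (both names are my own, for illustration):

```python
class Node:
    def __init__(self, state, prior=1.0):
        self.state = state
        self.prior = prior       # policy-network probability of the move
        self.visits = 0          # visit count
        self.children = []       # empty until the node is expanded

def maybe_expand(leaf, n_expand, policy_fn):
    """Expand `leaf` only once it has accumulated n_expand visits.

    Returns True if the leaf is (now) expanded. A small n_expand grows
    the tree deeper but evaluates each node fewer times; a large
    n_expand averages more value-network evaluations per node before
    committing to expansion.
    """
    if not leaf.children and leaf.visits >= n_expand:
        for move, prior in policy_fn(leaf.state):
            # deriving the child state is game-specific; a tuple suffices here
            leaf.children.append(Node((leaf.state, move), prior))
    return bool(leaf.children)
```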