I tried to implement a "hello world" MCTS player for tic-tac-toe, but I ran into a problem.
While simulating the game and choosing "the most promising" (exploit/explore) node, I only take the total number of wins into account (the "exploit" part). This causes a problem: the resulting algorithm is not defensive at all. As a result, when choosing between
- a move that results in (100 draws; 10 losses)
- a move that results in (1 win; 109 losses)
the worse one (1 win; 109 losses) is chosen, because my UCT function greedily uses average wins instead of a "value".
Am I identifying this problem correctly? Should I switch from "average wins" to some other value metric that takes all result types into account?
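To make the comparison concrete, here is a minimal sketch of the two scoring options I have in mind. It is not my actual code: the function names, the exploration constant `c = sqrt(2)`, the parent visit count of 220, and the reward scheme (win = 1, draw = 0.5, loss = 0) are all assumptions for illustration.

```python
import math

def uct(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCT score: average reward (exploit) plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited nodes are tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def wins_only(wins, draws, losses):
    """What I do now: only wins count towards the reward."""
    return wins

def weighted(wins, draws, losses, draw_value=0.5):
    """Assumed alternative: a draw is worth half a win, a loss is worth 0."""
    return wins + draw_value * draws

# (wins, draws, losses) for the two moves from the example above
MOVE_A = (0, 100, 10)
MOVE_B = (1, 0, 109)
PARENT_VISITS = sum(MOVE_A) + sum(MOVE_B)  # hypothetical parent count

def pick(reward_fn):
    """Return the label of the move with the higher UCT score."""
    score_a = uct(reward_fn(*MOVE_A), sum(MOVE_A), PARENT_VISITS)
    score_b = uct(reward_fn(*MOVE_B), sum(MOVE_B), PARENT_VISITS)
    return "A" if score_a > score_b else "B"

if __name__ == "__main__":
    print("wins only:", pick(wins_only))
    print("weighted: ", pick(weighted))
```

Both moves have 110 visits, so the exploration term is identical and only the exploit term decides. With wins only, move B scores 1/110 against 0/110 for move A; with the weighted reward, move A scores 50/110 against 1/110.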
Any advice is welcome, thanks