I'm wondering how TD learning can be incorporated into MCTS to improve its value estimates. Most TD applications use the reward obtained in the next state S', but in MCTS the reward is only observed after a whole rollout, so how can TD be implemented?
Would it be something like:
Q(s) = Q(s) + a*(Reward - Q(s))
for every node in the backpropagation stage, where Reward is the return from the rollout? Currently I update the average reward for every node, but I think a TD implementation would be better.
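Concretely, here is a rough sketch of what I'm imagining for the backpropagation step (Node, path, alpha, and gamma are just placeholders for my implementation; if rewards only arrive at the end of the rollout, reward would be 0 for interior nodes):

    class Node:
        def __init__(self, reward=0.0):
            self.Q = 0.0          # current value estimate
            self.reward = reward  # immediate reward on the transition into this node
            # ... children, visit counts, etc. omitted

    def backpropagate_td(path, rollout_return, alpha=0.1, gamma=1.0):
        """TD(0)-style backup from the leaf of `path` back to the root.

        The leaf is moved toward the rollout return; each node above it
        is moved toward r + gamma * Q(child), i.e. it bootstraps from
        the child's current estimate instead of the raw rollout return.
        """
        target = rollout_return
        for node in reversed(path):
            node.Q += alpha * (target - node.Q)    # TD(0) update
            target = node.reward + gamma * node.Q  # bootstrapped target for the parent

Is bootstrapping from the child's Q like this the right way to do it, or should every node on the path just be moved toward the raw rollout return?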
Thanks in advance