
I am reading Silver et al. (2012), "Temporal-Difference Search in Computer Go", and trying to understand the update order in the eligibility trace algorithm. In Algorithms 1 and 2 of the paper, the weights are updated before the eligibility trace is updated (lines 11 and 12 of Algorithm 1, and lines 12 and 13 of Algorithm 2). I wonder whether this order is correct. Consider the extreme case lambda = 0: the parameters are not updated with the initial state-action pair, since e is still 0 when the first weight update happens. So I suspect the order should be reversed.

Can someone clarify this point?

I find the paper very instructive for learning reinforcement learning, so I would like to understand it in detail.

If there is a more suitable platform to ask this question, please kindly let me know as well.


Kota Mori
    For future reference: questions like this are probably a better fit over on https://ai.stackexchange.com/ than on StackOverflow. We also have support for proper math in questions and answers there! – Dennis Soemers Oct 18 '18 at 17:28

1 Answer


It looks to me like you're correct: e should be updated before theta. That is also what the math in the paper prescribes. See, for example, Equations (7) and (8), where e_t is first computed using phi(s_t), and only then is theta updated using delta V_t (which would be delta Q in the control case).

Note that what you wrote about the extreme case with lambda = 0 is not entirely correct. The initial state-action pair will still be involved in an update: not in the first iteration, but its features are added to e at the end of the first iteration, so they take part in the weight update of the second iteration. However, it looks to me like the very first reward r will never be used in any update, because it only appears in the TD error of the very first iteration, where e is still 0. Since this paper is about Go, I suspect it will not matter though; unless they're doing something unconventional, they probably only use non-zero rewards for the terminal game state.
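To make the difference concrete, here is a minimal sketch of linear TD(lambda) with both update orders. It is not the paper's algorithm: it uses state values instead of action values, no discounting, one-hot features, and a hypothetical helper `td_lambda_episode` whose name and arguments are mine. With lambda = 0 and a reward only on the first transition, the weights-first order never learns anything from that reward.

```python
import numpy as np

def td_lambda_episode(features, rewards, alpha, lam, trace_first):
    """Run linear TD(lambda) over one episode (no discounting).

    features[t] is phi(s_t); the last row must be all zeros so that the
    terminal state has value 0. rewards[t] belongs to the step t -> t+1.
    trace_first=True updates e before theta; False updates theta first.
    """
    theta = np.zeros(features.shape[1])
    e = np.zeros_like(theta)  # eligibility trace starts at 0
    for t in range(len(rewards)):
        # TD error for the transition s_t -> s_{t+1}
        delta = rewards[t] + features[t + 1] @ theta - features[t] @ theta
        if trace_first:
            e = lam * e + features[t]
            theta = theta + alpha * delta * e
        else:
            # Weights first: delta is multiplied by the trace of the
            # previous step, which is 0 on the very first iteration.
            theta = theta + alpha * delta * e
            e = lam * e + features[t]
    return theta

# Three non-terminal states with one-hot features, plus a terminal row of zeros.
phi = np.vstack([np.eye(3), np.zeros(3)])
first_reward_only = [1.0, 0.0, 0.0]

print(td_lambda_episode(phi, first_reward_only, alpha=0.5, lam=0.0, trace_first=True))
print(td_lambda_episode(phi, first_reward_only, alpha=0.5, lam=0.0, trace_first=False))
```

In this toy setting the trace-first run moves the weight of the first state to 0.5, while the weights-first run leaves all weights at 0. If the only non-zero reward is on the last transition instead, the weights-first order does use it, but credits it to the features of the previous state rather than the current one.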

Dennis Soemers
  • Thank you for the answer, and for suggesting https://ai.stackexchange.com/. Your point about my extreme case also makes sense. – Kota Mori Oct 19 '18 at 05:21