In "Stone, Peter, Richard S. Sutton, and Gregory Kuhlmann. "Reinforcement learning for robocup soccer keepaway." Adaptive Behavior 13.3 (2005): 165-188.", the RLstep pseudocode seems quite a bit different from Sarsa(λ), which the authors say RLStep implements.
Here is the RLstep pseudocode and here is the Sarsa(λ) pseudocode.
The areas of confusion are:

1. Line 10 in the Sarsa(λ) pseudocode updates the Q value for each state-action pair after adding 1 to e(s,a), but in the RLstep pseudocode the eligibility trace update (line 19) doesn't happen until after the value update (line 17). (A sketch of the ordering I was expecting is below the list.)
2. Lines 18 and 19 in RLstep seem quite different from the Sarsa(λ) pseudocode.
3. What are lines 20-25 doing with the eligibility trace?
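
For context, this is the ordering I was expecting, written as a minimal tabular Python sketch of textbook Sarsa(λ) with accumulating traces. The state/action counts and hyperparameters are arbitrary placeholders, nothing here is taken from the paper (which uses function approximation rather than a table), but the ordering question is the same:

```python
import numpy as np

# Minimal tabular Sarsa(lambda) with accumulating traces, just to show the
# ordering I expected: bump the trace for the current (s, a), then update
# *all* Q values through their traces, then decay every trace.
# n_states, n_actions, and the hyperparameters are arbitrary placeholders.
n_states, n_actions = 10, 4
alpha, gamma, lam = 0.1, 0.99, 0.9

Q = np.zeros((n_states, n_actions))
e = np.zeros((n_states, n_actions))

def sarsa_lambda_step(s, a, r, s_next, a_next):
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    e[s, a] += 1.0               # trace incremented *before* the value update
    Q[:] += alpha * delta * e    # every state-action pair updated via its trace
    e[:] *= gamma * lam          # traces decay only after the value update

# Example transition: state 0, action 1, reward 1.0, next state 2, next action 3.
sarsa_lambda_step(0, 1, 1.0, 2, 3)
```

The point I'm anchoring on is that the trace for the current pair is incremented before the sweep over Q values, and the decay happens last; that's the ordering I can't map onto RLstep.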