I am learning about the SARSA algorithm and its implementation, and I have a few questions. My understanding of the general "learning" step is as follows:
A robot is in state $s$. There are four actions available: North ($n$), East ($e$), West ($w$) and South ($s$), so the list of actions is $a = \{n, e, w, s\}$. The robot randomly picks an action and updates as follows:
$$Q(s,a) = Q(s,a) + L\left[\,r + D\,Q(s',a') - Q(s,a)\,\right]$$
where $L$ is the learning rate, $r$ is the reward associated with $(s,a)$, $Q(s',a')$ is the expected reward from taking action $a'$ in the new state $s'$, and $D$ is the discount factor.
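To check that I am reading the update correctly, here is a minimal sketch of how I would implement that single step (the names `Q`, `alpha` and `gamma` are my own placeholders for the table, $L$ and $D$, not from any particular library):

```python
# Placeholder names (my own): Q is a dict mapping (state, action) -> value,
# alpha plays the role of L and gamma the role of D from the update above.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update for the transition (s, a, r, s_next, a_next)."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)   # the "- Q(s,a)" term I ask about below
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error

# Example call with made-up states and rewards:
Q = {}
sarsa_update(Q, s=(0, 0), a="n", r=-1.0, s_next=(0, 1), a_next="e")
```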
Firstly, I don't understand the role of the $-Q(s,a)$ term: why are we subtracting the current Q-value?
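To make the question concrete, I tried rearranging the update, which I believe is algebraically equivalent:

$$Q(s,a) = (1 - L)\,Q(s,a) + L\left[\,r + D\,Q(s',a')\,\right]$$

Is the subtraction just a way of moving the current estimate a fraction $L$ of the way towards the new target?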
Secondly, when picking the actions $a$ and $a'$, why do they have to be random? I know that in some implementations of SARSA all possible $Q(s',a')$ values are taken into account and the action with the highest value is picked (I believe this is epsilon-greedy?). Why not do this when choosing which $Q(s,a)$ value to update as well? Or why not update all $Q(s,a)$ for the current $s$?
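For reference, this is roughly how I understand epsilon-greedy action selection; the value of `epsilon` and the random tie-breaking are my own choices rather than anything prescribed by SARSA itself:

```python
import random

ACTIONS = ["n", "e", "w", "s"]

def epsilon_greedy(Q, s, actions=ACTIONS, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    values = [Q.get((s, a), 0.0) for a in actions]
    best = max(values)
    best_actions = [a for a, v in zip(actions, values) if v == best]
    return random.choice(best_actions)                     # exploit, breaking ties randomly
```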
Finally, why is SARSA limited to a one-step lookahead? Why not, say, also look at a hypothetical $Q(s'',a'')$?
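What I have in mind is a two-step target along the lines of

$$r + D\,r' + D^2\,Q(s'',a'')$$

where $r'$ would be the reward collected after taking $a'$ in $s'$; I am only guessing that this is what looking one step further ahead would look like.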
I guess overall my questions boil down to: what makes SARSA better than a breadth-first or depth-first search algorithm?