
I'm trying to implement linear gradient-descent Sarsa based on Sutton & Barto's book; see the algorithm in the picture below.

However, I struggle to understand something in the algorithm:

  • Is the dimension of w and z independent of how many different actions can be taken? In the book they seem to have dimension equal to the number of features, which I would say is independent of the number of actions.
  • Is there a separate w and z for each action? I cannot see in the book that this should be the case either.
  • If I am right in the two bullets above, then I cannot see how the index list F_a can depend on the action, and therefore I cannot see how the action-value function q_a can depend on the action (see the lines marked in yellow in the algorithm below). But the action value must depend on the action, so there is something I am not getting...

I hope someone can help clarify this for me :)

[Figure: the linear gradient-descent Sarsa(λ) algorithm from Sutton & Barto, with the lines computing F_a and q_a marked in yellow]

bbiegel

1 Answer


w is the weight vector for the function approximator. The function you are approximating is Q(s,a), the action-value function, which tells you the value of taking an action in a state. It's up to you to define the weights, but yes, you're right: you need to think about how you want to represent the actions in the weights.

One way is to define a set of state features and then instantiate them once per action (multiple separate w vectors). For convenience, you can then concatenate these vectors into one big w, because only the chunk of the weight vector activated by a state-action pair's features will ever be updated.

Having a disjoint set of state features per action means a lot of weights if the action space is large, though, so you might instead compress multiple actions into different scalar activations of a single weight. If the true Q values are close between actions, you'll perform just as well, and you'll actually learn faster because there are fewer weights to optimize. The representation is flexible. It's up to you!
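To make the concatenated representation concrete, here is a minimal sketch assuming binary state features and a small discrete action space. The names `n_features`, `n_actions`, and the `active_state_features` helper are illustrative, not from the book:

```python
import numpy as np

n_features = 100        # number of binary state features (illustrative)
n_actions = 4           # size of the discrete action space (illustrative)

# One chunk of n_features weights per action, concatenated into a single w.
w = np.zeros(n_features * n_actions)

def F(s, a, active_state_features):
    """Active-feature index list for the pair (s, a): the state's active
    feature indices, offset into action a's chunk of w.
    `active_state_features(s)` is a hypothetical helper returning the
    indices of the binary features that are on in state s."""
    return [a * n_features + i for i in active_state_features(s)]

def q(s, a, active_state_features):
    """With binary features, Q(s, a) is just the sum of the weights at the
    active indices."""
    return sum(w[i] for i in F(s, a, active_state_features))
```

This is also why F_a in the algorithm depends on the action even though there is only one w: the action selects which chunk of indices the state features map into.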

I encourage you to look at the algorithm as written in the second edition of the book (drafts are available from the authors' sites); the notation is clearer. The algorithm you posted is actually a λ-return method, which you can read about in Chapter 12 (z is an eligibility trace; it has the same dimension as w and isn't critical to the question you are asking). Episodic semi-gradient Sarsa, the same algorithm minus some bells and whistles, appears in Section 10.1.
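For reference, here is a minimal sketch of the core Sarsa(λ) update with linear function approximation and accumulating traces, roughly in the second edition's notation; the function signature and hyperparameter names are assumptions, not the book's code:

```python
import numpy as np

def sarsa_lambda_update(w, z, x_sa, x_next, reward, gamma, lam, alpha, done):
    """One step of the Sarsa(lambda) weight update with linear function
    approximation and accumulating traces.

    x_sa, x_next: feature vectors for (S, A) and (S', A'); x_next is
    ignored when `done` is True. Returns the updated (w, z).
    """
    q_sa = w @ x_sa
    q_next = 0.0 if done else w @ x_next
    delta = reward + gamma * q_next - q_sa   # TD error
    z = gamma * lam * z + x_sa               # accumulating eligibility trace
    w = w + alpha * delta * z                # semi-gradient weight update
    return w, z
```

Note that z should be reset to zeros at the start of each episode, and with the one-big-w representation above, x_sa is simply the indicator vector of the index list F(s, a).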

Nick Walker
  • Hi Nick and thanks a lot for the answer! Your answer makes sense. I implemented the algorithm above with one w for each action, and it worked. Can you clarify your point regarding compressing multiple actions into the same scalar? Do you need prior model knowledge in order to know which actions in what states you can combine? – bbiegel Dec 05 '16 at 07:44
  • Yes. Any parameterization beyond a tabular one (which has one binary feature/weight per S-A pair) requires you to make assumptions about the form of the value function. If you know that states in a local area have about the same value, you might collapse them into a single binary feature. If two actions are similar, you might collapse them into a single feature with two scalar activations. This limits your value function's ability to express the difference between them, which may be good or bad depending on how well it reflects the underlying value function you are trying to learn. – Nick Walker Dec 17 '16 at 17:43
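As a rough illustration of the shared-feature idea in the last comment, here is a hypothetical feature function in which two similar actions reuse the same state features and differ only in a scalar activation; the action names and scale values are made up:

```python
import numpy as np

# Two similar actions share one set of state features, distinguished only
# by a scalar activation, instead of getting disjoint feature sets.
action_scale = {"small_step": 1.0, "large_step": 2.0}  # made-up activations

def x(active_indices, a, n_state_features):
    """Feature vector for (s, a): the state's active binary features,
    scaled by the action's activation. Both actions reuse the same
    n_state_features weights, so Q(s, a) = w @ x(...)."""
    phi = np.zeros(n_state_features)
    phi[active_indices] = action_scale[a]
    return phi
```

This ties the two actions' values together (here Q(s, large_step) is forced to be twice Q(s, small_step)), which is exactly the loss of expressiveness, good or bad, that the comment describes.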