
This question is an attempt to reframe a previous question of mine to make it clearer.

This slide shows an equation for Q(state, action) in terms of a set of weights and feature functions.
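
If I'm reading that slide correctly, the equation has the standard linear form, with a single weight vector shared across all actions:

$$Q(s, a) \approx w_1 f_1(s, a) + w_2 f_2(s, a) + \dots + w_n f_n(s, a)$$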

These discussions (The Basic Update Rule and Linear Value Function Approximation), by contrast, show a separate set of weights for each action.

The reason they are different is that the first slide assumes you can anticipate the result of performing an action and then find features for the resulting states. (Note that the feature functions are functions of both the current state and the anticipated action.) In that case, the same set of weights can be applied to all the resulting features.
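
Here is a rough sketch of how I understand that first approach. The names (`feature_fn`, `greedy_action`, the toy feature extractor) are my own placeholders, not the actual Berkeley code:

```python
# First approach: one shared weight vector; the feature functions depend on
# both the state and the action (e.g. on the anticipated result of the action).

def q_value(weights, feature_fn, state, action):
    """Q(s, a) = sum_i w_i * f_i(s, a), with one weight vector shared by all actions."""
    features = feature_fn(state, action)          # dict: feature name -> value
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

def greedy_action(weights, feature_fn, state, actions):
    """Pick the action whose features score highest under the shared weights."""
    return max(actions, key=lambda a: q_value(weights, feature_fn, state, a))

# Toy usage: a made-up feature extractor that "anticipates" the next position
# on a 1-D line and measures the distance to a goal at position 10.
def toy_features(state, action):                  # state is an int, action is -1 or +1
    anticipated = state + action                  # anticipate the result of the action
    return {"bias": 1.0, "dist_to_goal": abs(10 - anticipated)}

weights = {"bias": 0.0, "dist_to_goal": -1.0}     # smaller distance => higher Q
print(greedy_action(weights, toy_features, state=3, actions=[-1, +1]))   # prints 1
```

The point is that the same weights are used for every action; what varies with the action is the feature vector.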

But in some cases, one can't anticipate the effect of an action. Then what does one do? Even if one has perfect weights, one can't apply them to the results of applying the actions if one can't anticipate those results.

My guess is that the second pair of slides deals with that problem. Instead of performing an action and then applying weights to the features of the resulting states, compute features of the current state and apply possibly different weights for each action.
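
A rough sketch of that second approach, i.e. $Q(s, a) = \sum_i w_{a,i} f_i(s)$. Again, the names and numbers below are made up for illustration:

```python
# Second approach: features of the current state only, with a separate weight
# vector per action, so no anticipation of the action's effect is needed.

def q_value_per_action(weights_per_action, state_features, action):
    """Q(s, a) = sum_i w_{a,i} * f_i(s), with a distinct weight vector for each action."""
    w = weights_per_action[action]                # dict: feature name -> weight
    return sum(w.get(name, 0.0) * value
               for name, value in state_features.items())

# Toy usage with cart-pole-like features of the state alone.
state_features = {"bias": 1.0, "pole_angle": 0.2, "pole_velocity": -0.5}
weights_per_action = {
    "left":  {"bias": 0.1, "pole_angle": -1.0, "pole_velocity": -0.3},
    "right": {"bias": 0.1, "pole_angle":  1.0, "pole_velocity":  0.3},
}
best = max(weights_per_action,
           key=lambda a: q_value_per_action(weights_per_action, state_features, a))
print(best)   # "right" for these numbers
```

Here the features never mention the action; the action only selects which weight vector gets applied.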

Those are two very different ways of doing feature-based approximation. Are they both valid? The first one makes sense in situations like Taxi, in which one can effectively simulate what the environment will do in response to each action. But in other cases, e.g., cart-pole, that isn't feasible. Then it would seem you need a separate set of weights for each action.

Is this the right way to think about it, or am I missing something?

Thanks.

RussAbbott
  • Why do you think it is necessary to anticipate the effect of an action? If I've understood correctly, the feature functions are applied to the current state and current action; you don't need to anticipate the next state. Right? – Pablo EM Dec 08 '18 at 13:18
  • There is some code from UC Berkeley for a class that teaches reinforcement learning. It tries each action, anticipates what each one does, determines the feature values for each of the resulting states, and picks the action that produces the best result. – RussAbbott Dec 09 '18 at 04:40
  • Do you mean the class from the video you included in the question? From the slide you mention, it doesn't seem they try all the actions, but honestly I haven't watched the complete video. Anyway, could you please point me to the code you are talking about? I'm curious about it :D – Pablo EM Dec 09 '18 at 12:33
  • Here's a version translated from Python 2 to Python 3. (https://drive.google.com/file/d/1LJxpJNAu2K_FCq9KLAMhgrrCxcjtg18N/view?usp=sharing) – RussAbbott Dec 09 '18 at 21:28
  • You might be interested in the RL portion of this course: https://inst.eecs.berkeley.edu/~cs188/fa18/project3.html#Q10. (The full course: https://inst.eecs.berkeley.edu/~cs188/fa18/. RL is Week 5. The relevant project is P3, listed on the Week 6 line.) – RussAbbott Dec 09 '18 at 21:56

0 Answers