This slide shows an equation for Q(state, action) as a weighted sum of feature functions, something like Q(s, a) = Σᵢ wᵢ · fᵢ(s, a). I'm confused about how to write the feature functions.
Given an observation, I understand how to extract features from it. But one doesn't know, before taking an action, what effect that action will have on the features. So how does one write a function that maps an (observation, action) pair to a numerical value?
In the Pacman example shown a few slides later, one knows, given a state, what the effect of an action will be. But that's not always the case. For example, consider the cart-pole problem in OpenAI Gym. There, the observation itself consists of four feature values: cart position, cart velocity, pole angle, and pole angular velocity. There are two actions: push left and push right. But one doesn't know in advance how those actions will change the four feature values. So how does one compute Q(s, a)? That is, how does one write the feature functions fᵢ(state, action)?
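To make my question concrete, here's the one approach I can imagine: give each discrete action its own block of weights, so fᵢ(s, a) copies the observation into the block for action a and is zero elsewhere. This sidesteps predicting the action's effect, but I'm not sure it's what the slides intend. All the names below are my own, not from the lecture:

```python
# A sketch of "per-action weight blocks" for cart-pole.
# The observation is copied into the slot for the chosen action;
# all other slots are zero. This is equivalent to learning a
# separate weight vector for each action.

N_FEATURES = 4  # cart position, cart velocity, pole angle, pole angular velocity
N_ACTIONS = 2   # 0 = push left, 1 = push right

def features(obs, action):
    """Return f(s, a): obs placed in the block for `action`, zeros elsewhere."""
    f = [0.0] * (N_FEATURES * N_ACTIONS)
    f[action * N_FEATURES:(action + 1) * N_FEATURES] = obs
    return f

def q_value(weights, obs, action):
    """Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * x for w, x in zip(weights, features(obs, action)))

weights = [0.1] * (N_FEATURES * N_ACTIONS)
obs = [0.0, 0.5, 0.02, -0.1]  # a made-up cart-pole observation
print(q_value(weights, obs, 0))
print(q_value(weights, obs, 1))
```

With this construction, fᵢ never has to predict what the action does to the state; the action only selects which weights see the observation. Is that the intended reading of fᵢ(state, action), or is there a more general recipe?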
Thanks.