
I am currently reading Sutton's introduction about reinforcement learning. After arriving in chapter 10 (On-Policy prediction with approximation), I am now wondering how to choose the form of the function q for which the optimal weights w shall be approximated.

I am referring to the first line of the pseudo code below from Sutton: How do I choose a good differentiable function $\hat{q}(s, a, \mathbf{w})$? Are there any standard strategies to choose it?

[Pseudo code from Sutton: Episodic semi-gradient Sarsa for estimating $\hat{q} \approx q_*$]

zimmerrol

1 Answer


You can choose any function approximator that is differentiable. Two commonly used classes of value function approximators are:

  1. Linear function approximators: Linear combinations of features

     For approximating $\hat{q}$ (the action-value function):
      1. Find features that are functions of states and actions.
      2. Represent $\hat{q}$ as a weighted combination of these features:
    

    $$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(s, a) = \sum_{i=1}^{d} w_i \, \phi_i(s, a)$$

    where $\boldsymbol{\phi}(s, a) \in \mathbb{R}^d$ is the feature vector whose $i$-th component is $\phi_i(s, a)$, and $\mathbf{w} \in \mathbb{R}^d$ is the weight vector whose $i$-th component is $w_i$.

  2. Neural Network

    Represent $\hat{q}(s, a, \mathbf{w})$ using a neural network. You can use either an action-in architecture (left of the figure below) or an action-out architecture (right of the figure below). An action-in network takes representations of both the state and the action as input and produces a single value (the Q-value) as output; an action-out network takes only the representation of the state $s$ as input and outputs one value for each action $a$ in the action space (this type is easier to realize if the action space is discrete and finite).

    [Figure: action-in (left) and action-out (right) network architectures]

    Using the first type (action-in), since it is closest to the linear example above, you could create a Q-value approximator with a neural network as follows (a code sketch follows below this list):

      1. Represent the state and action as a normalized vector (or as a one-hot vector encoding the state and the action).
      2. Input layer: size = number of inputs.
      3. `n` hidden layers with `m` neurons each, using a sigmoid activation function.
      4. Output layer: a single output neuron (the Q-value).
      5. Update the weights using gradient descent as per the *episodic semi-gradient Sarsa* algorithm.
    

    You could also use raw visual observations (if available) as the input and add convolutional layers, as in the DQN paper. But read the note below regarding convergence and the additional tricks needed to stabilize such non-linear-approximator-based methods.
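
As a concrete illustration of the action-in recipe listed above, here is a minimal sketch of such a Q-network trained with per-step semi-gradient Sarsa updates. PyTorch, the input sizes, and the hidden-layer width are illustrative assumptions (they are not part of the book's pseudo code); the target is held fixed so that no gradient flows through it, which is what makes the update a semi-gradient:

```python
import torch
import torch.nn as nn

n_state_features = 8     # assumed size of the state representation
n_action_features = 4    # assumed size of the action representation
gamma, alpha = 0.99, 1e-3

# Action-in network: input = concatenated state and action features,
# output = a single Q-value estimate q_hat(s, a, w).
q_net = nn.Sequential(
    nn.Linear(n_state_features + n_action_features, 32),
    nn.Sigmoid(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)

def q_hat(state_vec, action_vec):
    """Evaluate q_hat(s, a, w) for one state-action pair (1-D tensors)."""
    return q_net(torch.cat([state_vec, action_vec])).squeeze()

def sarsa_step(s, a, r, s_next, a_next, terminal):
    """One semi-gradient Sarsa update: the target R + gamma * q_hat(S', A', w)
    is treated as a constant, so only q_hat(S, A, w) is differentiated."""
    prediction = q_hat(s, a)
    with torch.no_grad():
        target = torch.tensor(float(r)) if terminal else r + gamma * q_hat(s_next, a_next)
    loss = (target - prediction) ** 2        # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each call to `sarsa_step` corresponds to one weight update in the inner loop of the episodic semi-gradient Sarsa pseudo code.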


Graphically the function approximator looks like this:

[Figure: linear function approximator as a single layer of elementary functions $\varphi$ applied to inputs $x_i$]

Note that $\varphi$ (equivalently $\phi$) denotes an elementary function and $x_i$ denotes an element of the state-action vector. You can use any elementary function in place of $\varphi$; common choices are linear functions, Radial Basis Functions (RBFs), etc.
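
For example, a minimal sketch of RBF features over a one-dimensional state variable (the centers and the width below are arbitrary, assumed values):

```python
import numpy as np

centers = np.linspace(0.0, 1.0, 8)   # assumed RBF centers over a 1-D state variable
sigma = 0.1                           # assumed RBF width

def rbf_features(x):
    """phi_i(x) = exp(-(x - c_i)^2 / (2 * sigma^2)) for each center c_i."""
    return np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))
```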

What makes a good differentiable function depends on the context, but in reinforcement learning settings the convergence properties and error bounds are important. The episodic semi-gradient Sarsa algorithm discussed in the book has convergence properties similar to those of TD(0) under a constant policy.

Since you specifically asked about on-policy prediction, a linear function approximator is advisable because it is guaranteed to converge. Some of the other properties that make linear function approximators suitable:

  • With the mean squared error objective, the error surface is quadratic with a single minimum, so gradient descent is guaranteed to find the minimum, which is the global optimum.
  • The error bound (as proved by Tsitsiklis & Van Roy, 1997, who also treat the general case of TD($\lambda$)) is:

    $$\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma} \, \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$$

    This means that the asymptotic error will be no more than $\frac{1}{1-\gamma}$ times the smallest possible error, where $\gamma$ is the discount factor. The gradient is also simple to calculate: for a linear approximator, $\nabla_{\mathbf{w}} \hat{q}(s, a, \mathbf{w}) = \boldsymbol{\phi}(s, a)$.
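
A minimal sketch of this linear case (NumPy, with a hypothetical one-hot feature construction for a small discrete problem; the state and action counts, discount factor, and step size are made-up values):

```python
import numpy as np

n_states, n_actions = 10, 4
d = n_states * n_actions            # number of features = number of weights
w = np.zeros(d)                     # weight vector
gamma, alpha = 0.99, 0.1            # discount factor and step size (assumed values)

def phi(s, a):
    """One-hot feature vector for the (state, action) pair."""
    x = np.zeros(d)
    x[s * n_actions + a] = 1.0
    return x

def q_hat(s, a):
    """Linear action-value estimate: q_hat(s, a, w) = w . phi(s, a)."""
    return w @ phi(s, a)

def semi_gradient_sarsa_update(s, a, r, s_next, a_next, terminal):
    """w <- w + alpha * [R + gamma * q_hat(S',A',w) - q_hat(S,A,w)] * grad_w q_hat(S,A,w)."""
    global w
    target = r if terminal else r + gamma * q_hat(s_next, a_next)
    td_error = target - q_hat(s, a)
    w += alpha * td_error * phi(s, a)   # grad_w q_hat(s, a, w) = phi(s, a)
```

Because the gradient of $\hat{q}$ with respect to $\mathbf{w}$ is just $\boldsymbol{\phi}(s, a)$, the update reduces to a single vector operation.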

Using a non-linear approximator (such as a (deep) neural network), however, does not inherently guarantee convergence. Gradient-TD methods use the true gradient of the projected Bellman error for the updates, instead of the semi-gradient used in the episodic semi-gradient Sarsa algorithm, and are known to converge even with non-linear function approximators (and even for off-policy prediction) if certain conditions are met.

  • Thanks for the detailed answer. When I use an `ANN` and want to update the weights, one usually uses a loss function which is minimized during backpropagation - what loss function do I choose in this case? – zimmerrol Jul 26 '17 at 09:13
  • In that case, the loss for the Q-value function estimate (for the episodic semi-gradient Sarsa algorithm as per your question) will be: `R + Gamma * q_hat(S',A',W) - q_hat(S,A,W)` for each step, where `S` is the current state and `A` is the action the agent takes in that state, for which it receives a reward `R` and ends up in the next state `S'`, after which it takes an action `A'`. This term appears in the pseudo code on the line where the weights are updated. Note that if `S'` is a terminal state, `q_hat(S',A',W)` is 0. – PraveenPalanisamy Jul 28 '17 at 00:29
  • The term `R + Gamma * q_hat(S',A',W)` is called the *target* or *label*, which is a common term in supervised learning (for example with ANNs). That makes the loss term easy to comprehend as: `Loss = target - predicted`. – PraveenPalanisamy Jul 28 '17 at 00:35
  • So I can train my network with just one sample at a time, since I need to use the current prediction of the old state to perform the current update? This means that I cannot use batches - is this right? – zimmerrol Jul 28 '17 at 10:23
  • Yes, your understanding is correct. As per the *episodic semi-gradient Sarsa* algorithm, you would update the weights of the function approximator after every step of each episode. In other variations, for example those using *experience replay*, some or all of the previously observed experiences (S, A, R, S', A') are typically stored and used to update the weights in batches rather than with single-example updates. – PraveenPalanisamy Aug 05 '17 at 21:09