I have successfully implemented SARSA (both one-step and with eligibility traces) using table lookup. In essence, I have a Q-value matrix where each row corresponds to a state and each column to an action.
Something like:
[Q(s1,a1), Q(s1,a2), Q(s1,a3), Q(s1,a4)]
[Q(s2,a1), Q(s2,a2), Q(s2,a3), Q(s2,a4)]
.
.
.
[Q(sn,a1), Q(sn,a2), Q(sn,a3), Q(sn,a4)]
At each time step, the row corresponding to the current state is selected; then, depending on the policy, an action is picked and its Q-value is updated according to the SARSA rule.
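For reference, a minimal sketch of that tabular update (using NumPy; the sizes and the alpha/gamma values below are just placeholder assumptions):

import numpy as np

# Hypothetical sizes; the table has the same layout as the matrix above.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # assumed learning rate and discount factor

def sarsa_update(s, a, r, s_next, a_next):
    """One-step tabular SARSA update for a transition (s, a, r, s', a')."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error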
I am now trying to implement it as a neural network trained with gradient descent.
My first idea was to create a two-layer network: an input layer with as many neurons as there are states, and an output layer with as many neurons as there are actions, with every input fully connected to every output. (So, in fact, the weight matrix would look like the Q-matrix above.)
My input vector would be a 1xn row vector, where n is the number of input neurons. All values in the input vector would be 0, except for the index corresponding to the current state, which would be 1. I.e.:
[0 0 0 1 0 0]
Would be an input vector for an agent in state 4.
So, the process would be something like:
[0 0 0 1 0 0] X [4 7 9 3]
                [5 3 2 9]
                [3 5 6 9]
                [9 3 2 6]
                [2 5 7 8]
                [8 2 3 5]
where I have created a random sample weight matrix.
The result would be:
[9 3 2 6]
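As a concrete sketch of that forward pass (NumPy again; the weights are the sample values above and the variable names are just illustrative):

import numpy as np

W = np.array([[4, 7, 9, 3],
              [5, 3, 2, 9],
              [3, 5, 6, 9],
              [9, 3, 2, 6],
              [2, 5, 7, 8],
              [8, 2, 3, 5]], dtype=float)  # the sample 6x4 weight matrix above

x = np.zeros(6)
x[3] = 1.0  # one-hot input for state 4

q_values = x @ W  # -> [9. 3. 2. 6.], i.e. row 4 of W
greedy_action = int(np.argmax(q_values))  # -> 0, i.e. action 1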
This means that, under a greedy policy, action 1 would be picked, and that the connection between the fourth input neuron and the first output neuron should be strengthened by:
w_new = w_old + learning_rate*(reward + discount*next_output - w_old)
(Equation taken from the SARSA update rule, where next_output is the network's output for the next state-action pair.)
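In other words, written out for this one-hot/linear network, the update I have in mind is something like the following sketch (semi-gradient SARSA; the helper name and the alpha/gamma defaults are just placeholders):

def network_sarsa_update(W, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Semi-gradient SARSA step for the one-hot linear network sketched above.

    With a one-hot input, only the weight W[s, a] has a non-zero gradient,
    so the gradient-descent update touches a single entry of W.
    """
    q_sa = W[s, a]              # network output for (s, a)
    q_next = W[s_next, a_next]  # network output for (s', a')
    td_error = r + gamma * q_next - q_sa
    W[s, a] += alpha * td_error  # gradient of q_sa w.r.t. W[s, a] is 1
    return W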
HOWEVER - this implementation doesn't convince me. From what I have read, the network weights should be used to compute the Q-value of a state-action pair, but I'm not sure they should directly represent those values themselves. (Especially since I have usually only seen weight values bounded between 0 and 1.)
Any advice?