I have successfully implemented SARSA (both one-step and with eligibility traces) using table lookup. In essence, I have a Q-value matrix where each row corresponds to a state and each column to an action.
Something like:
[Q(s1,a1), Q(s1,a2), Q(s1,a3), Q(s1,a4)]
[Q(s2,a1), Q(s2,a2), Q(s2,a3), Q(s2,a4)]
.
.
.
[Q(sn,a1), Q(sn,a2), Q(sn,a3), Q(sn,a4)]
At each time step, the row corresponding to the current state is selected; then, depending on the policy, an action is picked and its Q-value is updated according to the SARSA rule.
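For reference, a minimal sketch of that tabular update (using NumPy; the sizes and the alpha/gamma values below are just placeholder assumptions):

import numpy as np

# Hypothetical sizes; the table has the same layout as the matrix above.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # assumed learning rate and discount factor

def sarsa_update(s, a, r, s_next, a_next):
    """One-step tabular SARSA update for a transition (s, a, r, s', a')."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error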
I am now trying to implement it as a neural network trained with gradient descent.
My first idea was to create a two-layer network: an input layer with as many neurons as there are states, and an output layer with as many neurons as there are actions, with every input fully connected to every output. (So, in fact, the weight matrix would look like the Q-matrix above.)
My input vector would be a 1xn row vector, where n is the number of input neurons. All values in the input vector would be 0, except for the index corresponding to the current state, which would be 1. I.e.:
[0 0 0 1 0 0]
Would be an input vector for an agent in state 4.
So, the process would be something like:
[0 0 0 1 0 0] X [4 7 9 3]
                [5 3 2 9]
                [3 5 6 9]
                [9 3 2 6]
                [2 5 7 8]
                [8 2 3 5]
where I have created a random sample weight matrix.
The result would be:
[9 3 2 6]
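As a concrete sketch of that forward pass (NumPy again; the weights are the sample values above and the variable names are just illustrative):

import numpy as np

W = np.array([[4, 7, 9, 3],
              [5, 3, 2, 9],
              [3, 5, 6, 9],
              [9, 3, 2, 6],
              [2, 5, 7, 8],
              [8, 2, 3, 5]], dtype=float)  # the sample 6x4 weight matrix above

x = np.zeros(6)
x[3] = 1.0  # one-hot input for state 4

q_values = x @ W  # -> [9. 3. 2. 6.], i.e. row 4 of W
greedy_action = int(np.argmax(q_values))  # -> 0, i.e. action 1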
This means that, under a greedy policy, action 1 would be picked, and that the connection between the fourth input neuron and the first output neuron should be strengthened by:
w_new = w_old + learning_rate*(reward + discount*next_output - w_old)
(Equation taken from the SARSA update rule, where next_output is the network's output for the next state-action pair.)
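In other words, written out for this one-hot/linear network, the update I have in mind is something like the following sketch (semi-gradient SARSA; the helper name and the alpha/gamma defaults are just placeholders):

def network_sarsa_update(W, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Semi-gradient SARSA step for the one-hot linear network sketched above.

    With a one-hot input, only the weight W[s, a] has a non-zero gradient,
    so the gradient-descent update touches a single entry of W.
    """
    q_sa = W[s, a]              # network output for (s, a)
    q_next = W[s_next, a_next]  # network output for (s', a')
    td_error = r + gamma * q_next - q_sa
    W[s, a] += alpha * td_error  # gradient of q_sa w.r.t. W[s, a] is 1
    return W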
HOWEVER - this implementation doesn't convince me. From what I have read, the network weights should be used to compute the Q-value of a state-action pair, but I'm not sure they should directly represent those values themselves. (Especially since I have usually only seen weight values bounded between 0 and 1.)
Any advice?