
I have implemented Q-learning as described in this paper:

http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf

In order to approximate Q(S,A), I use a neural network with the following structure (sketched in code after the list):

  • Activation: sigmoid
  • Inputs: the state features plus 1 extra neuron for the action (all inputs scaled to 0-1)
  • Output: a single neuron, the Q-value
  • N hidden layers of M neurons each
  • Exploration: take a random action when 0 < rand() < propExplore
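For concreteness, here is a minimal NumPy sketch of that structure; the layer sizes, class name, and initialization are my own illustrative choices, not from the paper:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class QNetwork:
        """Q(s, a): state features plus one action neuron in, one Q-value out."""
        def __init__(self, n_state_inputs, n_hidden=8, n_hidden_layers=2, seed=0):
            rng = np.random.default_rng(seed)
            sizes = [n_state_inputs + 1] + [n_hidden] * n_hidden_layers + [1]
            self.weights = [rng.uniform(-0.5, 0.5, (a, b))
                            for a, b in zip(sizes[:-1], sizes[1:])]
            self.biases = [np.zeros(b) for b in sizes[1:]]

        def q_value(self, state, action):
            # All inputs scaled to 0-1; the action is appended as one extra input.
            x = np.append(np.asarray(state, dtype=float), action)
            for W, b in zip(self.weights, self.biases):
                x = sigmoid(x @ W + b)  # sigmoid activation throughout, as described
            return float(x[0])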

At each learning iteration, using the following formula,

QTarget = reward + gamma * max_a' Q(s', a')

I calculate a Q-target value, then compute an error using

error = QTarget - LastQValueReturnedFromNN

and backpropagate the error through the neural network.
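Put together, one learning iteration might look roughly like this; environment_step and net.backpropagate are hypothetical placeholders for your own environment and training code:

    import random

    GAMMA = 0.9  # discount factor (assumed value, not from the question)

    def learning_step(net, state, actions, environment_step, prop_explore=0.1):
        # Exploration as described: random action when 0 < rand() < propExplore.
        if random.random() < prop_explore:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: net.q_value(state, a))

        reward, next_state = environment_step(state, action)

        # QTarget = reward + gamma * max over a' of Q(next_state, a')
        q_target = reward + GAMMA * max(net.q_value(next_state, a) for a in actions)
        error = q_target - net.q_value(state, action)
        net.backpropagate(state, action, error)  # hypothetical training call
        return next_state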

Q1: Am I on the right track? I have seen some papers that implement an NN with one output neuron for each action.

Q2: My reward function returns a number between -1 and 1. Is it OK to return a reward in that range when the activation function is a sigmoid, whose output range is (0, 1)?

Q3: From my understanding of this method, given enough training instances, it should be guaranteed to find an optimal policy, right? When training on XOR, sometimes it learns after 2k iterations; sometimes it won't learn even after 40k-50k iterations.

Hamza Yerlikaya
  • You may consider also asking this at http://datascience.stackexchange.com/ – runDOSrun Dec 07 '14 at 13:07
  • My guess is the answers are yes, yes, yes, however neural networks are complex beasts. It's easy to give wrong parameters, learning rate, number of hidden units. In contrast, tabular Q learning is trivial to implement and debug. – maxy Dec 07 '14 at 13:17
  • @maxy Tabular Q-learning is trivial as long as the state space is rather small. – Luke Jan 05 '16 at 12:45
  • If you found the answer useful, would you mind accepting it? – Juan Leni Feb 28 '16 at 11:40

1 Answer


Q1. It is more efficient to put one output neuron per action: a single forward pass then gives you the Q-values of all actions for that state. In addition, the neural network will be able to generalize much better.
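A minimal sketch of that layout (state-only inputs, one output neuron per action; all names and shapes here are illustrative):

    import numpy as np

    def q_values_all_actions(state, hidden_layers, W_out, b_out):
        """One forward pass returns a Q-value for every action at once."""
        x = np.asarray(state, dtype=float)
        for W, b in hidden_layers:  # list of (weights, biases) pairs
            x = np.tanh(x @ W + b)  # hidden activation; the choice is illustrative
        return x @ W_out + b_out    # linear output layer: one Q-value per action

    # The greedy action is then a single argmax over that vector:
    # best = int(np.argmax(q_values_all_actions(s, hidden_layers, W_out, b_out)))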

Q2. Sigmoid is typically used for classification. While you can use sigmoid in the hidden layers, I would not use it in the last one: its (0, 1) output range cannot even represent your negative targets.
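For instance, keeping sigmoid in the hidden layer but making the output linear lets the network produce targets outside (0, 1); a minimal sketch:

    import numpy as np

    def forward(x, W_h, b_h, W_out, b_out):
        h = 1.0 / (1.0 + np.exp(-(x @ W_h + b_h)))  # sigmoid hidden layer
        # Linear output: unbounded, so targets anywhere in [-1, 1] are reachable.
        return h @ W_out + b_out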

Q3. Well... Q-learning with neural networks is famous for not always converging. Have a look at DQN (DeepMind). They solve two important issues: first, they decorrelate the training data with experience replay, since stochastic gradient descent doesn't like training data given in order; second, they bootstrap against old weights held in a separate target network, which reduces non-stationarity.
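A rough sketch of those two ideas together; the network methods (q_values, train_on, copy_weights_from) are assumed placeholders, not DeepMind's actual API:

    import random
    from collections import deque

    GAMMA, BATCH_SIZE, SYNC_EVERY = 0.99, 32, 1000  # assumed hyperparameters
    replay_buffer = deque(maxlen=10_000)            # stores (s, a, r, s') tuples

    def dqn_step(online_net, target_net, transition, step_count):
        replay_buffer.append(transition)
        if len(replay_buffer) >= BATCH_SIZE:
            # Experience replay: a random minibatch decorrelates the training data.
            for s, a, r, s_next in random.sample(list(replay_buffer), BATCH_SIZE):
                # Bootstrap against the frozen target network's (old) weights.
                q_target = r + GAMMA * max(target_net.q_values(s_next))
                online_net.train_on(s, a, q_target)   # hypothetical update call
        if step_count % SYNC_EVERY == 0:
            target_net.copy_weights_from(online_net)  # periodic sync (assumed API)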

Juan Leni