
So I'm trying to implement the Deep Q-learning algorithm created by Google DeepMind, and I think I have a pretty good hang of it now. Yet there is still one (pretty important) thing I don't really understand, and I hope you can help.

Doesn't y_j evaluate to a double (in Java) and the latter part to a matrix containing the Q-values for each action in the current state, in the following line (the 4th last line in the algorithm):

Perform a gradient descent step on (y_j - Q(φ_j, a_j; θ))²

So how can I subtract them from each other?

Should I make y_j a matrix containing all of the Q-values the network outputs for the current state, except with the value for the currently selected action replaced by the target

r_j + γ max_a’ Q(φ_{j+1}, a’; θ)

This doesn't seem like the right answer, and I'm a bit lost here, as you can see.


[Image: Algorithm 1, deep Q-learning with experience replay, from the DeepMind paper]

Dope
  • As I see it: the Q-part is also 1-dimensional, as its action is fixed to some action a priori. Look at the pseudocode in your post. ```a_t``` will be selected as the single action which maximizes the Q-function. Later ```a_t``` will be added to the replay memory, where it becomes ```a_d``` (still a single fixed action) during sampling in a later step. – sascha Oct 08 '16 at 13:29
  • @sascha Yeah, I thought of that too, but then I couldn't figure out how to update the weights of my neural network, since shouldn't I calculate the errors for all outputs (actions in this case) in order to update the weights? Now if I update with this one error, it updates all weights as if all the outputs had the same error. So should I make an error matrix which contains zeros everywhere except at that action? Then it would only update the weights affecting that action, right? – Dope Oct 08 '16 at 13:49
  • Read up on the learning procedures of NNs. That's quite unconnected to the Q-learning framework here! Typically NNs are trained by SGD (with some mini-batch size). If you take a mini-batch size of 1, all the weights are updated too, despite the fact that you only observe one sample out of millions! That's how it works. The point of Q-learning is that the internal state of the Q-function changes and this one error is shifted to some lower error over time (model-free learning)! (And regarding your zeroing approach: no!) Just take this one sample action (from the memory) as one sample of an SGD step. – sascha Oct 08 '16 at 13:52
  • I'm sorry, I think I just don't get it. I have already created a working neural network with mini-batch gradient descent in which I first compute a matrix containing the errors of the output nodes. But now I don't know what kind of matrix I should create for the errors of the output nodes with this loss function, because the algorithm only calculates the error for one output. I understand supervised neural networks pretty well imo, but here I'm having problems with the target value. I guess I need to go over some more lectures. – Dope Oct 08 '16 at 14:34
  • If you have this knowledge, I don't understand the problem. You've got a target ```y_j``` and some input ```x``` (for this one target!). Just calculate the output of this input ```x``` with your NN; the error is then the squared difference between ```y_j``` and the output obtained. This error/loss is then used to adjust the weights within the NN with a backpropagation step. That's exactly what happens if you sample a single x during SGD (while in practice a mini-batch size > 1 is often used, resulting in a matrix of x, with one row per sample, and a y-vector with one value per sample). – sascha Oct 08 '16 at 14:40
  • I just can't understand how the error can be a double instead of a matrix of all the output errors. I think I need to go over my maths for back-propagation if you say it should work like that too. – Dope Oct 08 '16 at 14:56
  • Because it's only one sample and the NN outputs a single value. (Of course there are other networks, e.g. multi-class NNs where the outputs are class densities; but this NN here mimics a value function; it just returns **one value** for some input vector). – sascha Oct 08 '16 at 15:02
  • But isn't the NN supposed to output Q-values for all the actions from one input? So that is a matrix of outputs. I give an input (the state) to the NN. The NN then forward-propagates and outputs the approximated Q-values for each action. – Dope Oct 08 '16 at 15:09
  • No, it's not. If that's your understanding, you should take some steps back. [Wikipedia's Q-learning entry](https://en.wikipedia.org/wiki/Q-learning) explains: ```It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state``` – sascha Oct 08 '16 at 15:11
  • Read some RL introductions, especially in regards to the keywords **value function** and **policy function**. – sascha Oct 08 '16 at 15:14
  • Yeah, I know how it is in Q-learning, but as you can see in line 9 (counting from the very first line) of the algorithm, they only give it the current state and then use argmax to pick the highest output from the NN. And I remember them saying somewhere in the papers that the idea was that you only give the state and get all the Q-values in return, whereas some algorithms used state-action pairs as input and were considered too computationally expensive. Edit: You can see it here, page 2, figure 1: http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_et_al.pdf – Dope Oct 08 '16 at 15:15
  • Also, here is something more about this: https://www.nervanasys.com/demystifying-deep-reinforcement-learning/ (Ctrl+F and "deep q network"). So I guess we were talking about different kinds of neural networks, which maybe explains why I couldn't grasp your ideas. Thanks for trying to help though, and if you have any ideas about this I would appreciate them. Also, I hope I'm not completely wrong with my implementation here. Edit: Actually, I found the answer (it was the zero one)! – Dope Oct 08 '16 at 15:48

1 Answer

Actually found it myself. (Got it right from the start :D)

  1. Do a feedforward pass for the current state s to get the predicted Q-values for all actions.
  2. Do a feedforward pass for the next state s’ and calculate the maximum over all network outputs, max a’ Q(s’, a’).
  3. Set the Q-value target for the taken action to r + γ max a’ Q(s’, a’) (use the max calculated in step 2). For all other actions, set the Q-value target to the same value originally returned from step 1, making the error 0 for those outputs.
  4. Update the weights using backpropagation (see the sketch below this list).
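
A minimal sketch of these four steps in Java (the language mentioned in the question), assuming a hypothetical QNetwork interface with predict and train methods; the interface, class, and method names are illustrative assumptions, not part of any particular library:

```java
public final class DqnUpdateSketch {

    /** Hypothetical Q-network interface; names and signatures are assumptions for illustration. */
    interface QNetwork {
        double[] predict(double[] state);             // forward pass: Q-values for all actions
        void train(double[] state, double[] targets); // one backpropagation step towards the targets
    }

    static void update(QNetwork network, double[] state, int action, double reward,
                       double[] nextState, boolean terminal, double gamma) {
        // Step 1: feedforward pass for the current state s -> predicted Q-values for all actions
        double[] qCurrent = network.predict(state);

        // Step 2: feedforward pass for the next state s' and take the maximum over all outputs
        // (as the comment below notes, a separate target network Q^ can be used here instead)
        double[] qNext = network.predict(nextState);
        double maxNext = Double.NEGATIVE_INFINITY;
        for (double q : qNext) {
            maxNext = Math.max(maxNext, q);
        }

        // Step 3: targets equal the current predictions for every action except the one taken,
        // so the error is zero for all other outputs (terminal states use the plain reward,
        // as in the algorithm from the paper)
        double[] targets = qCurrent.clone();
        targets[action] = terminal ? reward : reward + gamma * maxNext;

        // Step 4: update the weights with one backpropagation step towards these targets
        network.train(state, targets);
    }
}
```

In other words, the "error matrix" from the comments above is just qCurrent with a single entry changed, so only the output unit for the taken action contributes a non-zero error to backpropagation.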
Dope
  • Thanks! I was looking for an answer to this question, which I asked here: https://datascience.stackexchange.com/questions/78070/deep-q-learning-how-to-set-q-value-of-non-selected-actions – Rasoul Jul 21 '20 at 21:39
  • Comment by user @Flo: In your steps 2) and 3), make sure you use "max a’ Q^(s’, a’)" and not "max a’ Q(s’, a’)", i.e. the output from the target network, not the main network. (Copied here to let them sleep at night, and delete their non-answer. ;-) ) – Yunnosch Jan 17 '23 at 17:57