
I'm completely new to reinforcement learning, so I may be wrong.

My questions are:

  • Is the Q-Learning equation ( Q(s, a) = r + y * max(Q(s', a')) ) used in DQN only for computing a loss function?

  • Is the equation recurrent? Suppose I use DQN to play, say, Atari Breakout. The number of possible states is very large (assuming a state is a single game frame), so it's not efficient to create a matrix of all the Q-values. The equation should update the Q-value of a given [state, action] pair, so what does it do in the case of DQN? Will it call itself recursively? If it does, the equation can't be calculated, because the recursion will never stop.

I've already tried to find what I'm looking for and I've seen many tutorials, but almost none of them show the background; they just implement it using a Python library like Keras.

Thanks in advance, and I apologise if something sounds dumb; I just don't get it.

anx199

2 Answers


Is the Q-Learning equation ( Q(s, a) = r + y * max(Q(s', a')) ) used in DQN only for computing a loss function?

Yes, generally that equation is only used to define our losses. More specifically, it is rearranged a bit; that equation is what we expect to hold, but it generally does not yet precisely hold during training. We subtract the right-hand side from the left-hand side to compute a (temporal-difference) error, and that error is used in the loss function.
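As a minimal sketch (the function and variable names below are my own, not from any particular library), the rearranged equation for a single transition (s, a, r, s') could look like this:

```python
import numpy as np

def td_loss(q_values_s, q_values_s_next, action, reward, gamma):
    """Squared TD error for one transition (s, action, reward, s').

    q_values_s      -- network outputs Q(s, .) for the current state
    q_values_s_next -- network outputs Q(s', .) for the next state
    """
    target = reward + gamma * np.max(q_values_s_next)  # right-hand side
    td_error = target - q_values_s[action]             # target minus prediction
    return td_error ** 2                               # the loss for this sample
```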

Is the equation recurrent? Suppose I use DQN to play, say, Atari Breakout. The number of possible states is very large (assuming a state is a single game frame), so it's not efficient to create a matrix of all the Q-values. The equation should update the Q-value of a given [state, action] pair, so what does it do in the case of DQN? Will it call itself recursively? If it does, the equation can't be calculated, because the recursion will never stop.

Indeed the space of state-action pairs is much too large to enumerate them all in a matrix/table. In other words, we can't use Tabular RL. This is precisely why we use a Neural Network in DQN though. You can view Q(s, a) as a function. In the tabular case, Q(s, a) is simply a function that uses s and a to index into a table/matrix of values.

In the case of DQN and other Deep RL approaches, we use a Neural Network to approximate such a "function". We use s (and potentially a, though not really in the case of DQN) to create features based on that state (and action). In the case of DQN and Atari games, we simply take a stack of raw images/pixels as features. These are then used as inputs for the Neural Network. At the other end of the NN, DQN provides Q-values as outputs. In the case of DQN, multiple outputs are provided; one for every action a. So, in conclusion, when you read Q(s, a) you should think "the output corresponding to a when we plug the features/images/pixels of s as inputs into our network".
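As a rough illustration only (the layer sizes below follow the classic DQN architecture, and a real implementation would also need experience replay, a target network, and a training loop), such a "function" could be written in Keras like this:

```python
import tensorflow as tf

num_actions = 4  # depends on the environment, e.g. Breakout's action set

# Input: a stack of 4 grayscale 84x84 frames. Output: one Q-value per action.
q_network = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 8, strides=4, activation='relu',
                           input_shape=(84, 84, 4)),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation='relu'),
    tf.keras.layers.Conv2D(64, 3, strides=1, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(num_actions)  # linear outputs: Q(s, a) for every a
])
```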


Further question from comments:

I think I still don't get the idea... Let's say we did one iteration through the network with state S and we got the following output [A = 0.8, B = 0.1, C = 0.1] (where A, B and C are possible actions). We also got a reward R = 1 and set y (a.k.a. gamma) to 0.95. Now, how can we put these variables into the loss function formula https://i.stack.imgur.com/Bu3S8.jpg? I don't understand what the prediction is if the DQN outputs which action to take? Also, what's the target Q? Could you post the formula with the variables filled in, please?

First a small correction: DQN does not output which action to take. Given inputs (a state s), it provides one output value per action a, which can be interpreted as an estimate of the Q(s, a) value for the input state s and the action a corresponding to that particular output. These values are typically used afterwards to determine which action to take (for example by selecting the action corresponding to the maximum Q value), so in some sense the action can be derived from the outputs of DQN, but DQN does not directly provide actions to take as outputs.

Anyway, let's consider the example situation. The loss function from the image is:

loss = (r + gamma max_a' Q-hat(s', a') - Q(s, a))^2

Note that there's a small mistake in the image: it has the old state s inside Q-hat instead of the new state s'; s' is what should be there.

In this formula:

  • r is the observed reward
  • gamma is (typically) a constant value
  • Q(s, a) is one of the output values from our Neural Network that we get when we provide it with s as input. Specifically, it is the output value corresponding to the action a that we have executed. So, in your example, if we chose to execute action A in state s, we have Q(s, A) = 0.8.
  • s' is the state we happen to end up in after having executed action a in state s.
  • Q-hat(s', a') (which we compute once for every possible subsequent action a') is, again, one of the output values from our Neural Network. This time, it's a value we get when we provide s' as input (instead of s), and again it will be the output value corresponding to action a'.

The Q-hat instead of Q is there because, in DQN, we typically use two different Neural Networks. Q-values are computed using the same Neural Network that we also modify by training. Q-hat values are computed using a different "Target Network". This Target Network is typically a "slower-moving" version of the first network: it is constructed by occasionally (e.g. once every 10K steps) copying the other network, and its weights are kept frozen between those copy operations.
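To make that concrete with the numbers from the comment above (note that the Q-hat values for s' below are invented purely for illustration, since they were not given; in practice they come from feeding s' into the Target Network):

```python
gamma = 0.95
r = 1.0

# Network output Q(s, .) for the current state s, from the example:
q_s = {'A': 0.8, 'B': 0.1, 'C': 0.1}
chosen_action = 'A'                      # suppose we executed action A in s

# Target Network output Q-hat(s', .) for the next state s' (made-up numbers):
q_hat_s_next = {'A': 0.4, 'B': 0.2, 'C': 0.3}

target = r + gamma * max(q_hat_s_next.values())  # 1 + 0.95 * 0.4 = 1.38
prediction = q_s[chosen_action]                  # Q(s, A) = 0.8
loss = (target - prediction) ** 2                # (1.38 - 0.8)^2 ≈ 0.336
```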

Dennis Soemers
  • Thank you very much for your reply! I still can't understand one thing, though: if the function looks like `Q(x) = Q(x')` (the equation "in short"), then doesn't it mean `Q(x)` will be called to get itself? – anx199 May 29 '18 at 10:32
  • @anx199 The `Q(s, a)` **function** does not look like that. The **function** looks like a (non-recursive) Neural Network. However, we know that in theory that recursive equation **should** hold. In practice, that equation will not yet hold during training. In DQN, we do try to train our Network in such a way that we get as close as possible to making that equation true whenever we separately compute `Q(s, a)` and `Q(s', a')` (each in a non-recursive manner, simply by using our NNs) – Dennis Soemers May 29 '18 at 10:40
  • I think I still don't get the idea... Let's say we did one iteration through the network with state S and we got following output `[A = 0.8, B = 0.1, C = 0.1]` (where `A`, `B` and `C` are possible actions). We also got a reward `R = 1` and set the `y` (a.k.a. `gamma`) to `0.95 `. Now, how can we put these variables into the loss function formula https://imgur.com/a/2wTj7Yn? I don't understand what's the `prediction` if the DQN outputs which action to take? Also, what's the `target Q`? Could you post the formula with placed variables, please? Thanks in advance and I apologise for still asking.. – anx199 May 29 '18 at 12:16
  • Thank you very much! Finally, I understand that. I really appreciate your help and patience :) – anx199 May 29 '18 at 12:57

Firstly, the Q function is used both in the loss function and for the policy. The actual output of your Q function and the 'ideal' one (the target from the equation) are used to calculate a loss. Taking the action with the highest Q-value among all possible actions in a state is your policy.
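As a tiny sketch of that policy (the names below are mine): given the Q-values the network outputs for a state, the greedy policy just picks the index of the largest one.

```python
import numpy as np

def greedy_action(q_values):
    """Return the index of the action with the highest Q-value estimate."""
    return int(np.argmax(q_values))

# With the outputs from the question (A, B, C -> indices 0, 1, 2):
greedy_action([0.8, 0.1, 0.1])  # 0, i.e. action A
```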

Secondly, no, it's not recurrent. The equation is actually slightly different from what you have posted (perhaps a mathematician can correct me on this). It is actually Q(s, a) := r + y * max(Q(s', a')). Note the colon before the equals sign. This is called the assignment operator and means that we update the left side so that it equals the right side once (not recurrently). You can think of it as being the same as the assignment operator in most programming languages (x = x + 1 doesn't cause any problems).
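Here is a sketch of the tabular version of that assignment (the state/action encodings are hypothetical), just to show it is a one-shot update rather than a recursive call:

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], missing entries default to 0.0
gamma = 0.95

def q_update(s, a, r, s_next, actions):
    # The right-hand side is evaluated with the CURRENT table entries,
    # then assigned to Q[(s, a)] exactly once -- no recursion involved.
    Q[(s, a)] = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
```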

The Q values will propagate through the network as you keep performing updates anyway, but it can take a while.

Omegastick