Questions About Deep Q-Learning

Question

I have read several materials about deep Q-learning and I'm not sure I understand it completely. From what I learned, instead of storing the Q-values in a table, Deep Q-learning uses a neural network (NN) to approximate them by regression: it computes a loss and backpropagates the error to update the weights. Then, at test time, the NN takes a state and returns one Q-value for each action possible in that state, and the action with the highest Q-value is chosen.
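The test-time step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the "network" is a hypothetical linear model with one weight vector per action, and the names `q_values` and `greedy_action` are made up for this example, not taken from any library.

```python
def q_values(state, weights):
    """Return one Q-value per action for the given state.

    Assumption: a linear stand-in for the NN, Q(s, a) = weights[a] . state.
    """
    return [sum(w_i * s_i for w_i, s_i in zip(w, state)) for w in weights]


def greedy_action(state, weights):
    """Pick the index of the action with the highest Q-value."""
    qs = q_values(state, weights)
    return max(range(len(qs)), key=lambda a: qs[a])


# Hypothetical weights for 3 actions over a 2-feature state.
example_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
example_state = [0.2, 0.9]
chosen = greedy_action(example_state, example_weights)
```

The Q-values are never looked up anywhere; they come out of a forward pass through whatever model holds the weights.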

My only question is how the weights are updated. According to this site the weights are updated as follows:

I understand that the weights are initialized randomly, R is returned by the environment, and gamma and alpha are set manually, but I don't understand how Q(s',a,w) and Q(s,a,w) are initialized and calculated. Should we build a table of Q-values and update them as in Q-learning, or are they calculated automatically at each NN training epoch? What am I not understanding here? Can somebody explain this equation to me?

score 2 · Accepted Answer · answered Jun 26 '19 at 16:57

In Q-learning, we are concerned with learning the Q(s, a) function, which maps each state-action pair to a value. Say you have an arbitrary state space and an action space of 3 actions: each state then maps to three values, one per action. In tabular Q-learning, this is done with a literal table. Consider the following case:

Here, we have a Q-table with an entry for each state in the game (upper left). After each time step, the Q-value for the specific action taken is updated according to the reward signal. Future rewards can be discounted by a factor between 0 and 1.
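As a rough sketch of that tabular update, assuming the usual one-step Q-learning rule and a table that defaults unseen entries to 0 (the function name, state encoding and default hyperparameters below are illustrative assumptions):

```python
from collections import defaultdict


def tabular_q_update(Q, s, a, r, s_next, n_actions,
                     alpha=0.1, gamma=0.99, done=False):
    """One-step Q-learning update on a dictionary-backed table."""
    if done:
        # No future value after a terminal state.
        target = r
    else:
        # Reward plus discounted value of the best next action.
        target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
    # Move the stored entry a fraction alpha towards the target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)  # every unseen (state, action) entry starts at 0.0
new_value = tabular_q_update(Q, s=0, a=1, r=1.0, s_next=1, n_actions=3)
```

Only the single entry for the visited state and the action taken changes; every other entry in the table is left untouched.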

In Deep Q-Learning, we drop the table and create a parametrized "table" such as this:

Here, the weights combine the input so that the output approximately matches the values seen in the tabular case (how well this works is still actively researched).

The equation you presented is the Q-learning update rule expressed as a gradient update rule.
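The equation itself is not reproduced in this thread. A common way to write this kind of update, using the symbols from the question, is shown here as an assumption rather than a copy of the linked formula; the maximisation over the next action a' is part of that assumption:

```latex
\Delta w = \alpha \left[ R + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right] \nabla_{w} Q(s, a, w)
```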

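alpha is the step size
R is the reward
gamma is the discount factor

You run inference on the network to retrieve the value of the next state (which is discounted and added to the reward) and subtract from it the value of the current state-action pair. Neither value is stored or initialized separately: both are computed by a forward pass with the current weights. If this is unclear, I recommend looking up bootstrapping, which is basically what is happening here.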
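To make the roles of alpha, R and gamma concrete, here is a minimal sketch of one such gradient step. It assumes a linear approximator Q(s, a, w) = w[a] . s in place of a real neural network, so the gradient with respect to w[a] is simply the state vector; `q_value` and `semi_gradient_step` are hypothetical names. A full DQN would also use experience replay and a target network, which are left out here.

```python
def q_value(w, state, action):
    """Q(s, a, w) for a linear model: dot product of w[action] and state."""
    return sum(w_i * s_i for w_i, s_i in zip(w[action], state))


def semi_gradient_step(w, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning gradient step to w in place; return the TD error."""
    q_sa = q_value(w, s, a)  # "current" estimate from a forward pass
    if done:
        target = r
    else:
        # Bootstrapped target: reward plus discounted best next-state value,
        # also obtained from a forward pass with the same weights.
        target = r + gamma * max(q_value(w, s_next, b) for b in range(len(w)))
    td_error = target - q_sa
    # For the linear model, the gradient of Q(s, a, w) w.r.t. w[a] is s.
    for i, s_i in enumerate(s):
        w[a][i] += alpha * td_error * s_i
    return td_error


weights = [[0.0, 0.0], [0.0, 0.0]]  # 2 actions, 2 state features
error = semi_gradient_step(weights, s=[1.0, 0.0], a=0, r=1.0, s_next=[0.0, 1.0])
```

Nothing Q-related is stored between steps except the weights: both Q(s, a, w) and Q(s', a', w) are recomputed each time they are needed.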

answered Jun 26 '19 at 16:57

Per Arne Andersen


Thanks for your answer. I still have doubts about how the Q’s are initialized, and whether they are updated or just calculated at every step of the algorithm. – mad Jun 26 '19 at 18:45

This can be done as you wish for the tabular case. One way would be to update for every (s, a, s1) transition; another would be to wait until a certain number of timesteps has been reached, or until a terminal state occurs. It's really up to you. But the simplest is one-step TD – Per Arne Andersen Jun 27 '19 at 03:45
Another question: are these Q’s stored in a table, or are they calculated on the fly? If the former is true, I don’t understand why DQNs are more effective than Q-learning, as the big table construction step is also needed. Thanks again! – mad Jun 27 '19 at 05:26
In tabular Q-learning you would need one row, with the corresponding actions, for each state in the state space. This is easy and feasible for Gridworld and very simple environments with few actions, but for state spaces in the billions it becomes hard. Consider a case where you have 80x80x3 RGB images as your state. With 256 possible values per channel, this would lead to 256^(80x80x3) different states that could be in your table. DQN is not more effective as such; it is only better in general because it makes it feasible to encode and generalise over large tables (state spaces) – Per Arne Andersen Jun 27 '19 at 16:50
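For a sense of scale, the size of such a table can be estimated with a few lines of arithmetic. The sketch assumes 8-bit channels, so each of the 80 x 80 x 3 = 19,200 pixel values can take 256 levels and the number of distinct images is 256^19200; the variable names are illustrative.

```python
import math

height, width, channels = 80, 80, 3
levels_per_value = 256  # assumption: 8-bit colour channels

n_values = height * width * channels  # 19,200 numbers per image
# 256^n_values = 2^(8 * n_values); work in bits to avoid a huge integer.
bits = n_values * int(math.log2(levels_per_value))
# Approximate number of decimal digits in the state count.
decimal_digits = math.floor(bits * math.log10(2)) + 1
```

A table with one row per state would need a row count whose decimal representation alone runs to tens of thousands of digits, which is why a function approximator with a fixed number of weights is used instead.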
