
What is the idea behind double DQN?

In double DQN, the Bellman equation used to calculate the target Q value for updating the online network is:

value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]

The Bellman equation used to calculate the Q value updates in the original DQN is:

value = reward + discount_factor * max(target_network.predict(next_state))

but the target network used to evaluate the action is itself periodically updated with the weights of the online_network, so the value it feeds into the target is basically an old Q value of that action.
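
To make the difference concrete, here is a minimal runnable sketch of both target computations (the toy QNetwork class, seeds, and reward/state values are placeholders for illustration only, not from any particular DQN implementation):

```python
import numpy as np

class QNetwork:
    """Stand-in for a trained Q-network: maps a state to one Q-value per action."""
    def __init__(self, n_actions, seed):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(size=(4, n_actions))  # toy linear model

    def predict(self, state):
        return state @ self.weights  # shape: (n_actions,)


def dqn_target(reward, next_state, target_net, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the greedy action.
    return reward + gamma * np.max(target_net.predict(next_state))


def double_dqn_target(reward, next_state, online_net, target_net, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it.
    best_action = np.argmax(online_net.predict(next_state))
    return reward + gamma * target_net.predict(next_state)[best_action]


online_net, target_net = QNetwork(2, seed=0), QNetwork(2, seed=1)
next_state, reward = np.ones(4), 1.0
print(dqn_target(reward, next_state, target_net))
print(double_dqn_target(reward, next_state, online_net, target_net))
```
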

Any ideas how adding another network, whose weights are copied from the first network, helps?

joseph
    [Artificial Intelligence Stack Exchange](https://ai.stackexchange.com/) is probably a better place to ask theoretical questions related to reinforcement learning, so I suggest that you ask your question there (although I think it has already been asked there). If you ask it there, please delete it from here (to avoid cross-posting, which is generally discouraged). – nbro Jul 10 '20 at 12:07

1 Answer


I really liked the explanation from here: https://becominghuman.ai/beat-atari-with-deep-reinforcement-learning-part-2-dqn-improvements-d3563f665a2c

"This is actually quite simple: you probably remember from the previous post that we were trying to optimize the Q function defined as follows:

Q(s, a) = r + γ maxₐ’(Q(s’, a’))

Because this definition is recursive (the Q value depends on other Q values), in Q-learning we end up training a network to predict its own output, as we pointed out last time.

The problem of course is that at each minibatch of training, we are changing both Q(s, a) and Q(s’, a’), in other words, we are getting closer to our target but also moving our target! This can make it a lot harder for our network to converge.

It thus seems like we should instead use a fixed target so as to avoid this problem of the network “chasing its own tail”, but of course that isn’t possible since the target Q function should get better and better as we train."
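
In practice this compromise is usually handled by keeping the target network frozen and copying the online network's weights into it only every fixed number of steps. A minimal sketch of that hard-update schedule (the sync interval, the toy weight list, and the fake training step are placeholder assumptions, not taken from the linked post):

```python
import numpy as np

SYNC_EVERY = 1000  # hypothetical sync interval, in gradient steps


def sync_target(online_weights):
    # Hard update: the target becomes a frozen snapshot of the online network.
    return [w.copy() for w in online_weights]


online_weights = [np.zeros((4, 2))]           # toy stand-in for real network weights
target_weights = sync_target(online_weights)  # start with identical copies

for step in range(10_000):
    # ... compute targets with target_weights, take a gradient step on online_weights ...
    online_weights[0] += 0.001  # placeholder for a real training update

    # Between syncs the target stays fixed, so the "tail" the network chases
    # does not move; each sync lets the target improve along with the policy.
    if step % SYNC_EVERY == 0:
        target_weights = sync_target(online_weights)
```

Double DQN reuses this same frozen target network, but lets the online network pick the action while the target network evaluates it, which reduces the overestimation that taking a max over noisy target estimates produces.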

ThelVadam