
Why use two networks, training one every episode and updating the target network every N episodes, when we could use a single network and simply train it once every N episodes? There seems to be literally no difference!

1 Answer


What you are describing is not Double DQN. The periodically updated target network is a core feature of the original DQN algorithm (and all of its derivatives). DeepMind's classic paper explains why it is crucial to have two networks:

> The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets $y_j$ in the Q-learning update. More precisely, every $C$ updates we clone the network $Q$ to obtain a target network $\hat{Q}$ and use $\hat{Q}$ for generating the Q-learning targets $y_j$ for the following $C$ updates to $Q$. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases $Q(s_t, a_t)$ often also increases $Q(s_{t+1}, a)$ for all $a$ and hence also increases the target $y_j$, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to $Q$ is made and the time the update affects the targets $y_j$, making divergence or oscillations much more unlikely.
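
For concreteness, here is a minimal PyTorch sketch of the mechanism the quote describes. This is not code from the paper; the layer sizes, the sync period `C`, and the layout of `batch` are assumptions for illustration only.

```python
import copy
import torch
import torch.nn as nn

# Online network Q and its periodically cloned target network Q^.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, C = 0.99, 1000  # discount factor and target-sync period (assumed values)

def train_step(step, batch):
    states, actions, rewards, next_states, dones = batch
    # The targets y_j come from the *older* parameters (target_net), not q_net.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * next_q
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every C gradient updates, clone the online network into the target network.
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```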

Brett Daley
  • Yes, I know what Double DQN is! I don't understand why we should use it. For example, instead of training after every episode and updating the target network every 100 episodes, why don't we just collect the data from those 100 episodes and then train on all of it together? Our main network would give the same output as the target network, because we didn't train it during those 100 episodes, so we would get the same results as with Double DQN. Well, I think I already found the answer: using Double DQN is better because training the main network every episode affects which actions are chosen. –  Jan 25 '20 at 20:45
  • That is correct: you could delay learning like that and eliminate the target network, but you would be choosing actions based on the old policy and theoretically learn more slowly (the sketch after these comments contrasts the two schedules). On the other hand, it doesn't always matter much in practice; I just [published a paper](https://arxiv.org/abs/1810.09967) with an algorithm called DQN(λ) that doesn't use a target network, much like what you are describing, and it still achieves competitive results on a subset of the Atari 2600 games. – Brett Daley Jan 25 '20 at 21:27
  • @BrettDaley Sorry, I still don't understand why two networks are better than one, since we only update the prediction network and only update the target network every 100 episodes. That means we are still using the old policy, so why do two networks theoretically learn faster? – shtse8 Aug 01 '20 at 10:11
  • It can be unnecessary for easy problems, but for more complex problems it is an important mechanism that helps us avoid converging to local optima. – Mikhail May 21 '21 at 10:45
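
To make the trade-off discussed in these comments concrete, below is a hedged sketch contrasting the two schedules. Everything in it is illustrative: `QNet`, the fake random transitions, and the period `C` are assumptions, and neither loop is the original DQN or DQN(λ) implementation.

```python
import copy
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Tiny stand-in Q-network (4-dimensional state, 2 actions)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.fc(x)

def act(net, state):
    # Greedy action from whatever parameters `net` currently holds.
    with torch.no_grad():
        return int(net(state).argmax())

def td_update(net, target, opt, s, a, r, s2, gamma=0.99):
    # One-step TD update of `net` towards a target computed with `target`.
    with torch.no_grad():
        y = r + gamma * target(s2).max()
    loss = (net(s)[a] - y) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

def dqn_schedule(steps=500, C=100):
    # Standard DQN: act with the continuously updated online net;
    # only the targets come from a copy refreshed every C steps.
    online = QNet()
    target = copy.deepcopy(online)
    opt = torch.optim.Adam(online.parameters(), lr=1e-3)
    for step in range(1, steps + 1):
        s, s2 = torch.randn(4), torch.randn(4)     # fake transition
        a, r = act(online, s), random.random()     # behaviour policy is always fresh
        td_update(online, target, opt, s, a, r, s2)
        if step % C == 0:
            target.load_state_dict(online.state_dict())

def delayed_schedule(steps=500, C=100):
    # Single-network alternative from the comments: no target copy, but the one
    # network is frozen between bursts, so actions come from parameters that are
    # up to C steps old.
    net = QNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    replay = []
    for step in range(1, steps + 1):
        s, s2 = torch.randn(4), torch.randn(4)
        a, r = act(net, s), random.random()        # behaviour policy is stale
        replay.append((s, a, r, s2))
        if step % C == 0:
            for s, a, r, s2 in replay:             # train in one burst
                td_update(net, net, opt, s, a, r, s2)  # net is its own target
            replay.clear()
```

The targets are similarly stable in both loops; the difference is that in the second loop the behaviour policy only improves once every C steps, which is the slower learning Brett Daley mentions above.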