Why use 2 networks, training once every episode and updating the target network every N episodes, when we could use 1 network and simply train it ONCE every N episodes? There is literally no difference!
1 Answer
What you are describing is not Double DQN. The periodically updated target network is a core feature of the original DQN algorithm (and all of its derivatives). DeepMind's classic paper explains why it is crucial to have two networks:
> The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets $y_j$ in the Q-learning update. More precisely, every $C$ updates we clone the network $Q$ to obtain a target network $\hat{Q}$ and use $\hat{Q}$ for generating the Q-learning targets $y_j$ for the following $C$ updates to $Q$. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases $Q(s_t, a_t)$ often also increases $Q(s_{t+1}, a)$ for all $a$ and hence also increases the target $y_j$, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to $Q$ is made and the time the update affects the targets $y_j$, making divergence or oscillations much more unlikely.

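To make the clone-every-$C$-updates mechanism concrete, here is a minimal sketch of a DQN update step with a target network, assuming PyTorch; the network architecture, hyperparameters, and names such as `sync_every` are illustrative assumptions, not taken from the paper or the answer.

```python
# Minimal sketch of DQN's target-network mechanism in PyTorch.
# Network sizes and hyperparameters are illustrative only.
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)   # the "older set of parameters" Q^ from the quote
target_net.requires_grad_(False)

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
sync_every = 1000                   # C in the quote: clone Q into Q^ every C gradient updates


def update(batch, step):
    s, a, r, s_next, done = batch   # tensors: states, actions, rewards, next states, done flags
    # The targets y_j come from the frozen target network, not from q_net itself.
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every C updates, copy the online weights into the target network.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()


# Toy usage with random transitions, just to show the call pattern.
for step in range(1, 11):
    batch = (
        torch.randn(32, obs_dim),                    # s
        torch.randint(0, n_actions, (32,)),          # a
        torch.randn(32),                             # r
        torch.randn(32, obs_dim),                    # s'
        torch.randint(0, 2, (32,)).float(),          # done flags
    )
    update(batch, step)
```

The only extra machinery compared to a single-network version is the `deepcopy` and the periodic `load_state_dict`; everything else is ordinary Q-learning.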
- Yes, I know what Double DQN is! I don't understand why we should use it! E.g., instead of training after every episode and updating the target network every 100 episodes, why don't we just collect the data from those 100 episodes and then train on it all together? Our main network would give the same output as the target network, because we didn't train it during the last 100 episodes, so we would get the same results as using Double DQN (there is a sketch of this single-network variant after these comments). Well, I think I already found the answer: using Double DQN is better because training the main network every episode affects how actions are chosen. – Jan 25 '20 at 20:45
- That is correct, you could delay learning like that and eliminate the target network, but you would be choosing actions based on the old policy and theoretically learn more slowly. On the other hand, it doesn't always matter much in practice; I just [published a paper](https://arxiv.org/abs/1810.09967) with an algorithm called DQN(λ) that doesn't use a target network, like you are describing, and it still achieves competitive results on a subset of the Atari 2600 games. – Brett Daley Jan 25 '20 at 21:27
- @BrettDaley Sorry, I still don't understand why two networks are better than one. We only update the prediction network, and we only copy it into the target network every 100 episodes, which means we are still using an old policy either way. So why do two networks theoretically learn faster? – shtse8 Aug 01 '20 at 10:11
- It can be useless for easy problems, but for more complex problems it is an important ingredient that keeps us from converging to local optima. – Mikhail May 21 '21 at 10:45
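For contrast, here is a hedged sketch of the single-network variant proposed in the comments (freeze the weights, gather data for a while, then compute all targets up front and train in one burst). It assumes PyTorch; `delayed_update` and the shape of `transitions` are hypothetical names for illustration, not from the thread or any paper.

```python
# Sketch of the single-network alternative from the comments: act with frozen
# weights while collecting data, then train on the whole batch at once.
# Targets are computed up front with the frozen weights, so they match what a
# separate target network would have produced; the cost is that actions during
# collection came from a policy that is now many episodes old.
import torch
import torch.nn as nn


def delayed_update(q_net, optimizer, transitions, gamma=0.99, batch_size=32):
    s, a, r, s_next, done = transitions          # tensors gathered while q_net was frozen
    with torch.no_grad():                        # compute every target before any update
        y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values

    for start in range(0, len(s), batch_size):
        idx = slice(start, start + batch_size)
        q_sa = q_net(s[idx]).gather(1, a[idx].unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```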