The A3C algorithm (and n-step Q-learning) updates the globally shared network once every N timesteps. N is usually pretty small, 5 or 20 as far as I remember.
Wouldn't it be possible to set N to infinity, so that the networks are only trained at the end of an episode? I am not arguing that it is necessarily better (though to me it sounds like it could be), but at least it should not be much worse, right?
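To make the question concrete, here is a minimal sketch of what I mean, assuming the usual discounted-return definitions (the function names and the numpy layout are just mine, not taken from any particular A3C implementation): with N = infinity the targets are plain Monte Carlo returns over the finished episode, while the usual n-step version bootstraps from the critic's value estimate after at most N steps.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    """N = infinity: discounted return G_t over the whole finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def n_step_returns(rewards, values, gamma=0.99, n=5):
    """Usual A3C-style targets: bootstrap with V(s_{t+n}) after at most n steps.
    `values` holds the critic's estimates V(s_0)..V(s_T), with values[T] = 0
    if the episode terminated."""
    returns = np.zeros(len(rewards))
    for t in range(len(rewards)):
        end = min(t + n, len(rewards))
        g = values[end]  # bootstrap term, 0 at a terminal state
        for k in reversed(range(t, end)):
            g = rewards[k] + gamma * g
        returns[t] = g
    return returns
```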
One thing that might be a problem: A3C's training is stabilized without a replay memory because multiple agents explore different environments and push their updates asynchronously. That stabilization could be lost if the training were done sequentially (as in: each worker thread trains the network on its entire observed state-action-reward sequence at once). Though the training could still be done asynchronously on sub-sequences of the episode; that would only make training with stateful LSTMs a little more complicated.
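For example (again only a rough sketch with made-up helper names, not an existing API), each worker could still collect the whole episode and then push it to the shared network in fixed-length chunks, carrying the LSTM hidden state across chunk boundaries:

```python
def split_episode(trajectory, chunk_len=20):
    """Cut one finished episode (a list of (state, action, reward) tuples)
    into sub-sequences that can each be pushed as a separate async update."""
    return [trajectory[i:i + chunk_len] for i in range(0, len(trajectory), chunk_len)]

def push_episode_in_chunks(trajectory, chunk_len=20):
    hidden = None  # start-of-episode LSTM state (placeholder)
    for chunk in split_episode(trajectory, chunk_len):
        # Hypothetical calls, one asynchronous update per sub-sequence:
        #   grads, hidden = local_net.compute_gradients(chunk, hidden)  # stateful LSTM keeps `hidden`
        #   shared_net.apply_gradients(grads)
        #   local_net.load_weights_from(shared_net)  # resync local copy, as in A3C
        pass
```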
The reason I am asking is the "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" paper. To compare it to algorithms like A3C, it would make more sense, from a code-engineering point of view, to train both algorithms in the same episodic way.