
In distributed TensorFlow, I use SyncReplicasOptimizerV2 to aggregate and apply gradients. But when one of the regular workers (most often the chief worker) finishes training, the other regular workers hang. How can I solve this problem?

OS: Ubuntu 14.04

TensorFlow version: 0.12.0-rc1

My code is here: https://github.com/xiaop1987/tf_distribute_lr

-----------------------------Update 1------------2016-12-20-------------

I applied a sync queue as Yaroslav Bulatov suggested. Now I can stop the parameter server successfully, but the other worker still hangs, with the following call stack:

[image: screenshot of the call stack]

Tianjin Gu
  • Do your parameter server (ps) workers hang, or do your regular (worker) ones hang as well? – Yaroslav Bulatov Dec 19 '16 at 17:54
  • Thanks for your attention. Only the regular workers hang, but once all the other workers hang, the ps workers' CPU usage stays at 0%, so I'm not sure whether the ps workers hang as well. – Tianjin Gu Dec 20 '16 at 04:31
  • This sounds possibly like intended behavior -- with SyncReplicas, the step only happens when all workers have finished their updates. So if one worker dies, the step will never complete. So perhaps when one worker is done you could kill all the other workers manually. You could do some trick with shared queues to have PS workers die when any worker completes; a dead PS will also kill the rest of the workers. – Yaroslav Bulatov Dec 20 '16 at 05:09
  • Thanks, but that solution sounds a little tricky for 2 reasons: 1. Killing all the other workers as soon as one worker finishes may leave some data untrained. 2. I'm not sure what exit code the workers get when they are killed by a dead PS, and I need the exit code to check whether my TensorFlow task succeeded or failed. Is there a way to sync the training processes of all the workers so that they exit together? – Tianjin Gu Dec 20 '16 at 05:43
  • 1. You could pad your data so that all workers see the same amount of data and exit together. 2. You could use queues as a signaling mechanism to tell other workers to quit gracefully, here's an example -- http://stackoverflow.com/questions/39810356/shut-down-server-in-tensorflow/40186129#40186129 – Yaroslav Bulatov Dec 20 '16 at 06:03
  • That looks wonderful, thanks a lot. – Tianjin Gu Dec 20 '16 at 06:06
  • @YaroslavBulatov - could you move your comment to an answer? :) – dga Nov 18 '17 at 15:19
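
The first suggestion in the comments (padding the data so that every worker runs the same number of synchronized steps) could look roughly like the sketch below. It is plain Python rather than TensorFlow code, and it is not taken from the linked repository: `pad_shards`, `num_workers` and `batch_size` are hypothetical names, and the sketch assumes the examples fit in a Python list and that repeating a few examples as padding is acceptable for training.

```python
import itertools
import math


def pad_shards(examples, num_workers, batch_size):
    """Split `examples` (a list) into one shard per worker and pad each shard
    so that every worker sees the same number of full batches.

    Returns (padded_shards, steps_per_worker). Hypothetical helper.
    """
    if not examples:
        raise ValueError("examples must be non-empty")

    # Round-robin split: shard sizes differ by at most one example.
    shards = [examples[i::num_workers] for i in range(num_workers)]

    # Every worker must run this many steps to cover the largest shard.
    steps = math.ceil(max(len(s) for s in shards) / batch_size)
    target = steps * batch_size

    padded = []
    for shard in shards:
        # Pad by repeating the shard's own examples; fall back to the full
        # dataset if the shard is empty (more workers than examples).
        filler = itertools.cycle(shard or examples)
        extra = list(itertools.islice(filler, target - len(shard)))
        padded.append(shard + extra)
    return padded, steps


if __name__ == "__main__":
    shards, steps = pad_shards(list(range(10)), num_workers=3, batch_size=2)
    print("steps per worker:", steps)
    for task_index, shard in enumerate(shards):
        print("worker", task_index, "->", shard)
```

Each worker would then train on its own padded shard for exactly `steps` iterations, so no worker is left waiting for gradients from a peer that has already finished. The queue-based signaling in the second suggestion remains useful for shutting down the parameter server afterwards.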

0 Answers