
I am trying to run a distributed GCMLE training job and I keep getting the following error:

An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error

The trainer package is a custom estimator modeled in much the same way as the cloudml-samples census custom estimator: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/customestimator/trainer. The task.py files are essentially identical, and within model.py the input_fn() and parse_csv() functions are the same; the only real differences are in the specifics of my model_fn().
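
To give a sense of the shared input pipeline, this is roughly what the census-style parse_csv() and input_fn() look like (condensed sketch; the column names, defaults, and batch size below are placeholders, not my actual schema):

import tensorflow as tf

# Placeholder schema -- the real columns and defaults come from my CSV files.
CSV_COLUMNS = ['feature_a', 'feature_b', 'label']
CSV_DEFAULTS = [[0.0], [0.0], [0]]

def parse_csv(rows_string_tensor):
    # Decode one CSV line into a dict of feature tensors and pop off the label.
    columns = tf.decode_csv(rows_string_tensor, record_defaults=CSV_DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    label = features.pop('label')
    return features, label

def input_fn(filenames, batch_size=128, shuffle=True):
    # Stream CSV lines, parse them, and produce shuffled, repeated batches.
    dataset = tf.data.TextLineDataset(filenames).map(parse_csv)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()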

If I configure my model to run with a single standard_p100 GPU, I can train at ~15 steps/sec. However, if I update my configuration to a distributed setting with 4 workers (the master plus 3 workers) and 3 parameter servers (see config below), then the preemption error pops up and 10 steps take ~600 seconds...

config-distributed.yaml:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_p100
  workerType: standard_p100
  parameterServerType: large_model
  workerCount: 3
  parameterServerCount: 3
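
For reference, a job using this config would be submitted along these lines (job name, bucket paths, region, and runtime version are placeholders):

gcloud ml-engine jobs submit training my_job_name \
  --module-name trainer.task \
  --package-path trainer/ \
  --job-dir gs://my-bucket/my-job-dir \
  --region us-central1 \
  --runtime-version 1.10 \
  --config config-distributed.yaml \
  -- \
  --train-files gs://my-bucket/data/train.csv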

If I use this same configuration with the census custom estimator sample, then the model trains faster as expected and doesn't run into the preemption errors. I've tried modifying the census example to more closely mimic my exact code, but still haven't been able to reproduce the error.

Has anybody encountered similar preemption issues when trying to train a distributed ML Engine job? Any advice on how I could better debug the issue? The only advice I have found online suggested making the number of parameter servers at least half the number of workers (which is why I upped it to 3 parameter servers), but I'm still having no luck.

To add more context from the logs, this is a typical (repeated) pattern that happens when I try to train in a distributed setting:

master-replica-0 loss = 16.5019, step = 53 (124.505 sec)
master-replica-0 An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
master-replica-0 Graph was finalized.
master-replica-0 Restoring parameters from gs://.../model.ckpt-0
master-replica-0 Running local_init_op.
master-replica-0 Done running local_init_op.
master-replica-0 Saving checkpoints for 0 into gs://...
master-replica-0 Skip the current checkpoint eval due to throttle secs (600 secs).
master-replica-0 An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error

And then this cycle repeats...

UPDATE

I increased the number of parameter servers to 10 and the steps/sec jumped to between 5-10 (still less than the 15 with only a single GPU), and the error still happened, though a bit more sporadically. What would it suggest that more parameter servers helped? The CPU and memory utilization is very low regardless (< 5-10%), so it doesn't seem like the PSs are overloaded, but the model does have a lot of variables to update (50k word embeddings w/ a large number of dimensions). Could this somehow be contributing to the problems?
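
One thing I'm experimenting with (sketch below, not something I've confirmed fixes it) is sharding the embedding table across the parameter servers with a partitioner, so no single PS has to hold and serve the whole variable. Assuming TF 1.x estimator-style code, with the vocab size, embedding dimension, and shard count as placeholders:

import tensorflow as tf

# Placeholder sizes -- the real vocab/embedding dims come from my model config.
VOCAB_SIZE = 50000
EMBEDDING_DIM = 300
NUM_PS_SHARDS = 3

def build_embedding_table():
    # fixed_size_partitioner splits the variable along axis 0 into NUM_PS_SHARDS
    # pieces, so each parameter server stores and updates only a slice of it.
    with tf.variable_scope(
            'embeddings',
            partitioner=tf.fixed_size_partitioner(num_shards=NUM_PS_SHARDS)):
        return tf.get_variable(
            'word_embeddings',
            shape=[VOCAB_SIZE, EMBEDDING_DIM],
            initializer=tf.random_uniform_initializer(-0.05, 0.05))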

reese0106
  • I get this same error with some datasets and not others (I'm doing object detection). Have you figured out the exact source? I have raised the number of parameter servers to 17 and keep getting the error. Have you figured out at least how to debug it? – wircho Dec 20 '18 at 16:45
  • I believe the error was that I was exceeding network bandwidth -- I believe GCMLE network bandwidth is capped at 2 Gbps per CPU, with a max of 16 Gbps. It was suggested to me to run multi-GPU on a single VM. – reese0106 Dec 24 '18 at 03:36

1 Answer


The bottleneck in distributed training with lots of parameters is often the network bandwidth. If you saturate the network too much, packets get lost and TensorFlow thinks the parameter server is down. By adding more parameter servers you are able to distribute the network load.
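
As a rough back-of-envelope illustration (the embedding size and worker count here are hypothetical, and it assumes dense rather than sparse gradient updates):

# Dense gradient traffic per step for a 50k x 300 float32 embedding table.
vocab, dims, bytes_per_float = 50000, 300, 4
grad_bytes = vocab * dims * bytes_per_float        # ~60 MB per worker per step
workers = 4
gbits_per_step = grad_bytes * workers * 8 / 1e9    # ~1.9 Gbit pushed to the PSs
# At ~15 steps/sec that would be ~29 Gbps, well past the ~16 Gbps cap mentioned
# in the comments above, so throughput degrades until the network keeps up.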

Keep in mind that if your model is amenable to using GPUs, you will generally get much better throughput on a single machine with 8 GPUs rather than 8 machines with a single GPU because you won't have any networking overhead.
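
As a rough sketch of the single-machine multi-GPU route with a custom Estimator (assuming TF 1.x, 1.8 or later; the model_fn and model_dir are placeholders, and in 1.8-1.12 MirroredStrategy lives under tf.contrib.distribute):

import tensorflow as tf

def my_model_fn(features, labels, mode):
    # Stand-in for the existing custom model_fn defined in model.py.
    raise NotImplementedError

# Replicate the model across all GPUs on one VM and aggregate gradients locally,
# so there is no cross-machine parameter-server traffic.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=8)

run_config = tf.estimator.RunConfig(
    train_distribute=strategy,
    model_dir='gs://my-bucket/my-job-dir')  # placeholder path

estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,
    config=run_config)

This pairs with a single-VM machine type that exposes 8 GPUs, instead of the multi-worker configuration above.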

rhaertel80
  • Thanks! This is very helpful to understand why the increase in parameter servers would be helping. How would you define "amenable to using GPUs" -- this model uses convolution layers, dense layers and batch normalization (so I would think it is amenable to using GPUs). However, if I use a single machine with 8 GPUs I don't get a speed up because I imagine I have not properly constructed the graph to use each of the 8 towers, etc. -- I admittedly tried and failed to accomplish this. As a "quicker" approach that required less code change, I was hoping to use 8 machines with a single GPU – reese0106 Oct 17 '18 at 14:19
  • Are there any resources you would recommend on how to take advantage of a single machine with 8 GPUs? Are there any `cloudml-samples` that involve creating a "custom estimator" that can work across multiple GPUs? – reese0106 Oct 17 '18 at 14:21