I am trying to run a distributed GCMLE training job and I keep getting the following error:
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
The trainer package is a custom estimator modeled much in the same way as the cloudml-samples census custom estimator: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/customestimator/trainer. It's safe to say the task.py files are pretty much identical, and within the model.py file the input_fn() and parse_csv() functions are the same; the only real differences are in the specifics of my model_fn().
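For context, the input pipeline looks roughly like this (a simplified sketch following the census sample's pattern; the column names and defaults below are placeholders, not my real schema):

import tensorflow as tf

# Placeholder schema -- my real CSV has different columns and defaults.
CSV_COLUMNS = ['feature_a', 'feature_b', 'label']
CSV_COLUMN_DEFAULTS = [[0.0], [''], ['']]
LABEL_COLUMN = 'label'

def parse_csv(rows_string_tensor):
    """Decode a batch of CSV lines into a dict of feature tensors."""
    columns = tf.decode_csv(rows_string_tensor, record_defaults=CSV_COLUMN_DEFAULTS)
    return dict(zip(CSV_COLUMNS, columns))

def input_fn(filenames, num_epochs=None, shuffle=True, batch_size=64):
    """Stream CSV records into the estimator, same shape as the census sample."""
    dataset = tf.data.TextLineDataset(filenames)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.map(parse_csv)
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    features = dataset.make_one_shot_iterator().get_next()
    labels = features.pop(LABEL_COLUMN)
    return features, labels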
If I configure my model to run with a single standard_p100 GPU, I can train at ~15 steps/sec. However, if I update my configuration to a distributed setting with 4 workers and 3 parameter servers (see config below), then the preemption error pops up and 10 steps take ~600 seconds...
config-distributed.yaml:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_p100
  workerType: standard_p100
  parameterServerType: large_model
  workerCount: 3
  parameterServerCount: 3
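For comparison, the single-GPU run that gets ~15 steps/sec uses just a lone standard_p100 master with no workers or parameter servers, roughly something like:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_p100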
If I use this same configuration with the census custom estimator sample, the model trains faster as expected and doesn't run into the preemption errors. I've tried modifying the census example to more closely mimic my exact code, but still haven't been able to reproduce the error.
Has anybody encountered similar preemption issues when trying to train a distributed ML Engine job? Any advice on how I could better debug the issue? The only advice I've found online was to have the number of parameter servers be at least half the number of workers (which is why I upped it to 3 parameter servers), but I'm still having no luck.
To add more context from the logs, this is a typical (repeated) pattern that happens when I try to train in a distributed setting:
master-replica-0 loss = 16.5019, step = 53 (124.505 sec)
master-replica-0 An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
master-replica-0 Graph was finalized.
master-replica-0 Restoring parameters from gs://.../model.ckpt-0
master-replica-0 Running local_init_op.
master-replica-0 Done running local_init_op.
master-replica-0 Saving checkpoints for 0 into gs://...
master-replica-0 Skip the current checkpoint eval due to throttle secs (600 secs).
master-replica-0 An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
And then this cycle repeats...
UPDATE
I increased the number of parameter servers to 10 and the steps/sec jumped to between 5 and 10 (still less than the 15 steps/sec with only a single GPU), and the error still happened, although a bit more sporadically. What would it suggest that adding parameter servers helped? CPU and memory utilization is very low regardless (< 5-10%), so it doesn't seem like the parameter servers are overloaded, but the model does have a lot of variables to update (50k word embeddings with a large number of dimensions). Could this somehow be contributing to the problems?
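In case it's relevant, the big variable is essentially one large embedding matrix created along these lines inside model_fn() (names, feature key, and exact dimensions are stand-ins):

import tensorflow as tf

# Stand-in sizes and feature key -- the real vocab is ~50k words with a
# large embedding dimension, and 'word_ids' is just a placeholder name.
VOCAB_SIZE = 50000
EMBEDDING_DIM = 512

def model_fn(features, labels, mode, params):
    # One large trainable matrix that lives on the parameter servers and is
    # looked up / updated by every worker on every step.
    embedding_table = tf.get_variable(
        'word_embeddings',
        shape=[VOCAB_SIZE, EMBEDDING_DIM],
        initializer=tf.random_uniform_initializer(-1.0, 1.0))
    word_vectors = tf.nn.embedding_lookup(embedding_table, features['word_ids'])
    # ... rest of the network, loss, and EstimatorSpec construction ...

Would sharding this variable across the parameter servers (e.g. passing tf.fixed_size_partitioner to tf.get_variable) be expected to help here, or is that unrelated to the preemption errors?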