
I am trying to train a model using the canned DNNClassifier estimator on Google Cloud ML Engine.

I am able to train the model successfully both locally (in single and distributed mode) and on the cloud with the provided BASIC and BASIC_GPU scale tiers.

I am now trying to pass my own custom config file. When I specify only `masterType: standard` in the config file, without mentioning workers or parameter servers, the job runs successfully.

However, whenever I try adding workers, the job fails:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  workerCount: 4
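
For reference, the parameter servers mentioned above would be configured with two additional fields; the type and count below are illustrative, not values from the failing job:

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  workerCount: 4
  parameterServerType: standard   # illustrative
  parameterServerCount: 2         # illustrative
```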

Here is how I run the job (I get the same error without the staging bucket):

SCALE_TIER=CUSTOM
JOB_NAME=chasingdatajob_10252017_13
OUTPUT_PATH=gs://chasingdata/$JOB_NAME
STAGING_BUCKET=gs://chasingdata
gcloud ml-engine jobs submit training $JOB_NAME --staging-bucket "$STAGING_BUCKET" --scale-tier $SCALE_TIER --config $SIMPLE_CONFIG --job-dir $OUTPUT_PATH --module-name trainer.task --package-path trainer/ --region $REGION -- ...
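
As an aside, the job name is generated per submission so that each run gets its own output path; a sketch of how that can be done with the same variables as above:

```shell
# Build a unique job name (and hence a unique job-dir) per submission,
# using a date/time suffix so two runs never share an output path.
JOB_NAME="chasingdatajob_$(date +%m%d%Y_%H%M%S)"
OUTPUT_PATH="gs://chasingdata/$JOB_NAME"
echo "$OUTPUT_PATH"
```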

My job log shows that the job exited with a non-zero status of 1. I see the following error for worker-replica-3:

Command '['gsutil', '-q', 'cp', u'gs://chasingdata/chasingdatajob_10252017_13/e476e75c04e89e4a0f2f5f040853ec21974ae0af2289a2563293d29179a81199/trainer-0.1.tar.gz', u'trainer-0.1.tar.gz']' returned non-zero exit status 1

I've checked my bucket (gs://chasingdata). I see the chasingdatajob_10252017_13 directory created by the engine, but there is no trainer-0.1.tar.gz file. One other thing to mention: I am passing "tensorflow==1.4.0rc0" as a PyPI package to the cloud in my setup.py file. I don't think this is the cause of the problem, but thought I'd mention it anyway.
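
A minimal setup.py of the kind described above (a sketch only; the asker's actual file is not shown in the question):

```python
# Hypothetical setup.py matching the package name and version seen in the
# error message (trainer-0.1.tar.gz) and the stated TensorFlow pin.
from setuptools import find_packages, setup

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    install_requires=['tensorflow==1.4.0rc0'],
)
```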

Is there any reason for this error? Can someone please help me out?

Perhaps I am doing something stupid. I have tried, unsuccessfully, to find an answer to this.

Thanks a lot!!

  • Can you provide a directory listing: `gsutil ls -l -h gs://chasingdata/chasingdatajob_10252017_13/e476e75c04e89e4a0f2f5f040853ec21974ae0af2289a2563293d29179a81199` – rhaertel80 Oct 26 '17 at 14:35
  • Sure. Here it is: `...@angular-vector-181314:~$ gsutil ls -l -h gs://chasingdata/chasingdatajob_10252017_13/e476e75c04e89e4a0f2f5f040853ec21974ae0af2289a2563293d29179a81199` CommandException: One or more URLs matched no objects. – MarquesDeCampo Oct 26 '17 at 16:04
  • And `...@angular-vector-181314:~$ gsutil ls -l -h gs://chasingdata/chasingdatajob_10252017_13` 0 B 2017-10-25T19:25:10Z gs://chasingdata/chasingdatajob_10252017_13/ 77.55 KiB 2017-10-25T19:25:10Z gs://chasingdata/chasingdatajob_10252017_13/events.out.tfevents.1508959510.master-5252b8c60b-0-d522f TOTAL: 2 objects, 79410 bytes (77.55 KiB) – MarquesDeCampo Oct 26 '17 at 16:07
  • Can you remove `--staging-bucket` and see if that works? – rhaertel80 Oct 27 '17 at 09:05
  • @rhaertel80, thanks for your help. I have removed `--staging-bucket` and it still doesn't work. I've played around with this a little bit and I think the problem is with workerCount. When I change the workerCount to 2, the job runs successfully. But when I set workerCount to 4, it fails with the above error. Am I missing something? – MarquesDeCampo Oct 27 '17 at 18:07
  • That's extremely bizarre given the symptoms. Are you willing to post code publicly or send privately to cloudml-feedback@google.com? – rhaertel80 Oct 27 '17 at 19:46
  • Thank you, I'll send the code privately to the email. – MarquesDeCampo Oct 28 '17 at 08:17

1 Answer


The user code has logic that deletes an existing job-dir. This also deleted the staged user-code package in GCS, so workers that started late were unable to download the package.

We recommend using a separate job-dir for each job to avoid similar issues.
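
A minimal local sketch of this failure mode (all names here are hypothetical, and the real deletion happens in GCS; a temp directory stands in for the bucket):

```python
import os
import shutil
import tempfile

def stage_package(job_dir):
    """Simulate ML Engine staging trainer-0.1.tar.gz inside the job dir."""
    os.makedirs(job_dir, exist_ok=True)
    pkg = os.path.join(job_dir, "trainer-0.1.tar.gz")
    with open(pkg, "wb") as f:
        f.write(b"package contents")
    return pkg

def risky_startup(job_dir):
    """User code that wipes an existing job-dir -- the problematic pattern."""
    if os.path.exists(job_dir):
        shutil.rmtree(job_dir)  # also removes the staged package!
    os.makedirs(job_dir)

root = tempfile.mkdtemp()
job_dir = os.path.join(root, "chasingdatajob")

pkg = stage_package(job_dir)
risky_startup(job_dir)  # master wipes the dir before late workers start

# Workers that start after the wipe cannot download the package,
# which matches the gsutil cp failure seen on worker-replica-3.
late_worker_can_download = os.path.exists(pkg)
print(late_worker_can_download)  # False
```

The fix is simply to never reuse a job-dir: with a fresh path per job, there is nothing for the cleanup logic to delete out from under the workers.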

Guoqing Xu
  • Please note that this is fixed in the [census code](https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census) as well. – Puneith Kaul Dec 11 '17 at 23:50