I am trying to train a model using the canned DNNClassifier estimator on Google Cloud ML Engine.
I can train the model locally in both single and distributed mode. I can also train it on the cloud with the provided BASIC and BASIC_GPU scale tiers.
I am now trying to pass my own custom config file. When I specify only "masterType: standard" in the config file, without mentioning workers or parameter servers, the job runs successfully.
However, whenever I try adding workers, the job fails:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  workerCount: 4
Here is how I run the job (I get the same error without mentioning the staging bucket):
SCALE_TIER=CUSTOM
JOB_NAME=chasingdatajob_10252017_13
OUTPUT_PATH=gs://chasingdata/$JOB_NAME
STAGING_BUCKET=gs://chasingdata
gcloud ml-engine jobs submit training $JOB_NAME \
    --staging-bucket "$STAGING_BUCKET" \
    --scale-tier $SCALE_TIER \
    --config $SIMPLE_CONFIG \
    --job-dir $OUTPUT_PATH \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    -- ...
My job log shows that the job exited with a non-zero status of 1. I see the following error for worker-replica-3:
Command '['gsutil', '-q', 'cp', u'gs://chasingdata/chasingdatajob_10252017_13/e476e75c04e89e4a0f2f5f040853ec21974ae0af2289a2563293d29179a81199/trainer-0.1.tar.gz', u'trainer-0.1.tar.gz']' returned non-zero exit status 1
I've checked my bucket (gs://chasingdata). I see the chasingdatajob_10252017_13 directory created by the engine, but it contains no trainer-0.1.tar.gz file. One more thing to mention: I am passing "tensorflow==1.4.0rc0" as a PyPI package to the cloud in my setup.py file. I don't think this is the cause of the problem, but I thought I'd mention it anyway.
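For anyone who wants to reproduce the bucket check, a recursive listing along these lines shows what is actually stored under the job directory. This is only a sketch: it assumes gsutil is installed and authenticated for the project, and the JOB_DIR variable is just a name introduced here for illustration (it holds the same value as OUTPUT_PATH above).

```shell
# Values copied from the job submission above.
JOB_NAME=chasingdatajob_10252017_13
# JOB_DIR is an illustrative name; it equals OUTPUT_PATH from the submission.
JOB_DIR="gs://chasingdata/$JOB_NAME"

# Recursively list everything under the job directory. The staged
# trainer-0.1.tar.gz would be expected under a hash-named subdirectory,
# as in the path reported by the failing 'gsutil cp' command.
# Guarded so the snippet does nothing where gsutil is not installed.
if command -v gsutil >/dev/null 2>&1; then
  gsutil ls -r "$JOB_DIR/" || echo "listing failed or nothing found under $JOB_DIR"
fi
```

If the tarball is missing from the listing, the package was either never staged there or was removed after staging, which would explain why the worker's copy command fails.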
Is there any reason for this error? Can someone please help me out?
Perhaps I am doing something stupid. I have tried (unsuccessfully) to find an answer for this.
Thanks a lot!!