I'm trying to build a wide and deep TensorFlow model and train it on Google Cloud ML Engine.
I've been able to do this and train smaller dev versions.
However, I'm now trying to scale up to more data and more training steps, and my online training jobs keep failing.
The job runs for about 5 minutes and then I get the error below:
The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_123542&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_123542%22
When I look at the logs, this error seems to be the issue:
Command '['gsutil', '-q', 'cp', u'gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz', u'trainer-0.0.0.tar.gz']' returned non-zero exit status 1
I'm not really sure what's going on here. I have a feeling it could be to do with the machine types I am training the model on, but I've tried moving from "STANDARD_1" to "PREMIUM_1", and I have also tried custom machine types of "complex_model_l" and "large_model" for the parameter server.
There are about 1,400 features in the data I am using.
I'm only training it on one day of data for 1,000 steps, and I have reduced the batch size a lot. I can train it locally like this, but it's when I try to train it in the cloud (even with such a small number of steps) that I hit this error.
I'm not really sure what to try next...
It looks like the gsutil command is copying a packaged version of the trainer code (trainer-0.0.0.tar.gz) to a local worker and that copy is failing. I didn't think 1,400 features for a wide and deep model would be enough for me to have to worry about the model being too big, so I'm not really sure I'm right about what's going on here, as I would have expected using the other machine types and a custom config to have solved this.
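As a quick sanity check (run from my own shell, so it only confirms the staged package exists and shows who can read it, not that the workers themselves can) I could try something like:

gsutil ls -l gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz
gsutil acl get gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz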
P.S. Here is the YAML for the custom config I am using:
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: large_model
  parameterServerType: large_model
  workerCount: 15
  parameterServerCount: 10
And my call to train the model looks like:
gcloud ml-engine jobs submit training $JOB_NAME \
--stream-logs \
--job-dir $OUTPUT_PATH \
--runtime-version 1.2 \
--config $CONFIG \
--module-name trainer.task \
--package-path $PACKAGE_PATH \
--region $REGION \
--scale-tier CUSTOM \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--verbosity DEBUG \
--eval-steps 100 \
--num-layers 2 \
--first-layer-size 200 \
--scale-factor 0.99
The training data above is just one day's worth, so I'm pretty sure my issue is not too many rows of input or too many steps. My batch size is also 100.
UPDATE: I ran a hyperparameter tuning job, as that is actually what was working for me before. Here is the job info:
clickmodel_train_20171023_154805
Failed (10 min 19 sec)
Creation time: Oct 23, 2017, 4:48:08 PM
Start time: Oct 23, 2017, 4:48:12 PM
End time: Oct 23, 2017, 4:58:27 PM
Error message:
Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: num-layers=11, scale-factor=0.47899098586647881, first-layer-size=498, . The trial's error message was: The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module>
    tf.gfile.DeleteRecursively(args.job_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
    pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
PermissionDeniedError: could not fully delete dir
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22
Training input
{
"scaleTier": "CUSTOM",
"masterType": "large_model",
"workerType": "standard_gpu",
"parameterServerType": "large_model",
"workerCount": "10",
"parameterServerCount": "5",
"packageUris": [
"gs://pmc-ml/clickmodel/vy/output/packages/326616fb7bab86d0d534c03f3260a0ff38c86112850b478ba28eca1e9d12d092/trainer-0.0.0.tar.gz"
],
"pythonModule": "trainer.task",
"args": [
"--train-files",
"gs://pmc-ml/clickmodel/vy/data/train_data_20170901*.csv",
"--eval-files",
"gs://pmc-ml/clickmodel/vy/data/dev_data_20170901*.csv",
"--train-steps",
"1000",
"--verbosity",
"DEBUG",
"--eval-steps",
"100",
"--num-layers",
"2",
"--first-layer-size",
"200",
"--scale-factor",
"0.99",
"--train-batch-size",
"100",
"--eval-batch-size",
"100"
],
"hyperparameters": {
"goal": "MAXIMIZE",
"params": [
{
"parameterName": "first-layer-size",
"minValue": 50,
"maxValue": 500,
"type": "INTEGER",
"scaleType": "UNIT_LINEAR_SCALE"
},
{
"parameterName": "num-layers",
"minValue": 1,
"maxValue": 15,
"type": "INTEGER",
"scaleType": "UNIT_LINEAR_SCALE"
},
{
"parameterName": "scale-factor",
"minValue": 0.1,
"maxValue": 1,
"type": "DOUBLE",
"scaleType": "UNIT_REVERSE_LOG_SCALE"
}
],
"maxTrials": 12,
"maxParallelTrials": 2,
"hyperparameterMetricTag": "accuracy"
},
"region": "us-central1",
"runtimeVersion": "1.2",
"jobDir": "gs://pmc-ml/clickmodel/vy/output"
}
But I am now getting the errors below:
{
insertId: "w77g2yg1zqa5fl"
logName: "projects/pmc-analytical-data-mart/logs/ml.googleapis.com%2Fclickmodel_train_20171023_154805"
receiveTimestamp: "2017-10-23T15:58:06.188221966Z"
resource: {…}
severity: "ERROR"
textPayload: "The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module>
tf.gfile.DeleteRecursively(args.job_dir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
PermissionDeniedError: could not fully delete dir
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22"
timestamp: "2017-10-23T15:58:06.188221966Z"
}
So it does look like a permissions issue indeed. I had added cloud-logs@google.com, cloud-ml-service@pmc-analytical-data-mart-8c548.iam.gserviceaccount.com, and cloud-ml@google.com as admins on the pmc-ml bucket, so I wonder if there is something else I'm missing.
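In case it is object-level ACLs rather than the bucket-level grants (I'm not sure the admin grants on the bucket cover objects that already exist), the kind of thing I'm thinking of trying is roughly this, using the ML Engine service account from above:

SVCACCT=cloud-ml-service@pmc-analytical-data-mart-8c548.iam.gserviceaccount.com
# write access on the bucket (needed to delete objects under the job dir)
gsutil acl ch -u ${SVCACCT}:WRITER gs://pmc-ml
# owner access on the existing objects under the output prefix
gsutil -m acl ch -r -u ${SVCACCT}:OWNER gs://pmc-ml/clickmodel/vy/output
# and have new objects pick up the same access by default
gsutil defacl ch -u ${SVCACCT}:OWNER gs://pmc-ml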
ANOTHER UPDATE
I also now see these errors in the logs; I'm not sure whether they are related or not:
{
insertId: "1986fw7g2uya0b9"
jsonPayload: {
created: 1508774246.95985
levelname: "ERROR"
lineno: 335
message: "2017-10-23 15:57:26.959642: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "7863680028519935658"
compute.googleapis.com/resource_name: "worker-f13b3addb0-7-s6dxq"
compute.googleapis.com/zone: "us-central1-c"
ml.googleapis.com/job_id: "clickmodel_train_20171023_154805"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "worker-replica-7"
ml.googleapis.com/trial_id: "1"
}
logName: "projects/pmc-analytical-data-mart/logs/worker-replica-7"
receiveTimestamp: "2017-10-23T15:57:32.288280956Z"
resource: {
labels: {
job_id: "clickmodel_train_20171023_154805"
project_id: "pmc-analytical-data-mart"
task_name: "worker-replica-7"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2017-10-23T15:57:26.959845066Z"
}
And
{
insertId: "11qijbbg2nchav0"
jsonPayload: {
created: 1508774068.64571
levelname: "ERROR"
lineno: 335
message: "2017-10-23 15:54:28.645519: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "2962580336091050416"
compute.googleapis.com/resource_name: "worker-a28b8b5d9c-8-ch8kg"
compute.googleapis.com/zone: "us-central1-c"
ml.googleapis.com/job_id: "clickmodel_train_20171023_154805"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "worker-replica-8"
ml.googleapis.com/trial_id: "2"
}
logName: "projects/pmc-analytical-data-mart/logs/worker-replica-8"
receiveTimestamp: "2017-10-23T15:54:59.620612418Z"
resource: {
labels: {
job_id: "clickmodel_train_20171023_154805"
project_id: "pmc-analytical-data-mart"
task_name: "worker-replica-8"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2017-10-23T15:54:28.645709991Z"
}
I might strip my input data files right back to just 10 or so features, so that's one less variable, and then rerun the same hyperparameter job to see if I only get the permission errors next time; if so, we can focus on just that first. The other two look like memory errors, so maybe I just need bigger machines or smaller batches. I reckon I'd be able to google my way out of that one myself... I think... :)
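For the memory ones, one knob I can turn without changing anything else is the batch size the job already passes to the trainer (see the job args above), i.e. in the trailing trainer args of the same submit call, something like (just a sketch; 32 is a guess at a size that fits on a standard_gpu worker):

--train-batch-size 32 \
--eval-batch-size 32 \
--train-steps 1000 \
--verbosity DEBUG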
PARTIAL SOLUTION
OK, so after a lot of messing around, I think I had two issues.
- I was reusing the same output job-dir (gs://pmc-ml/clickmodel/vy/output) each time I ran a job. I think this was causing some issues when jobs failed and the next job then could not fully delete some leftover files for whatever reason. I'm not 100% sure this really was an issue, but it seems like better practice to have a new output folder for each job anyway (see the snippet after this list).
- I was passing "--scale-tier STANDARD_1" as an argument, and this seems to be what causes the problems (did I just make this argument up? If so, it's strange that it does not throw an error when validating the job).
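For the first issue, the habit I have settled on is just to timestamp the job name and derive the output dir from it, something like this (a sketch; the variable names are just what I use in my submit script):

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
JOB_NAME=clickmodel_train_${TIMESTAMP}
OUTPUT_PATH=gs://pmc-ml/clickmodel/vy/output_${JOB_NAME}
# then submit with: gcloud ml-engine jobs submit training $JOB_NAME --job-dir $OUTPUT_PATH (rest of the flags as before)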
So this works:
gcloud ml-engine jobs submit training test_023 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_023 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
But this fails:
gcloud ml-engine jobs submit training test_024 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_024 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--scale-tier STANDARD_1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
So I think my issue was that when I tried to start scaling up with a wider model and lots of data, I started passing some machine config type stuff via the command line args, and I'm not sure I was doing that correctly. It seems like I might be best leaving it all in the hptuning_config.yaml file and trying to scale out using a call like this:
gcloud ml-engine jobs submit training test_022 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_022 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--config /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/hptuning_config.yaml \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
where hptuning_config.yaml looks like:
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: standard_gpu
  parameterServerType: large_model
  workerCount: 10
  parameterServerCount: 5
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 5
    maxParallelTrials: 2
    params:
      - parameterName: first-layer-size
        type: INTEGER
        minValue: 20
        maxValue: 500
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: num-layers
        type: INTEGER
        minValue: 1
        maxValue: 15
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: scale-factor
        type: DOUBLE
        minValue: 0.01
        maxValue: 1.0
        scaleType: UNIT_REVERSE_LOG_SCALE
So I will now try to add back in all my features, train on one day, and then try to scale up to more days and training steps, etc.
As to passing "--scale-tier STANDARD_1", I'm not really sure what the root cause is here, or whether there may be a bug or not. Originally I was thinking that, rather than worry about the different machine types etc., I would just pass "--scale-tier PREMIUM_1" when submitting the job and so (hopefully) not have to worry about the actual machine types at all. So I think there may still be some sort of issue here.
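For reference, the kind of call I had in mind is the same as test_024 above, just with PREMIUM_1 and a fresh job-dir (the job name here is just a placeholder, and I haven't verified yet whether this hits the same failure):

gcloud ml-engine jobs submit training test_premium_001 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_premium_001 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--scale-tier PREMIUM_1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG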