I'm trying to build a wide and deep TensorFlow model and train it on Google Cloud ML Engine.
I've been able to do this and train smaller dev versions.
However, I'm now trying to scale up to more data and more training steps, and my online training jobs keep failing.
The job runs for about 5 minutes and then I get the error below:
The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_123542&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_123542%22
When I look at the logs, this error seems to be the issue:
Command '['gsutil', '-q', 'cp', u'gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz', u'trainer-0.0.0.tar.gz']' returned non-zero exit status 1
I'm not really sure what's going on here. I have a feeling it could be to do with the machine types I am training the model on, but I've tried moving from "STANDARD_1" to "PREMIUM_1", and I have also tried custom machine types of "complex_model_l" and "large_model" for the parameter server.
There are about 1,400 features in the data I am using.
I'm only training it on one day of data for 1,000 steps, and I have reduced the batch size a lot. I can train it locally like this, but it's when I try to train it in the cloud (even with such a small number of steps) that I hit this error.
I'm not really sure what to try next...
It looks like the gsutil command is copying a packaged version of the trainer code (trainer-0.0.0.tar.gz) to a local worker and that copy is failing. I didn't think 1,400 features for a wide and deep model would be enough for me to have to worry about the model being too big, so I'm not really sure I'm right about what's going on here, as I would have expected using the other machine types and a custom config to have solved this.
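As a quick sanity check (run from my own shell, so it only confirms the staged package exists and shows who can read it, not that the workers themselves can) I could try something like:

gsutil ls -l gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz
gsutil acl get gs://pmc-ml/clickmodel/vy/output/packages/4fc20b9f4b7678fd97c8061807d18841050bd95dbbff16a6b78961303203e032/trainer-0.0.0.tar.gz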
P.S. Here is the YAML for the custom config I am using:
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: large_model
  parameterServerType: large_model
  workerCount: 15
  parameterServerCount: 10
And my call to train the model looks like:
gcloud ml-engine jobs submit training $JOB_NAME \
--stream-logs \
--job-dir $OUTPUT_PATH \
--runtime-version 1.2 \
--config $CONFIG \
--module-name trainer.task \
--package-path $PACKAGE_PATH \
--region $REGION \
--scale-tier CUSTOM \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--verbosity DEBUG \
--eval-steps 100 \
--num-layers 2 \
--first-layer-size 200 \
--scale-factor 0.99
The training data above is just one day's worth, so I'm pretty sure my issue is not too many rows of input or too many steps. My batch size is also 100.
UPDATE: I ran a hyperparameter tuning job, as that is actually what was working for me before. Here is the job info:
clickmodel_train_20171023_154805
Failed (10 min 19 sec)
Creation time: Oct 23, 2017, 4:48:08 PM
Start time: Oct 23, 2017, 4:48:12 PM
End time: Oct 23, 2017, 4:58:27 PM
Error message:
Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: num-layers=11, scale-factor=0.47899098586647881, first-layer-size=498, . The trial's error message was: The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module>
    tf.gfile.DeleteRecursively(args.job_dir)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
    pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
PermissionDeniedError: could not fully delete dir
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22
Training input
{
"scaleTier": "CUSTOM",
"masterType": "large_model",
"workerType": "standard_gpu",
"parameterServerType": "large_model",
"workerCount": "10",
"parameterServerCount": "5",
"packageUris": [
"gs://pmc-ml/clickmodel/vy/output/packages/326616fb7bab86d0d534c03f3260a0ff38c86112850b478ba28eca1e9d12d092/trainer-0.0.0.tar.gz"
],
"pythonModule": "trainer.task",
"args": [
"--train-files",
"gs://pmc-ml/clickmodel/vy/data/train_data_20170901*.csv",
"--eval-files",
"gs://pmc-ml/clickmodel/vy/data/dev_data_20170901*.csv",
"--train-steps",
"1000",
"--verbosity",
"DEBUG",
"--eval-steps",
"100",
"--num-layers",
"2",
"--first-layer-size",
"200",
"--scale-factor",
"0.99",
"--train-batch-size",
"100",
"--eval-batch-size",
"100"
],
"hyperparameters": {
"goal": "MAXIMIZE",
"params": [
{
"parameterName": "first-layer-size",
"minValue": 50,
"maxValue": 500,
"type": "INTEGER",
"scaleType": "UNIT_LINEAR_SCALE"
},
{
"parameterName": "num-layers",
"minValue": 1,
"maxValue": 15,
"type": "INTEGER",
"scaleType": "UNIT_LINEAR_SCALE"
},
{
"parameterName": "scale-factor",
"minValue": 0.1,
"maxValue": 1,
"type": "DOUBLE",
"scaleType": "UNIT_REVERSE_LOG_SCALE"
}
],
"maxTrials": 12,
"maxParallelTrials": 2,
"hyperparameterMetricTag": "accuracy"
},
"region": "us-central1",
"runtimeVersion": "1.2",
"jobDir": "gs://pmc-ml/clickmodel/vy/output"
}
But I am now getting the errors below:
{
insertId: "w77g2yg1zqa5fl"
logName: "projects/pmc-analytical-data-mart/logs/ml.googleapis.com%2Fclickmodel_train_20171023_154805"
receiveTimestamp: "2017-10-23T15:58:06.188221966Z"
resource: {…}
severity: "ERROR"
textPayload: "The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 193, in <module>
tf.gfile.DeleteRecursively(args.job_dir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 432, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
PermissionDeniedError: could not fully delete dir
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=642488228368&resource=ml_job%2Fjob_id%2Fclickmodel_train_20171023_154805&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22clickmodel_train_20171023_154805%22"
timestamp: "2017-10-23T15:58:06.188221966Z"
}
So it does look like a permissions issue indeed. I had added cloud-logs@google.com, cloud-ml-service@pmc-analytical-data-mart-8c548.iam.gserviceaccount.com, and cloud-ml@google.com as admins on the pmc-ml bucket, so I wonder if there is something else I'm missing.
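In case it is object-level ACLs rather than the bucket-level grants (I'm not sure the admin grants on the bucket cover objects that already exist), the kind of thing I'm thinking of trying is roughly this, using the ML Engine service account from above:

SVCACCT=cloud-ml-service@pmc-analytical-data-mart-8c548.iam.gserviceaccount.com
# write access on the bucket (needed to delete objects under the job dir)
gsutil acl ch -u ${SVCACCT}:WRITER gs://pmc-ml
# owner access on the existing objects under the output prefix
gsutil -m acl ch -r -u ${SVCACCT}:OWNER gs://pmc-ml/clickmodel/vy/output
# and have new objects pick up the same access by default
gsutil defacl ch -u ${SVCACCT}:OWNER gs://pmc-ml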
ANOTHER UPDATE
I also now see these errors in the logs; I'm not sure whether they are related or not:
{
insertId: "1986fw7g2uya0b9"
jsonPayload: {
created: 1508774246.95985
levelname: "ERROR"
lineno: 335
message: "2017-10-23 15:57:26.959642: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "7863680028519935658"
compute.googleapis.com/resource_name: "worker-f13b3addb0-7-s6dxq"
compute.googleapis.com/zone: "us-central1-c"
ml.googleapis.com/job_id: "clickmodel_train_20171023_154805"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "worker-replica-7"
ml.googleapis.com/trial_id: "1"
}
logName: "projects/pmc-analytical-data-mart/logs/worker-replica-7"
receiveTimestamp: "2017-10-23T15:57:32.288280956Z"
resource: {
labels: {
job_id: "clickmodel_train_20171023_154805"
project_id: "pmc-analytical-data-mart"
task_name: "worker-replica-7"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2017-10-23T15:57:26.959845066Z"
}
And
{
insertId: "11qijbbg2nchav0"
jsonPayload: {
created: 1508774068.64571
levelname: "ERROR"
lineno: 335
message: "2017-10-23 15:54:28.645519: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
pathname: "/runcloudml.py"
}
labels: {
compute.googleapis.com/resource_id: "2962580336091050416"
compute.googleapis.com/resource_name: "worker-a28b8b5d9c-8-ch8kg"
compute.googleapis.com/zone: "us-central1-c"
ml.googleapis.com/job_id: "clickmodel_train_20171023_154805"
ml.googleapis.com/job_id/log_area: "root"
ml.googleapis.com/task_name: "worker-replica-8"
ml.googleapis.com/trial_id: "2"
}
logName: "projects/pmc-analytical-data-mart/logs/worker-replica-8"
receiveTimestamp: "2017-10-23T15:54:59.620612418Z"
resource: {
labels: {
job_id: "clickmodel_train_20171023_154805"
project_id: "pmc-analytical-data-mart"
task_name: "worker-replica-8"
}
type: "ml_job"
}
severity: "ERROR"
timestamp: "2017-10-23T15:54:28.645709991Z"
}
I might strip my input data files right back to just 10 or so features, so that's one less variable, and then rerun the same hyperparameter job to see if I only get the permission errors next time; if so, we can focus on just that first. The other two look like memory errors, so maybe I just need bigger machines or smaller batches. I reckon I'd be able to google my way out of that one myself... I think... :)
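For the memory ones, one knob I can turn without changing anything else is the batch size the job already passes to the trainer (see the job args above), i.e. in the trailing trainer args of the same submit call, something like (just a sketch; 32 is a guess at a size that fits on a standard_gpu worker):

--train-batch-size 32 \
--eval-batch-size 32 \
--train-steps 1000 \
--verbosity DEBUG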
PARTIAL SOLUTION
OK, so after a lot of messing around, I think I had two issues.
- I was reusing the same output job-dir (gs://pmc-ml/clickmodel/vy/output) each time I ran a job. I think this was causing some issues when jobs failed and the next job then could not fully delete some leftover files for whatever reason. I'm not 100% sure this really was an issue, but it seems like better practice to have a new output folder for each job anyway (see the snippet after this list).
- I was passing "--scale-tier STANDARD_1" as an argument, and this seems to be what causes the problems (did I just make this argument up? If so, it's strange that it does not throw an error when validating the job).
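For the first issue, the habit I have settled on is just to timestamp the job name and derive the output dir from it, something like this (a sketch; the variable names are just what I use in my submit script):

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
JOB_NAME=clickmodel_train_${TIMESTAMP}
OUTPUT_PATH=gs://pmc-ml/clickmodel/vy/output_${JOB_NAME}
# then submit with: gcloud ml-engine jobs submit training $JOB_NAME --job-dir $OUTPUT_PATH (rest of the flags as before)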
So this works:
gcloud ml-engine jobs submit training test_023 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_023 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
But this fails:
gcloud ml-engine jobs submit training test_024 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_024 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--scale-tier STANDARD_1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
So I think my issue was that when I tried to start scaling up with a wider model and lots of data, I started passing some machine config type stuff via the command line args, and I'm not sure I was doing that correctly. It seems like I might be best leaving it all in the hptuning_config.yaml file and trying to scale out using a call like this:
gcloud ml-engine jobs submit training test_022 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_022 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--config /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/hptuning_config.yaml \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG
where hptuning_config.yaml looks like:
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: standard_gpu
  parameterServerType: large_model
  workerCount: 10
  parameterServerCount: 5
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 5
    maxParallelTrials: 2
    params:
      - parameterName: first-layer-size
        type: INTEGER
        minValue: 20
        maxValue: 500
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: num-layers
        type: INTEGER
        minValue: 1
        maxValue: 15
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: scale-factor
        type: DOUBLE
        minValue: 0.01
        maxValue: 1.0
        scaleType: UNIT_REVERSE_LOG_SCALE
So I will now try to add back in all my features, train on one day, and then try to scale up to more days and training steps, etc.
As to passing "--scale-tier STANDARD_1", I'm not really sure what the root cause is here, or whether there may be a bug or not. Originally I was thinking that, rather than worry about the different machine types etc., I would just pass "--scale-tier PREMIUM_1" when submitting the job and so (hopefully) not have to worry about the actual machine types at all. So I think there may still be some sort of issue here.
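For reference, the kind of call I had in mind is the same as test_024 above, just with PREMIUM_1 and a fresh job-dir (the job name here is just a placeholder, and I haven't verified yet whether this hits the same failure):

gcloud ml-engine jobs submit training test_premium_001 \
--job-dir gs://pmc-ml/clickmodel/vy/output_test_premium_001 \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path /home/andrew_maguire/localDev/codeBase/pmc-analytical-data-mart/clickmodel/trainer/ \
--region us-central1 \
--scale-tier PREMIUM_1 \
-- \
--train-files gs://pmc-ml/clickmodel/vy/rand_data/train_data_20170901_*.csv \
--eval-files gs://pmc-ml/clickmodel/vy/rand_data/dev_data_20170901_*.csv \
--train-steps 100 \
--verbosity DEBUG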