Google Cloud ML exited with a non-zero status of 245 when training

Question

I tried to train my model on Google Cloud ML using this sample code:

import keras
from keras import optimizers
from keras import losses
from keras import metrics
from keras.models import Model, Sequential
from keras.layers import Dense, Lambda, RepeatVector, TimeDistributed
import numpy as np

def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)

if __name__ == '__main__':
    test()

and I got this error:

The replica master 0 exited with a non-zero status of 245. Termination reason: Error.

The detailed error output is long, so I have put it on Pastebin.

In console.google.com go to the hamburger menu, choose "ML Engine > Jobs" and click on your job. Scroll to the bottom. How is your RAM usage? Could you have OOMed? — rhaertel80, Apr 27 '17 at 07:52
For this particular job it says 'There is no data for this chart'. But for my other job, which is more complex and has the same error, memory usage is 0.0359 — Alex, Apr 27 '17 at 08:19
The log output indicates you are hitting a segmentation fault. With your Cloud ML jobs are you specifying which version of TensorFlow you want to use? — Jeremy Lewi, Apr 27 '17 at 12:51
@JeremyLewi No, I didn't specify a version. I just tried running the job again with the test code and it works now. I'll try my main project later. — Alex, Apr 27 '17 at 15:06
It may be that your old project is using an old runtime version by default, which has an old version of numpy in which we've occasionally seen these segfaults — Eli Bixby, Apr 27 '17 at 16:29
@EliBixby I did specify runtime version 1.0. And by the way, this error is showing up again on the same test code that worked a few hours ago — Alex, Apr 27 '17 at 20:03
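
A note on the exit codes in this thread: on POSIX systems, Python's subprocess module reports a child killed by signal N as return code -N, so the "exit status -11" in the job log corresponds to signal 11 (SIGSEGV, a segmentation fault). The reported status of 245 is consistent with that same -11 truncated to an unsigned byte, although whether the Cloud ML wrapper simply passes the code through is an assumption, not something the logs confirm. A small sketch (the helper names are made up for illustration):

```python
import signal


def describe_exit_status(returncode):
    """Describe a subprocess return code using the POSIX convention."""
    if returncode < 0:
        # A negative code -N means the child was terminated by signal N.
        sig = signal.Signals(-returncode)
        return "killed by signal {} ({})".format(-returncode, sig.name)
    return "exited with status {}".format(returncode)


def as_unsigned_byte(returncode):
    """Truncate a return code to the 0..255 range a parent process sees."""
    return returncode & 0xFF


print(describe_exit_status(-11))
print(as_unsigned_byte(-11))
```

Read this way, both numbers point at a crash in native code rather than at a Python exception in the training script.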

score 0 · Answer 1 · answered Apr 27 '17 at 14:50

Note this output:

Module raised an exception for failing to call a subprocess Command '['python', '-m', u'trainer.test', '--job-dir', u'gs://my_test_bucket_keras/s_27_100630']' returned non-zero exit status -11.

I guess Google Cloud runs your code with an extra parameter called --job-dir, so perhaps you can try adding the following to your example code:

# ... (same imports as in the question) ...
import argparse

def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Input Arguments
    parser.add_argument(
      '--job-dir',
      help='GCS location to write checkpoints and export models',
      required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__

    test()
    # Or, if test() is changed to accept a job_dir parameter:
    # test(**arguments)

I'm not 100% sure this will work, but you can give it a try. I also have a post here about doing something similar; perhaps take a look at that as well.

Thanks. I actually followed this tutorial when I started using Google ML, and it worked then. But it looks like the code isn't the problem. — Alex, Apr 27 '17 at 15:14

score 0 · Accepted Answer · answered Apr 28 '17 at 15:40

The problem is resolved. All I had to do was use TensorFlow 1.1.0 instead of the default 1.0.1.

Alex

How did you change the tensorflow version? – Badger Cat May 01 '17 at 18:11
@BadgerCat Just add tensorflow==1.1.0 to the install requirements in setup.py – Alex May 01 '17 at 18:29
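
For reference, a minimal sketch of a setup.py with that pin. Only the tensorflow==1.1.0 requirement comes from the answer; the package name trainer is taken from the trainer.test module in the job log, and the version string is a placeholder.

```python
# setup.py (sketch): placed next to the trainer/ package directory
from setuptools import find_packages, setup

# Dependencies that pip installs together with the trainer package.
REQUIRED_PACKAGES = ['tensorflow==1.1.0']

setup(
    name='trainer',            # assumed from 'trainer.test' in the log
    version='0.1',             # placeholder, not from the original post
    packages=find_packages(),  # picks up trainer/ if it has an __init__.py
    install_requires=REQUIRED_PACKAGES,
)
```

With the pin in place, the job installs TensorFlow 1.1.0 on top of whatever the default runtime provides.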
