I'm using the Xavier conv2d initializer to initialize my variables like this:

initializer = tf.contrib.layers.xavier_initializer_conv2d()
variable = tf.get_variable(name=name, shape=shape, initializer=initializer)

The training runs locally with gcloud ml-engine local train, but it crashes when I submit it as a job to the cloud.

The crash log: "Module raised an exception <type 'exceptions.SystemExit'>:-15."

If I replace the Xavier initializer with a random uniform initializer, the training works both on my local machine and in the cloud:

initializer = tf.random_uniform_initializer(-0.25, 0.25)

I'm running GPU-enabled TensorFlow version 1.0.1 on my local machine with Python 2.7.13.
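
For comparison with the fixed range of ±0.25 above, here is a minimal sketch of the bound that Xavier (Glorot) uniform initialization samples from, computed by hand for a conv2d kernel shape. The helper name `xavier_uniform_limit` and the example kernel shape are hypothetical, and the sketch assumes the uniform variant with fan-in and fan-out averaged over the receptive field. It is not a fix for the crash, only a way to try the same value range without calling the contrib initializer.

```python
import math


def xavier_uniform_limit(shape):
    """Return L so that Xavier/Glorot uniform samples lie in [-L, L].

    `shape` is assumed to be laid out as [..., in_channels, out_channels],
    e.g. [height, width, in_channels, out_channels] for a conv2d kernel.
    """
    if len(shape) < 2:
        raise ValueError("expected a shape with at least 2 dimensions")
    # Every dimension before the last two belongs to the receptive field.
    receptive_field = 1
    for dim in shape[:-2]:
        receptive_field *= dim
    fan_in = receptive_field * shape[-2]
    fan_out = receptive_field * shape[-1]
    return math.sqrt(6.0 / (fan_in + fan_out))


# Hypothetical 3x3 kernel mapping 64 input channels to 128 output channels.
limit = xavier_uniform_limit([3, 3, 64, 128])

# Assumed usage with the TF 1.x call already shown in the question:
# initializer = tf.random_uniform_initializer(-limit, limit)
```

If training with this range also succeeds in the cloud, that would suggest the problem lies in the contrib initializer call rather than in the scale of the initial weights.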

joaeba
  • Are you willing / able to share more of the log preceding the exit? There should be more details about the exception, although you may have to scroll up a bit and expand the entries. – rhaertel80 Mar 27 '17 at 23:52
  • @rhaertel80 here it is: `The replica master 0 exited with a non-zero status of 245. Program exit details: exit_code: 245 reason: "Error" message: "{\"exit_code\": -11}\n" started_at { seconds: 1490825252 } finished_at { seconds: 1490825349 } container_id: ` after that I get the signal 15 message. Does this help? – joaeba Mar 30 '17 at 13:47
  • That's a SEGFAULT. Let me check if this is a known problem. – rhaertel80 Apr 03 '17 at 16:39

0 Answers