
I want to profile a TensorFlow model on Cloud ML. When I use tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE), my process dies with a non-zero exit code and no details about what happened.
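For reference, here is roughly how I turn tracing on (a minimal sketch; `train_op` stands in for my actual training step):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Passing these options to a step is what triggers the crash:
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

    # Dump the collected step stats so they can be viewed in chrome://tracing
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())
```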

I tried adding and removing the code that turns on this option, and the process dies if and only if the option is enabled.

The error message is 'The replica master 0 exited with a non-zero status of 250. Termination reason: Error. To find out more about why your job exited please check the logs'

How can I diagnose and fix this problem?

Konstantin Solomatov
  • I'm an engineer on Cloud ML Engine. Sorry for the trouble; would you mind sharing a job id where this happens? If you don't want to post it publicly, you can email it to us at cloudml-feedback@google.com. – Jeremy Lewi May 01 '17 at 12:16
  • @JeremyLewi Thanks for the quick reply. It's a toy example; I'm just learning TensorFlow and experimenting with the CIFAR dataset on GPU. The job id is cifar_20170430_215857. If you need other information, let me know. – Konstantin Solomatov May 01 '17 at 12:45
  • @JeremyLewi Are there any updates on this? Did the job id help you reproduce the problem? If needed, I can send you the whole code to reproduce it. – Konstantin Solomatov May 02 '17 at 12:44
  • We are investigating. We think it might be the same segfault as in this [question](http://stackoverflow.com/questions/43651296/google-cloud-ml-exited-with-a-non-zero-status-of-245-when-training), so you might want to try the workaround described there (i.e. using TF 1.1.0). – Jeremy Lewi May 02 '17 at 23:04
  • @JeremyLewi After upgrading to TF 1.1.0 the exception disappeared, but I don't see any GPU profiling information in TensorBoard. It's all grayed out. – Konstantin Solomatov May 03 '17 at 02:02
  • @JeremyLewi The job id is cifar_20170502_215031 – Konstantin Solomatov May 03 '17 at 02:03
  • I reproduced the same problem on my local machine. It was fixed by adding /usr/local/cuda-8.0/extras/CUPTI/lib64 to LD_LIBRARY_PATH. – Konstantin Solomatov May 17 '17 at 01:08

2 Answers


The crash was fixed by using TensorFlow 1.1.0 instead of 1.0.0, though profiling information still wasn't shown.
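If you want to double-check which TensorFlow version the job actually picked up, a one-line sanity check works (a trivial sketch):

```python
import tensorflow as tf

# Log the runtime version; expect '1.1.0' after the upgrade.
print(tf.__version__)
```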

Konstantin Solomatov

For your question: exit status 250 means your code got a SIGABRT during the run (a process killed by signal N exits with -N, which as an unsigned byte is 256 - N; 256 - 6 = 250, and 6 is SIGABRT).

Update: there is an issue with loading libcupti. Cloud ML Engine has found a bug related to it; a fix is in progress and the problem will be resolved in a future release.
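In the meantime, a quick probe like the sketch below can tell you whether libcupti is loadable at all (the SONAME assumes CUDA 8.0; adjust it for your CUDA version):

```python
import ctypes

# Try to load the CUPTI library that GPU tracing (FULL_TRACE) depends on.
try:
    ctypes.CDLL('libcupti.so.8.0')
    print('libcupti loaded; GPU tracing should be available')
except OSError as err:
    print('libcupti not loadable:', err)
    print('Try adding /usr/local/cuda-8.0/extras/CUPTI/lib64 to '
          'LD_LIBRARY_PATH before launching the process')
```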

lwz1992