
I want to profile a TensorFlow model on Cloud ML. When I use tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE), my process dies with a non-zero exit code and no details about what happened.
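For reference, here is roughly how I turn tracing on (a minimal sketch; `train_op` stands in for my actual training step):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Passing these options to a step is what triggers the crash:
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

    # Dump the collected step stats so they can be viewed in chrome://tracing
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())
```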

I tried adding and removing the code that turns on this option, and the process dies if and only if the option is enabled.

The error message is 'The replica master 0 exited with a non-zero status of 250. Termination reason: Error. To find out more about why your job exited please check the logs'

How can I diagnose and fix this problem?

Konstantin Solomatov
  • I'm an engineer on Cloud ML Engine. Sorry for the trouble; would you mind sharing a job id where this happens? If you don't want to post it publicly, you can email it to us at cloudml-feedback@google.com. – Jeremy Lewi May 01 '17 at 12:16
  • @JeremyLewi Thanks for the quick reply. It's a toy example; I'm just learning TensorFlow and experimenting with the CIFAR dataset on GPU. The job id is cifar_20170430_215857. If you need other information, let me know. – Konstantin Solomatov May 01 '17 at 12:45
  • @JeremyLewi Are there any updates on this? Did the job id help you reproduce the problem? If needed, I can send you the whole code to reproduce it. – Konstantin Solomatov May 02 '17 at 12:44
  • We are investigating. We think it might be the same segfault as in this [question](http://stackoverflow.com/questions/43651296/google-cloud-ml-exited-with-a-non-zero-status-of-245-when-training), so you might want to try the workaround described there (i.e. using TF 1.1.0). – Jeremy Lewi May 02 '17 at 23:04
  • @JeremyLewi After upgrading to TF 1.1.0 the exception disappeared, but I don't see any GPU profiling information in TensorBoard. It's all grayed out. – Konstantin Solomatov May 03 '17 at 02:02
  • @JeremyLewi The job id is cifar_20170502_215031 – Konstantin Solomatov May 03 '17 at 02:03
  • I reproduced the same problem on my local machine. It was fixed by adding /usr/local/cuda-8.0/extras/CUPTI/lib64 to LD_LIBRARY_PATH. – Konstantin Solomatov May 17 '17 at 01:08

2 Answers


The crash was fixed by using TensorFlow 1.1.0 instead of 1.0.0, though profiling information still wasn't shown.
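If you want to double-check which TensorFlow version the job actually picked up, a one-line sanity check works (a trivial sketch):

```python
import tensorflow as tf

# Log the runtime version; expect '1.1.0' after the upgrade.
print(tf.__version__)
```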

Konstantin Solomatov

For your question: exit status 250 means your code got a SIGABRT during the run (a process killed by signal N exits with -N, which as an unsigned byte is 256 - N; 256 - 6 = 250, and 6 is SIGABRT).

Update: there is an issue with loading libcupti. Cloud ML Engine has found a bug related to it; a fix is in progress and the problem will be resolved in a future release.
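In the meantime, a quick probe like the sketch below can tell you whether libcupti is loadable at all (the SONAME assumes CUDA 8.0; adjust it for your CUDA version):

```python
import ctypes

# Try to load the CUPTI library that GPU tracing (FULL_TRACE) depends on.
try:
    ctypes.CDLL('libcupti.so.8.0')
    print('libcupti loaded; GPU tracing should be available')
except OSError as err:
    print('libcupti not loadable:', err)
    print('Try adding /usr/local/cuda-8.0/extras/CUPTI/lib64 to '
          'LD_LIBRARY_PATH before launching the process')
```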

lwz1992