I cannot find any examples of the tfdbg tool being run with Cloud ML Engine. This post shows how to wrap a TensorFlow session with the debugger, but I have not found any way to run an ML Engine package in debug mode. Has anybody found a way to do this?
1 Answer
CloudML Engine does not support the interactive CLI debugger.
However, you should be able to use the offline debugger. How you get it to work in your case will depend on how your code is structured.
Suppose your code is written to accept a --job-dir command-line argument. When you submit your job, you will have something like this:
export JOB_NAME=my_job
export JOB_DIR=gs://my_bucket/$JOB_NAME
gcloud ml-engine jobs submit training $JOB_NAME ... --job-dir=$JOB_DIR ...
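If your trainer doesn't already parse that flag, a minimal sketch using argparse might look like this (the flag handling here is an assumption; adapt it to your own setup):
# Hypothetical argument parsing in the trainer package; the flag name
# matches the --job-dir passed to gcloud above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--job-dir', required=True,
                    help='GCS location for checkpoints and tfdbg dumps.')
args, _ = parser.parse_known_args()
job_dir = args.job_dir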
# Start with this code.
import os

from tensorflow.python import debug as tfdbg

# job_dir is on GCS and is passed on the command line if you specify
# it when submitting your training job.
dump_dir = os.path.join(job_dir, 'tfdbg_dumps')
For more info on watch_fn, see the docs.
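As a rough illustration (the node-name regex below is made up; see the WatchOptions docs for the full set of options), a watch_fn lets you restrict what gets dumped and can be passed to either of the wrappers shown below:
# Sketch of a watch_fn: called once per Session.run(), it returns
# WatchOptions describing which tensors to dump. The regex is only
# an example; adjust it for your own graph.
def my_watch_fn(fetches, feed_dict):
    return tfdbg.WatchOptions(
        debug_ops=["DebugIdentity"],
        node_name_regex_whitelist=r".*loss.*")

# Pass watch_fn=my_watch_fn to DumpingDebugWrapperSession or
# DumpingDebugHook below.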
Core TensorFlow (User-created Session)
If you're using "core" TensorFlow, i.e., creating your own session, then you will wrap the construction of any tf.Session objects like so:
sess = tfdbg.DumpingDebugWrapperSession(sess, dump_dir)
sess.run(fetches=my_fetches, feed_dict=my_feed_dict)
See DumpingDebugWrapperSession docs for more info.
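Putting it together, a toy end-to-end sketch (the graph and the bucket path are placeholders for your own model and job):
import os

import tensorflow as tf
from tensorflow.python import debug as tfdbg

job_dir = 'gs://my_bucket/my_job'  # in practice, parsed from --job-dir
dump_dir = os.path.join(job_dir, 'tfdbg_dumps')

# Toy graph; replace with your own model.
x = tf.placeholder(tf.float32, shape=[None])
loss = tf.reduce_mean(tf.square(x - 1.0))

sess = tf.Session()
# Wrap the session so every run() dumps tensors under dump_dir.
sess = tfdbg.DumpingDebugWrapperSession(sess, dump_dir)
print(sess.run(loss, feed_dict={x: [0.0, 2.0]}))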
Estimator API
If you are using learn_runner or Experiment, you can use DumpingDebugHook:
experiment = Experiment(
...,
train_monitors=[tfdbg.DumpingDebugHook(dump_dir)],
...
)
learn_runner.run(experiment)
Unfortunately, I cannot see a way to use filters such as tfdbg.has_inf_or_nan except with LocalCLIDebugHook, so you'll just have to analyze the tensors offline.
Offline analysis
Once the data is available in GCS, you can examine the dumps using the provided offline_analyzer executable module. You'll have to choose one of the run subdirectories:
python -m tensorflow.python.debug.cli.offline_analyzer \
--dump_dir=$JOB_DIR/tfdbg_dumps/run_XXXXXXX

- How would this be modified if I was running an experiment with learn_runner instead of sess.run directly? Is this worth a separate question or would it be best to update the answer here to cover both cases? – reese0106 Nov 03 '17 at 02:28
- Also, would an example of the watch_fn be the tfdbg.has_inf_or_nan if we are looking to debug why the loss is diverging to NaN? – reese0106 Nov 03 '17 at 02:32
- Updated the answer. – rhaertel80 Nov 03 '17 at 04:44
- I think this is working, but want to note some peculiar behavior I am seeing and understand if it is intended. When I go to the job logs it shows the loss at step 1 but then it stops (although the job still says running) and nothing else is being output into the job logs. When I look in GCS, there does appear to be some stuff in the tfdbg_dumps folder, but when I run --dump_dir=$JOB_DIR I got the error "Dump file path does not conform to the naming pattern: %s" % base) ValueError: Dump file path does not conform to the naming pattern: checkpoint" – reese0106 Nov 03 '17 at 17:19
- Separately, I ran with --dump_dir=$JOB_DIR/tfdbg_dumps and I got a separate error "tensorflow.python.framework.errors_impl.NotFoundError: The specified path gs://..jobdirhere../tfdbg_dumps/gs://..jobdirhere../tfdbg_dumps/run_1509727304147533_0/_tfdbg_core_metadata_sessionrun00000000000025_1509727305971236 was not found." which I am also not really sure how to interpret. Any idea what is going wrong? Would the job need to complete before the tfdbg_dumps are available to run the offline_analyzer? – reese0106 Nov 03 '17 at 17:21
- Sorry for the delay. I found a bug in offline_analyzer and have submitted a patch. I'll have to see if the TensorFlow team will be willing to backport the fix. In the meantime, perhaps I can send you the patch directly. What version of TF are you using? – rhaertel80 Nov 07 '17 at 15:34
- It appears that offline_analyzer will work with absolute paths. So until the tool is patched, copy one or more of the run* directories to your local machine/VM and provide the absolute path. – rhaertel80 Nov 07 '17 at 18:22
- I am trying to run through this again and never found a way to run tfdbg on the model. How can I run it in the context of tf.estimator.train_and_evaluate()? Would I need to pass a train_monitor to my tf.estimator.RunConfig? – reese0106 Dec 24 '18 at 16:25
- You can add hooks to the TrainSpec passed to Estimator.train_and_evaluate(). – rhaertel80 Jan 04 '19 at 00:35
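For instance, a rough sketch of that approach (the estimator and input functions are placeholders, not from the answer above):
# Sketch: attach tfdbg's dumping hook through TrainSpec.hooks.
# my_estimator, train_input_fn and eval_input_fn are placeholders.
train_spec = tf.estimator.TrainSpec(
    input_fn=train_input_fn,
    max_steps=10000,
    hooks=[tfdbg.DumpingDebugHook(dump_dir)])
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(my_estimator, train_spec, eval_spec)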