General Debugging
There are a few ways you can get more information on what the TPU is doing.
The most straightforward is adding tf.logging statements. If you're using TPUEstimator, you'll likely want this logging inside your model_fn, since that's where the core TPU-executed logic usually lives. Make sure your verbosity is set at a level that captures what you're logging (e.g. tf.logging.set_verbosity(tf.logging.INFO)). Note, however, that logging may impact the performance of your TPU more significantly than it would on other devices.
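As a minimal sketch of the idea: the model_fn body below is a placeholder (the names and structure are illustrative, not a working model), and tf.compat.v1.logging is used since that is where the TF 1.x tf.logging API lives in TF 2.x:

```python
import tensorflow as tf

# tf.logging in TF 1.x; the same API is available as tf.compat.v1.logging
# in TF 2.x. Raise verbosity so INFO-level messages actually appear.
logging = tf.compat.v1.logging
logging.set_verbosity(logging.INFO)

def model_fn(features, labels, mode, params):
    """Hypothetical TPUEstimator model_fn skeleton (body is illustrative)."""
    # Log from inside model_fn, where the core TPU-executed graph is built.
    logging.info("model_fn called with mode=%s", mode)
    # ... build the loss/train_op and return a TPUEstimatorSpec here ...
```

The set_verbosity call matters: at the default WARN level, tf.logging.info messages are silently dropped.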
You can also get detailed information on what ops are running and taking up resources on the TPU using the Cloud TPU tools. These tools will add extra tabs to your TensorBoard.
These tools are aimed more at performance tuning than at debugging, but they can still be useful for seeing which ops are running before a crash occurs.
Troubleshooting DeadlineExceededError
The specific issue you're running into may not be helped by more logging or profiling. The deadline exceeded error can be caused by an issue with the host connecting to the TPU. Normally when there's an error on the TPU, two stack traces will be returned, one from the host and one from the TPU. If you're not getting any trace from the TPU side, the host may have never been able to connect.
A quick troubleshooting step you can try is stopping and restarting your TPU server:
gcloud compute tpus stop $TPU_SERVER_NAME && gcloud compute tpus start $TPU_SERVER_NAME
This usually resolves any issues the host has communicating with the TPU. The command above comes from the very helpful TPU troubleshooting page.
That page also gives the most common reason the connection between the host and the TPU cannot be established in the first place:
If TensorFlow encounters an error during TPU execution, the script sometimes seems to hang rather than exit to the shell. If this happens, hit CTRL+\ on the keyboard to trigger a SIGQUIT, which causes Python to exit immediately.
Similarly, hitting CTRL+C during TPU execution does not shut down TensorFlow immediately, but instead waits until the end of the current iteration loop to exit cleanly. Hitting CTRL+\ causes Python to exit immediately.
If the TPU is still trying to finish the iteration loop from the last run, the host will be unable to connect. Using the suggested CTRL+\ can prevent this in the future.
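The CTRL+C vs. CTRL+\ difference above comes down to signal handling: SIGINT (CTRL+C) can be caught so the process finishes its current work, while SIGQUIT (CTRL+\) is not intercepted and kills the process at once. A small stdlib-only sketch, using a child process that mimics TensorFlow's "finish the iteration first" behavior on SIGINT (assumes a Unix-like system):

```python
import signal
import subprocess
import sys

# A child process that, like TensorFlow on CTRL+C, traps SIGINT and exits
# cleanly; SIGQUIT is left at its default (terminate immediately).
CHILD = r"""
import signal, sys, time
signal.signal(signal.SIGINT, lambda s, f: sys.exit(0))
print("ready", flush=True)
time.sleep(30)
"""

def run_and_signal(sig):
    """Start the child, wait until it is running, send sig, return exit code."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()   # block until the child prints "ready"
    proc.send_signal(sig)
    return proc.wait()

rc_int = run_and_signal(signal.SIGINT)    # like CTRL+C: handler runs, clean exit
rc_quit = run_and_signal(signal.SIGQUIT)  # like CTRL+\: no handler, killed at once

# On Linux: rc_int is 0, rc_quit is -SIGQUIT (negative = killed by that signal).
print(rc_int, rc_quit)
```

The second return code being negative is how subprocess reports "killed by a signal" rather than "exited on its own" — the immediate-exit behavior the troubleshooting page recommends for unsticking a hung TPU script.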