
I've been trying to profile a predict call of a custom NN model using a Cloud TPU v2-8 Node.

It is worth noting that my prediction call takes about 2 minutes to finish, and that I run it on data divided into TFRecord batches.

I followed the official documentation "Profile your model with Cloud TPU Tools" and tried to capture a profile:

  1. Using the TensorBoard UI and
  2. The "programmatic way", with tf.profiler.experimental.start() and tf.profiler.experimental.stop() wrapping the predict call, but I had no success in either case.
import tensorflow as tf

# TPU Node connection is done before...
# TPU at this point is already running
logdir_path = "logs/predict"
tf.profiler.experimental.start(logdir_path)
# TensorFlow predict call here
tf.profiler.experimental.stop()

I could generate some data in both cases (TensorBoard UI and profiler call), but when I try to open it in TensorBoard, pointing it at the logdir path, I get the message "No dashboards are active for the current data set".

Is there any way to profile a TensorFlow/Keras prediction call for a model running on a Cloud TPU Node?



Curious fact - there seems to be an inconsistency between the TensorFlow docs and the Cloud TPU docs: the TensorFlow optimization docs state that the tf.profiler.experimental.start/stop calls are not supported on TPU hardware, but the Google Cloud docs recommend exactly this method for capturing a profile on a TPU.

Config:

  • TensorFlow 2.6.1
  • TensorBoard 2.9.1
  • Python 3.8
  • Cloud TPU Node v2-8
ILS

2 Answers

  1. Check the trace files in your logdir. If they are very small, tracing itself most likely failed.
  2. Make sure you typed the right command: $ tensorboard --logdir logs/predict
  3. Try another profiling method using tf.profiler.experimental.client.trace(...), as described in the TF profiler docs. Below is the code snippet.
import tensorflow as tf
from threading import Thread

def call_trace(tpu_resolver):  # This should be called asynchronously
  # a profiler service has been started in the TPU worker at port 8466
  service_addr = ":".join(tpu_resolver.get_master().split(":")[:-1] +
                          ["8466"])  # need to change for TPU pod
  tf.profiler.experimental.client.trace(service_addr=service_addr,
                                        logdir="gs://your_logdir",
                                        duration_ms=5000)

tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(...)
# Other initialization code

thr = Thread(target=call_trace, args=(tpu_resolver,))
thr.start()
# Code you want to execute on the Cloud TPU Node
thr.join()

Then open TensorBoard for visualization:

$ tensorboard --logdir gs://your_logdir
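As a quick sanity check for step 1, a small script like the one below can list what was actually captured. It only assumes the usual layout the profiler writes (a `plugins/profile/<run>/` subdirectory inside the logdir); if that directory is missing or the files are only a few bytes, the capture itself failed and TensorBoard will show the "No dashboards are active" message.

```python
import os

def list_profile_traces(logdir):
    """Walk `logdir` and return (path, size) pairs for files under plugins/profile.

    The profiler normally writes its output to <logdir>/plugins/profile/<run>/;
    an empty result here means no profile data was captured at all.
    """
    traces = []
    profile_root = os.path.join(logdir, "plugins", "profile")
    for root, _dirs, files in os.walk(profile_root):
        for name in files:
            path = os.path.join(root, name)
            traces.append((path, os.path.getsize(path)))
    return traces

for path, size in list_profile_traces("logs/predict"):
    print(f"{size:>10d}  {path}")
```

If this prints nothing, the problem is on the capture side, not on the TensorBoard side.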
ILS

For the TPU Node architecture, you can also try using cloud-tpu-profiler:

pip3 install --upgrade "cloud-tpu-profiler>=2.3.0" 

Then capture the profile using:

capture_tpu_profile --tpu=$TPU_NAME --logdir=${MODEL_DIR} --duration_ms=2000 --num_tracing_attempts=10

For details, you can refer to the Cloud TPU profiling documentation.

TPU VM is the recommended TPU architecture; when using TPU VMs, you can follow the "Profile TPU VM" guide instead.

Gagik