6

I am training a model, and when I open the TPU in the Google Cloud Platform console, it shows me the CPU utilization (of the TPU, I suppose). It is really, really low (around 0.07%), so maybe it is the VM's CPU? I am wondering whether the training is actually running properly or whether the TPUs are just that powerful.

Is there any other way to check the TPU usage? Maybe with a ctpu command?

Derek T. Jones
  • 1,800
  • 10
  • 18
craft
  • 495
  • 5
  • 16
  • Yes, the "CPU utilization" tab on the GCP console is in fact a measurement of the CPU usage of the VM attached to the TPU. The work done by that VM is often related to the preparation and moving of memory to and from the TPUs. As Auberon says in his answer, the TPU profiling tools will give you the true picture of how idle the TPUs actually are. – Derek T. Jones Sep 23 '18 at 19:36
  • @DerekT.Jones OK, I see, that makes much more sense now. Though now I have problems with showing the performance in the TPU profiling tool. See another thread of mine. – craft Sep 24 '18 at 05:25

3 Answers

6

I would recommend using the TPU profiling tools that plug into TensorBoard. A good tutorial for installing and using these tools can be found here.

You'll run the profiler while your TPU is training. It will add an extra tab to your TensorBoard with TPU-specific profiling information. Among the most useful:

  • Average step time
  • Host idle time (how much time the CPU spends idling)
  • TPU idle time
  • Utilization of TPU Matrix units

Based on these metrics, the profiler will suggest ways to start optimizing your model to train well on a TPU. You can also dig into the more sophisticated profiling tools like a trace viewer, or a list of the most expensive graph operations.
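
In case it helps, the basic workflow looks roughly like this (a hedged sketch, not taken verbatim from the tutorial: $TPU_NAME and the gs://my-bucket/model directory are placeholders, and I'm assuming the capture_tpu_profile tool from the cloud-tpu-profiler pip package):

# Install the profiler on the VM attached to the TPU (assumed package name)
(vm)$ pip install --upgrade cloud-tpu-profiler

# While training is running, capture a profile into the model directory
(vm)$ capture_tpu_profile --tpu=$TPU_NAME --logdir=gs://my-bucket/model

# Point TensorBoard at the same directory; a Profile tab should appear
(vm)$ tensorboard --logdir=gs://my-bucket/model

The captured profile is what populates the TPU-specific tab with the metrics listed above.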

For some guidelines on performance tuning (in addition to those ch_mike already linked) you can look at the TPU performance guide.

Auberon López
  • 268
  • 1
  • 9
  • OK, seems like my host idle time is around 98%, which is bad, but the TPU idle time is 0%, so that sounds fishy. What CPU is this referring to, actually? The VM one? – craft Sep 21 '18 at 07:58
2

If you are looking at GCP -> Compute Engine -> TPU, you are looking at the correct spot. If you look at the monitoring graphs of the associated Compute Engine instance, you'll see that its CPU graph is different.

Currently, there doesn't seem to be any other way to get that information, since none of these options provides it:

gcloud compute tpus describe <tpu-name> --zone=<zone>

ctpu status --details

Nor does the TPU API.

As to whether your training is proper or not, that is hard to say; you can refer to Using TPU and make sure you are following the guidelines there. Another useful resource is Improving training speed.

ch_mike
  • 1,556
  • 6
  • 11
2

(vm)$ capture_tpu_profile --tpu=$TPU_NAME --monitoring_level=2

Setting monitoring_level=2 displays more detailed information:

TPU type: TPU v2
Number of TPU Cores: 8
TPU idle time (lower is better): 0.091%
Utilization of TPU Matrix Units (higher is better): 10.7%
Step time: 1.95 kms (avg), 1.90 kms (min), 2.00 kms (max)
Infeed percentage: 87.5% (avg), 87.2% (min), 87.8% (max)
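
For a quicker spot check, the same tool also accepts --monitoring_level=1, which prints a shorter, less detailed summary (the exact fields are described on the page linked below; the flag values here are the same placeholders as above):

(vm)$ capture_tpu_profile --tpu=$TPU_NAME --monitoring_level=1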

Ref: https://cloud.google.com/tpu/docs/cloud-tpu-tools#monitor_job

imcaspar
  • 652
  • 8
  • 9