
I'm getting around 99.7% TPU idle time with my training code (https://github.com/ksjae/KoGPT2-train). What are the general methods used to reduce idle time? How can I (or any user in general) reduce it to a sane amount?

How can I find the culprit of long idle time?

*data available at gs://kogpt2/model

Most of the time is taken by prefetch, but it is very low, as seen in the profiler output below. The step time shows 99%+ idle.
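For anyone hitting the same thing, here is a minimal sketch of how a profiler trace can be captured around a few steps to see where the time actually goes. This assumes a TF 2.x-style training loop; the log directory, the dummy dataset, and `train_step` below are placeholders, not code from my repo:

```python
import tensorflow as tf

# Placeholder log directory; on a real TPU VM this would be a gs:// bucket
# that the TPU host can write to.
LOGDIR = "/tmp/profile-logs"

# Stand-in dataset and step function just to make the sketch runnable;
# in practice these would be the real input pipeline and training step.
dataset = tf.data.Dataset.range(1000).batch(32)

@tf.function
def train_step(batch):
    return tf.reduce_sum(batch)

# Capture a short trace around a handful of steps. The Profile tab in
# TensorBoard then breaks each step into compute vs. infeed/idle time,
# which shows whether the accelerator is waiting on the input pipeline.
tf.profiler.experimental.start(LOGDIR)
for batch in dataset.take(20):
    train_step(batch)
tf.profiler.experimental.stop()
```

If I remember correctly, the `capture_tpu_profile` tool from the `cloud-tpu-profiler` package does the same job for TPUEstimator-style training loops.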

  • This is probably due to the TPU having to wait a long time between batches. I would examine your input pipeline for steps that take a long time, and remove them. – Tom C Sep 19 '20 at 23:52
  • @TomC Enqueue takes a long time; how can I improve on that? – efe23eds Sep 21 '20 at 03:22
  • I looked at your code on GitHub, and it seems that in train.dataloader you define input_fn_builder, which is then called in train_tpu.py to train the model. Is this where the enqueue step is happening? You can always a) increase the number of CPUs in your VM, b) try out tf.data.experimental.AUTOTUNE for setting num_parallel_reads, etc. You can also try an out-of-memory shuffle to reduce the time taken for shuffling the data (a sketch of such a pipeline follows these comments). – Tom C Sep 21 '20 at 15:25
  • Out of curiosity, are those statistics you mentioned already considering the [`prefetch`](https://www.tensorflow.org/guide/data_performance#prefetching) operation? If not, does it help in your case? – Willian Fuks Sep 22 '20 at 19:02
  • @WillianFuks Yes, it is already prefetched. It looks like the code itself was quite problematic; other implementations were much more efficient (almost no idle time). – efe23eds Sep 23 '20 at 05:40
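To make the AUTOTUNE/prefetch suggestions above concrete, here is a minimal sketch of an input_fn along those lines. The file pattern, feature spec, and parse_example are placeholders rather than the actual code in input_fn_builder:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_example(serialized):
    # Placeholder parser; the real feature spec lives in the repo's dataloader.
    features = {"input_ids": tf.io.FixedLenFeature([1024], tf.int64)}
    return tf.io.parse_single_example(serialized, features)

def input_fn(params):
    # Placeholder file pattern for sharded TFRecords.
    files = tf.io.gfile.glob("gs://kogpt2/data/*.tfrecord")

    # Read several shards in parallel instead of one file at a time.
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=AUTOTUNE)

    # Parse on multiple host CPU threads; keep the shuffle buffer modest so
    # filling it does not stall the first steps.
    dataset = dataset.map(parse_example, num_parallel_calls=AUTOTUNE)
    dataset = dataset.shuffle(10_000)
    dataset = dataset.batch(params["batch_size"], drop_remainder=True)

    # Overlap host-side preprocessing with the accelerator's compute.
    return dataset.prefetch(AUTOTUNE)
```

The idea is that reading, parsing, and batching all happen on host CPU threads concurrently with the TPU's compute, so the enqueue step stops dominating the step time.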

0 Answers