I have created a Monte Carlo simulation model implemented in TensorFlow 2.5. The model mostly consists of vector multiplications inside a `tf.while_loop`. I am benchmarking performance on a Linux machine with 8 virtual CPUs. When I run the model in graph mode (without XLA optimization), it fully utilizes all 8 CPUs (I can see %CPU close to 800% in the `top` command). However, when I run the model after compiling with XLA (by passing `jit_compile=True` to the `@tf.function` decorator), %CPU utilization stays at around 250%. Is there a way to force TensorFlow to utilize all available CPU capacity with XLA?
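For reference, a stripped-down sketch of the setup (the loop body below is a hypothetical placeholder; the real model's vector math is more involved):

```python
import tensorflow as tf

# Remove jit_compile=True to reproduce the non-XLA (plain graph mode) run.
@tf.function(jit_compile=True)
def monte_carlo(num_steps, state):
    i = tf.constant(0)

    def cond(i, s):
        return i < num_steps

    def body(i, s):
        # Placeholder for the actual per-step vector multiplications.
        return i + 1, s * 1.0001

    _, final_state = tf.while_loop(cond, body, [i, state])
    return final_state

result = monte_carlo(tf.constant(10_000), tf.random.normal([1_000_000]))
```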
I have experimented with changing the `inter_op_parallelism` and `intra_op_parallelism` settings. While setting both thread counts to 1 reduces CPU utilization from 250% to 100%, increasing them to 8 does not raise utilization beyond 250%.
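This is how I set the thread pools, using the standard `tf.config.threading` API (called at the top of the script, before any op runs, since the settings cannot be changed once the runtime is initialized):

```python
import tensorflow as tf

# Must run before TensorFlow initializes its runtime.
tf.config.threading.set_inter_op_parallelism_threads(8)
tf.config.threading.set_intra_op_parallelism_threads(8)
```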
Any help or suggestions on what might be going on would be appreciated.