I have created a Monte Carlo simulation model implemented in TensorFlow 2.5. The model mostly consists of vector multiplications inside a `tf.while_loop`. I am benchmarking performance on a Linux machine with 8 virtual CPUs. When I run the model in graph mode (without XLA optimization), it fully utilizes all 8 CPUs (I can see %CPU close to 800% in the `top` command). However, when I run the model after compiling with XLA (by passing `jit_compile=True` to the `@tf.function` decorator), %CPU utilization stays at around 250%. Is there a way to force TensorFlow to utilize all available CPU capacity with XLA?
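For reference, a stripped-down sketch of the setup (the loop body below is a hypothetical placeholder; the real model's vector math is more involved):

```python
import tensorflow as tf

# Remove jit_compile=True to reproduce the non-XLA (plain graph mode) run.
@tf.function(jit_compile=True)
def monte_carlo(num_steps, state):
    i = tf.constant(0)

    def cond(i, s):
        return i < num_steps

    def body(i, s):
        # Placeholder for the actual per-step vector multiplications.
        return i + 1, s * 1.0001

    _, final_state = tf.while_loop(cond, body, [i, state])
    return final_state

result = monte_carlo(tf.constant(10_000), tf.random.normal([1_000_000]))
```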
I have experimented with changing the `inter_op_parallelism` and `intra_op_parallelism` settings. While setting both thread counts to 1 reduces CPU utilization from 250% to 100%, increasing them to 8 does not raise utilization beyond 250%.
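This is how I set the thread pools, using the standard `tf.config.threading` API (called at the top of the script, before any op runs, since the settings cannot be changed once the runtime is initialized):

```python
import tensorflow as tf

# Must run before TensorFlow initializes its runtime.
tf.config.threading.set_inter_op_parallelism_threads(8)
tf.config.threading.set_intra_op_parallelism_threads(8)
```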
Any help or suggestions on what might be going on would be appreciated.