I'm running TensorFlow code on an Intel Xeon machine with 2 physical CPUs, each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, when I run the code with the system monitor open, I notice that only a small fraction of these 32 vCores is used and that the average CPU usage is below 10%.

I'm quite new to TensorFlow and I haven't configured the session in any way. My question is: should I somehow tell TensorFlow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else (for example, slow access to the hard disk)?

Gianluca Micchi

1 Answer

TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:

  • The most common case, as you point out, is a slow input pipeline (see the sketch after this list).
  • Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
  • You can also be limited by memory bandwidth.
  • A single session.run() call may do only a little work, so you end up spending most of the time going back and forth between Python and the execution engine.
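
If you do want to pin the thread pools yourself, or rule out the input pipeline, here is a minimal TF 1.x sketch. The thread counts, the toy dataset, and `parse_fn` are placeholders for illustration only, not anything from the question:

```python
import tensorflow as tf

# --- Explicit thread-pool configuration (optional; TF 1.x API) ---
# intra_op threads parallelize work inside a single op (e.g. one big matmul);
# inter_op threads run independent ops concurrently. The counts below are
# placeholders; 0 lets TensorFlow pick defaults based on the machine.
config = tf.ConfigProto(intra_op_parallelism_threads=16,
                        inter_op_parallelism_threads=2)

# --- Overlapping the input pipeline with computation (tf.data) ---
# `parse_fn` stands in for whatever per-example preprocessing you do.
def parse_fn(x):
    return tf.cast(x, tf.float32) / 255.0

dataset = (tf.data.Dataset.from_tensor_slices(tf.zeros([1000, 28, 28], tf.uint8))
           .map(parse_fn, num_parallel_calls=8)  # preprocess in parallel
           .batch(64)
           .prefetch(1))                         # prepare the next batch early
batch = dataset.make_one_shot_iterator().get_next()

with tf.Session(config=config) as sess:
    first_batch = sess.run(batch)
    print(first_batch.shape)  # (64, 28, 28)
```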

You can find useful suggestions here.

Use the timeline to see what is executed when.
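
For reference, a minimal sketch of collecting such a timeline with the TF 1.x tracing API; the matmul here is only a stand-in workload, in practice you would pass `options` and `run_metadata` to your own training step:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Run one step with full tracing enabled (TF 1.x API).
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    x = tf.random_normal([2000, 2000])
    y = tf.matmul(x, x)  # placeholder workload
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Convert the trace to Chrome's tracing format; open chrome://tracing
# in the browser and load the JSON file to see which ops ran when,
# and on which threads.
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```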

iga
  • Thank you, lots of info to digest there. I love that you pointed to the profiler; I hadn't been able to find it in TensorFlow. But could you expand on the linearity of the graph? A typical deep learning model is fundamentally linear, isn't it? I mean, you have a certain number of layers and the output of layer `n` becomes the input to layer `n+1`. Does that mean that more CPUs generally don't help in such a case? I thought that every CPU would take care of a different batch or something similar... – Gianluca Micchi May 25 '18 at 09:32
  • I guess my comment painted too bleak of a picture. Most common operations (e.g. matmul, conv, element-wise ops) have good parallelism support. If the dimensions are large enough, they will be spread across many cores. If the dimensions are not very large, or if you have memory-limited operations (e.g. transpose), TF can't do much. The classic case for this point is some RNN variant with a smallish hidden state and a long sequence to chew through. – iga May 25 '18 at 18:07
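
To make the point in the last comment concrete, here is a rough, hypothetical TF 1.x sketch; the shapes and loop length are arbitrary. Watch the system monitor while each fetch runs: the single large matmul should light up many cores via the intra-op thread pool, while the long chain of small, dependent matmuls mostly keeps one core busy.

```python
import time
import tensorflow as tf

# One large matmul: TF's intra-op thread pool can spread it over many cores.
big = tf.matmul(tf.random_normal([4000, 4000]), tf.random_normal([4000, 4000]))

# RNN-like chain: many small, dependent ops; each step is too small to
# parallelize well, and no step can start before the previous one finishes.
state = tf.random_normal([1, 64])
w = tf.random_normal([64, 64])
for _ in range(500):
    state = tf.tanh(tf.matmul(state, w))

with tf.Session() as sess:
    for name, fetch in [('one large matmul', big),
                        ('chain of small matmuls', state)]:
        start = time.time()
        sess.run(fetch)
        print('%s took %.3fs' % (name, time.time() - start))
```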