
I am creating, multiplying and then summing all elements of two big matrices in numpy. I do this some hundred times with two methods, a plain loop and with the help of the multiprocessing module (see the snippet below).

import numpy as np
from multiprocessing.pool import ThreadPool  # lives in multiprocessing, but uses threads, not processes

def worker_loop(n):
    # sequential version: create, multiply and sum one matrix pair per size i
    for i in n:
        mul = np.sum(np.random.normal(size=[i, i]) * np.random.normal(size=[i, i]))

def worker(i):
    # pooled version: the same work for a single size i
    mul = np.sum(np.random.normal(size=[i, i]) * np.random.normal(size=[i, i]))

n = range(100, 300)

pool = ThreadPool(2)
pool.map(worker, n)
pool.close()
pool.join()

worker_loop(n)
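
The timing itself was done roughly like this (a minimal sketch continuing the snippet above; the exact harness is not shown here):

import time

start = time.time()
pool = ThreadPool(2)
pool.map(worker, n)
pool.close()
pool.join()
print("pool: %.2f s" % (time.time() - start))

start = time.time()
worker_loop(n)
print("loop: %.2f s" % (time.time() - start))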

Measuring the time shows that the loop is faster than multiprocessing. I have also tried the threading module with no success (I later read that this was a bad idea; read more here).

I started experimenting with multithreading because I need to convert images, labels, bounding boxes, ... into tfrecords. For that I am studying a file from tensorflow/inception (if you want to dig in: build_imagenet_data.py, line 453). I believe multithreading works there, which is why they use it.

With that said, my questions can be put as follows:

  • what am I missing in my code; is it possible to get a speed-up with small modifications?
  • does the example from inception work because tensorflow is written in C++ and CUDA?
  • when is it advisable to use multiprocessing or multithreading with numpy, tensorflow and the like?
prometeu

1 Answer


There is always some overhead (synchronization, data preparation, data copies and so on).

But: given a good setup, your matrix-vector and vector-vector operations in numpy are already multithreaded, using BLAS (which is the state-of-the-art standard used everywhere, including numpy, matlab and probably tensorflow's CPU backend; there are different implementations, though).
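
You can check which BLAS implementation your numpy build is linked against (a quick check; the output format differs between numpy versions and builds):

import numpy as np

# prints the BLAS/LAPACK libraries numpy was built against
# (e.g. OpenBLAS, MKL, ATLAS)
np.__config__.show()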

So if BLAS is able to occupy all your cores (easier with big dimensions), you are only seeing the overhead.
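
A minimal way to see BLAS threading in action (a sketch, assuming an OpenMP-based BLAS such as OpenBLAS or MKL; depending on the build, OPENBLAS_NUM_THREADS or MKL_NUM_THREADS may be the relevant variable instead, and it must be set before numpy is imported):

# bench.py -- run once as:  OMP_NUM_THREADS=1 python bench.py
#             and once as:  python bench.py   (BLAS default: all cores)
import time
import numpy as np

a = np.random.normal(size=(2000, 2000))
b = np.random.normal(size=(2000, 2000))

start = time.time()
for _ in range(10):
    a.dot(b)  # matrix-matrix product, dispatched to BLAS
print("10 matmuls: %.2f s" % (time.time() - start))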

And yes, tensorflow at its core will be implemented in at least one of C/C++/Fortran, plus BLAS for its CPU backend and some CUDA libs when targeting the GPU. This also means that the core algorithms, such as gradient and optimization calculations, should never need external parallelization (in 99.9% of all use-cases).

sascha
  • And to answer the second question, it is not advisable to use multiprocessing or multithreading with numpy or tensorflow unless you are doing some heavy I/O bound work. Tensorflow actually supports this contingency via queue runners. As for the actual calculations, both tensorflow and numpy are already capable of spreading the load to every core you have. – Mad Wombat Sep 06 '17 at 20:49
  • As far as I know tensorflow uses Eigen, which by default uses its own (highly optimized) low-level routines. Of course, this does not change the gist of sascha's answer. – dseuss Sep 07 '17 at 12:22
  • @dseuss Interesting. It seems you are right, although there is support for all those BLAS libs, and newer benchmarks seem to show that these are faster (which is expected), despite Eigen's official FAQ (which has outdated benchmarks and is somewhat limited to single-core). – sascha Sep 07 '17 at 12:29
  • @sascha I don't know how accurate this is but I think that pure-Eigen is really good for small matrix operations as needed for e.g. 3D graphics. For the instances we encounter in tensorflow BLAS is certainly superior, especially on multicore architectures. This was probably a big motivation to enable linking to BLAS from Eigen. – dseuss Sep 07 '17 at 15:39
  • The single-thread function uses only one thread; this can clearly be seen in the System Monitor in Ubuntu. The multiprocess function uses all my CPU threads, but at an average of 30 percent, and the time needed to finish is longer by about 6 percent. – prometeu Sep 08 '17 at 16:41