
Linux offers great parallelization tools such as pthreads and OpenMP, which make it relatively easy to parallelize calculations. Even though I am no expert in assembly-level optimization, it seems fairly trivial to implement element-wise operations on large arrays in parallel. The same goes for matrix multiplication, solving linear systems, and so on.

However, if I install numpy from apt-get (sudo apt-get install python3 python3-dev python3-numpy ipython3), the element-wise operations are performed on just one core. This is unfortunate, since Python itself does not offer thread-level parallelism due to the GIL. Having parallelization at the lower BLAS level would allow me to utilize all cores and get a significant speedup (6-8x for my i7 with 8 threads), hence less waiting and more experimenting while still working in a comfortable language (Python).
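As a first check, numpy can report which BLAS/LAPACK it was built against; the output naturally depends on the installation:

import numpy

# Print the BLAS/LAPACK build configuration of this numpy installation
# (reference BLAS, ATLAS, OpenBLAS, MKL, ...).
numpy.show_config()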

For example, the simple element-wise x ** x in

y = (lambda x: x ** x)(numpy.arange(1e8))

could be run in parallel, yet it runs on just one core with my default Ubuntu apt-get numpy installation.
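As an aside, the `numexpr` module (also brought up in the comments below) does evaluate element-wise expressions with multiple threads; a minimal sketch of what that would look like here:

import numpy
import numexpr

x = numpy.arange(1e8)

# numexpr compiles the expression string and evaluates it chunk-wise
# across several worker threads (see numexpr.set_num_threads()).
y = numexpr.evaluate('x ** x')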

The same goes for Theano. I think Theano also uses BLAS under the hood, since one of its dependencies is numpy (which in turn uses BLAS). Whenever I train a Keras or Theano neural network, just one core is being used.
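Theano apparently does have an `openmp` config flag for the C code it generates, so something along these lines might enable multi-threaded element-wise ops (`train.py` standing in for an arbitrary training script):

OMP_NUM_THREADS=4 THEANO_FLAGS=openmp=True python3 train.py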

To the best of my understanding, BLAS is just an interface specification, not an actual implementation. Nevertheless, Ubuntu has both a libblas3 package (the reference implementation) and a libatlas3-base package (ATLAS), which seem to be different implementations of that interface.
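On Debian/Ubuntu these implementations can coexist because the BLAS shared library is managed through the alternatives system, so switching to, say, OpenBLAS should not require rebuilding numpy (package and path names as on Ubuntu of that era):

sudo apt-get install libopenblas-base
sudo update-alternatives --config libblas.so.3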

Furthermore, compiling ATLAS from source was not a big success: the configure script kept complaining about CPU throttling. This seems inherent to my CPU, which always runs about 100 MHz below its maximum frequency, has a 2.4 GHz base clock, and can boost to 3.4 GHz for half a minute or so; I think this is called thermal throttling.
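From what I gathered, ATLAS complains because its timing-based auto-tuning becomes unreliable when the clock frequency varies; the commonly suggested workaround is to pin the frequency governor while building (assuming the cpufrequtils tools are installed):

# pin all 8 hardware threads to the performance governor during the build
for c in $(seq 0 7); do sudo cpufreq-set -c $c -g performance; done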

What are my options regarding multi-core usage of numpy operations on matrices and the like? Do they come out of the box? If not, why not? Why is it so hard to provide a parallel linear algebra engine for numpy? With 4 or 8 CPU threads available, I would expect that assembly-level single-thread optimizations can't beat that.

EDIT

I also tried the following commands:

OMP_NUM_THREADS=8 LD_PRELOAD=/usr/lib/atlas-base/libatlas.so python3 -c 'import numpy; x=numpy.arange(1e8); y=x**x'
OMP_NUM_THREADS=8 LD_PRELOAD=/usr/lib/libopenblas.so python3 -c 'import numpy; x=numpy.arange(1e8); y=x**x'

These also use just one core.
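Since x ** x is a ufunc and, as far as I can tell, never calls into BLAS, a fairer test of whether the preloaded library threads at all would be a BLAS-backed operation such as a large matrix product:

import numpy

# dot IS delegated to BLAS; a threaded build (OpenBLAS, threaded ATLAS)
# should load several cores here, unlike the element-wise x ** x above
a = numpy.random.rand(4000, 4000)
b = numpy.random.rand(4000, 4000)
c = a.dot(b)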

  • Elementwise operations (["ufuncs"](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)) are implemented by Numpy C code and are NOT parallelized. Only some linear algebra operations are delegated to a BLAS library and may run in parallel, depending on the library used. Check again with `numpy.dot` and large 2D arrays and see if threading becomes enabled. –  Apr 15 '16 at 14:31
  • why would elementwise operations not be parallelized? This seems like an easy-to-implement and a nice-to-have, especially since parallelization in python is such a pain. – Herbert Apr 15 '16 at 14:41
  • I think the most important reason is that Numpy has to loop over the whole array(s) every time, for every single operation. This makes Numpy code very quickly memory-bandwidth limited, rather than CPU limited, so the potential gains from multi-threading are limited. But maybe you also underestimate how difficult it would be to implement. Sure, for 1D arrays it certainly looks doable. But the problem is that even 1D operations pass through the general ufunc machinery, which is designed to handle a variable number of multi-dim arguments and also takes care of buffering and broadcasting. –  Apr 15 '16 at 14:57
  • Note that in some cases (e.g. with heavy use of trigonometric functions) it can pay off to do manual threading with Python, because Numpy ufuncs generally release the GIL just before the internal array iterator is started. For other cases there is the `numexpr` module. –  Apr 15 '16 at 15:02
  • @morningsun I get it, and also a lot of operations occur on a subset of the array, hence one also needs to take care of strides, offsets, etc. Nevertheless, low-level parallelization would be a godsend, as it alleviates the need for high-level multiprocessing parallelization, which in general is hard to make memory-efficient. The ugly truth, however, is that after many numpy experiments I have never seen more than one of my eight threads (4 cores) being used. Before I moved to numpy, I ran many experiments in Matlab which I could easily parallelize (I admit, at a higher level) using `parfor`. – Herbert Apr 15 '16 at 15:21
  • Yeah I have to agree with you on that one. Matlab does have its strong points. –  Apr 15 '16 at 15:59
  • @Herbert whatever works for you with `parfor`, should be similarly doable with `multiprocessing.Pool.map()`. I seem to think that MATLAB is just as single-core as python unless you explicitly do something about it (such as `parfor`). – Andras Deak -- Слава Україні Apr 15 '16 at 23:43
  • @AndrasDeak I know about python's `multiprocessing`, and I also noticed that it is supposed to do copy-on-write; nevertheless it uses approximately n_jobs+1 times as much memory as I would expect it to, which is magically solved when I introduce boilerplate code to wrap a numpy array around a shared `Array` buffer. Moreover, I've often noticed `Pool` failing to clean up its workers. My point: `numpy` would be extremely powerful if it could use all my cores/threads, as it would give scalable code at no coding expense. – Herbert Apr 16 '16 at 07:35
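For reference, the boilerplate for wrapping a numpy array around a shared `Array` buffer mentioned in the last comment typically looks something like this minimal sketch (locking and worker setup details vary):

import multiprocessing
import numpy

n = 10 ** 7
# allocate shared memory; lock=False since workers write disjoint slices
buf = multiprocessing.Array('d', n, lock=False)
# view the shared buffer as a numpy array, without copying
x = numpy.frombuffer(buf, dtype=numpy.float64)
x[:] = numpy.arange(n)
# worker processes forked after this point share the same buffer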

0 Answers