
I have a task where I need to run the same function on many different pandas dataframes. I load all the dataframes into a list, then pass the list to Pool.map from the multiprocessing module. The function itself has been vectorized as much as possible; it contains a few if/else clauses and no matrix operations.
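The structure is roughly this (process_frame and the toy data below are placeholders for my real code):

    from multiprocessing import Pool

    import pandas as pd

    def process_frame(df):
        # stand-in for the real vectorized per-frame computation
        return df.assign(y=df["x"] * 2)

    if __name__ == "__main__":
        frames = [pd.DataFrame({"x": range(1000)}) for _ in range(100)]
        with Pool(10) as pool:  # one worker per physical core
            results = pool.map(process_frame, frames)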

I'm currently using a 10-core Xeon and would like to speed things up, ideally moving from Pool(10) to Pool(xxx). I see two possibilities:

  • GPU processing. From what I have read, though, I'm not sure it can do what I need, and it would in any case require lots of code modification.

  • Xeon Phi. I know it's being discontinued, but supposedly code adaptation is easier, and if that's really the case I'd happily get one.

Which path should I concentrate on? Any other alternatives?

Software: Ubuntu 18.04, Python 3.7. Hardware: X99 chipset, 10-core Xeon (no HT)

  • It really depends on the code. BLAS calls are usually CPU bound, and there is often no alternative to faster hardware (except changing the BLAS backend). NumPy code that doesn't depend too much on BLAS calls can often be made faster using Numba/Fortran/C, but as said, the effort and expected speedup really depend on your problem/code... – max9111 Apr 09 '19 at 08:27

2 Answers


Took a while, but after converting everything to numpy and achieving a little more vectorization, I managed to get a speed increase of over 20x - so thanks Paul. Thanks too, max9111; I'll have a look into numba.
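For anyone curious, the kind of branchy element-wise logic Numba is suggested for looks roughly like this (the function below is a made-up stand-in, not my actual code):

    import numpy as np
    from numba import njit

    @njit
    def branchy_transform(x):
        # per-element if/else that is awkward to fully vectorize in NumPy
        # but compiles to a tight machine-code loop under Numba
        out = np.empty_like(x)
        for i in range(x.size):
            if x[i] > 0.0:
                out[i] = x[i] * 2.0
            else:
                out[i] = x[i] / 2.0
        return out

    values = np.random.randn(1_000_000)
    result = branchy_transform(values)  # first call compiles; later calls are fast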


You can look at the new Intel 2066 platform or a newer Xeon. With the newest AVX-512 instructions they accelerated numpy processing a lot (numpy is the base of pandas). Check: https://software.intel.com/en-us/articles/the-inside-scoop-on-how-we-accelerated-numpy-umath-functions

First of all, try to switch to numpy-based calculations (even a simple .values on the series helps); it can improve the processing speed up to 10x.
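A minimal illustration of the idea (the column name is a placeholder):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": np.random.rand(1_000_000)})

    # Series arithmetic goes through pandas' index-alignment machinery:
    slow = df["price"] * 1.21

    # working on the underlying ndarray via .values skips that overhead:
    fast = df["price"].values * 1.21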

You can also try a dual-CPU motherboard to get more parallelism for the calculation.

In most situations the bottleneck is not the processing of the data but IO: reading from the drive into memory. This will be a problem with a GPU too.
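A quick way to check which phase dominates is to time them separately (the file name and the computation below are placeholders):

    import time

    import pandas as pd

    t0 = time.perf_counter()
    df = pd.read_csv("data.csv")  # IO phase: disk -> memory
    t1 = time.perf_counter()
    result = df.sum()             # compute phase, stand-in for the real function
    t2 = time.perf_counter()
    print(f"IO: {t1 - t0:.2f}s, compute: {t2 - t1:.2f}s")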

Pavel Kovtun
  • Thanks for the quick answer, Paul. I really want to avoid a full system upgrade, so I'm stuck with a single 2011-3 Xeon. I see, however, that the Xeon Phis support AVX-512, so that looks more interesting as a hardware addition. Regarding IO: I load/write all the data from the drive before and after the processing (it's speedy enough), so this shouldn't be the limiting factor. Could be mistaken though. – alittlebluebug Apr 08 '19 at 15:49
  • With additional "external" hardware, I think you will be stuck with an IO bottleneck, where the system is forced to move data from main memory into the internal memory of the compute unit (this is the problem with GPUs). I don't know the characteristics of your data, though. Another option: go for AWS instances with huge CPU power (one instance per table) that are engaged only for a short time (less than a minute). – Pavel Kovtun Apr 08 '19 at 15:54
  • Ah, I see. I assume RAM<>CPU communication is much faster than, say, RAM<>GPU (if I'm picturing things properly). It's time-series data with a timestamp index which I use heavily, so converting to numpy is a bit tricky, but I'll give it a go and see how far I get. I'll check out the Intel Python dist. as well. Thanks, useful stuff. – alittlebluebug Apr 09 '19 at 15:48