
I have a task where I need to run the same function on many different pandas dataframes. I load all the dataframes into a list, then pass the list to Pool.map from the multiprocessing module. The function itself has been vectorized as much as possible; it contains a few if/else clauses and no matrix operations.
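The structure is roughly this (process_frame and the toy data below are placeholders for my real code):

    from multiprocessing import Pool

    import pandas as pd

    def process_frame(df):
        # stand-in for the real vectorized per-frame computation
        return df.assign(y=df["x"] * 2)

    if __name__ == "__main__":
        frames = [pd.DataFrame({"x": range(1000)}) for _ in range(100)]
        with Pool(10) as pool:  # one worker per physical core
            results = pool.map(process_frame, frames)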

I'm currently using a 10-core Xeon and would like to speed things up, ideally moving from Pool(10) to Pool(xxx). I see two possibilities:

  • GPU processing. From what I have read, though, I'm not sure it can do what I need, and it would in any case require lots of code modification.

  • Xeon Phi. I know it's being discontinued, but supposedly code adaptation is easier, and if that's really the case I'd happily get one.

Which path should I concentrate on? Any other alternatives?

Software: Ubuntu 18.04, Python 3.7. Hardware: X99 chipset, 10-core Xeon (no HT)

  • It really depends on the code. BLAS calls are usually CPU bound, and there is often no alternative to faster hardware (except changing the BLAS backend). NumPy code that doesn't depend too much on BLAS calls can often be made faster using Numba/Fortran/C, but as said, the effort and expected speedup really depend on your problem/code... – max9111 Apr 09 '19 at 08:27

2 Answers


Took a while, but after converting everything to numpy and achieving a little more vectorization, I managed to get a speed increase of over 20x - so thanks Paul. Thanks too, max9111; I'll have a look into numba.
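For anyone curious, the kind of branchy element-wise logic Numba is suggested for looks roughly like this (the function below is a made-up stand-in, not my actual code):

    import numpy as np
    from numba import njit

    @njit
    def branchy_transform(x):
        # per-element if/else that is awkward to fully vectorize in NumPy
        # but compiles to a tight machine-code loop under Numba
        out = np.empty_like(x)
        for i in range(x.size):
            if x[i] > 0.0:
                out[i] = x[i] * 2.0
            else:
                out[i] = x[i] / 2.0
        return out

    values = np.random.randn(1_000_000)
    result = branchy_transform(values)  # first call compiles; later calls are fast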


You can look at the new Intel 2066 platform or a newer Xeon. With the newest AVX-512 instructions they accelerated numpy processing a lot (numpy is the base of pandas). Check: https://software.intel.com/en-us/articles/the-inside-scoop-on-how-we-accelerated-numpy-umath-functions

First of all, try to switch to numpy-based calculations (even a simple .values on the series helps); it can improve the processing speed up to 10x.
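A minimal illustration of the idea (the column name is a placeholder):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": np.random.rand(1_000_000)})

    # Series arithmetic goes through pandas' index-alignment machinery:
    slow = df["price"] * 1.21

    # working on the underlying ndarray via .values skips that overhead:
    fast = df["price"].values * 1.21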

You can also try a dual-CPU motherboard to get more parallelism for the calculation.

In most situations the bottleneck is not the processing of the data but IO: reading from the drive into memory. This will be a problem with a GPU too.
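A quick way to check which phase dominates is to time them separately (the file name and the computation below are placeholders):

    import time

    import pandas as pd

    t0 = time.perf_counter()
    df = pd.read_csv("data.csv")  # IO phase: disk -> memory
    t1 = time.perf_counter()
    result = df.sum()             # compute phase, stand-in for the real function
    t2 = time.perf_counter()
    print(f"IO: {t1 - t0:.2f}s, compute: {t2 - t1:.2f}s")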

Pavel Kovtun
  • Thanks for the quick answer, Paul. I really want to avoid a full system upgrade, so I'm stuck with a single 2011-3 Xeon. I see, however, that the Xeon Phis support AVX-512, so that looks more interesting as a hardware addition. Regarding IO: I load/write all the data from the drive before and after the processing (it's speedy enough), so this shouldn't be the limiting factor. Could be mistaken though. – alittlebluebug Apr 08 '19 at 15:49
  • With additional "external" hardware, I think you will be stuck with an IO bottleneck, where the system is forced to move data from main memory into the internal memory of the compute unit (this is the problem with GPUs). I don't know the characteristics of your data, though. Another option: go for AWS instances with huge CPU power (one instance per table) that are engaged only for a short time (less than a minute). – Pavel Kovtun Apr 08 '19 at 15:54
  • Ah, I see. I assume RAM<>CPU communication is much faster than, say, RAM<>GPU (if I'm picturing things properly). It's time-series data with a timestamp index which I use heavily, so converting to numpy is a bit tricky, but I'll give it a go and see how far I get. I'll check out the Intel Python dist. as well. Thanks, useful stuff. – alittlebluebug Apr 09 '19 at 15:48