Should I run a single parallelized C script from Python or a parallel set of serial C scripts?

Question

Overview

I am rewriting and optimizing a set of MATLAB scripts for a computationally intensive scientific computing application as a set of Python scripts with the kernels written in C (run using ctypes). The Python wrapping is necessary for ease of end-user application and configuration. My current hardware is a 12-core Ryzen 7 3700X with 64 GB RAM, but the code is also intended to run on much larger, lower-clocked clusters.

Input/Output

The section of code this question concerns is highly parallelizable. The input is going to be something like 100-200 sets (serially ordered in working memory) of a few million uniformly organized floats (you can imagine them as 100-200 fairly high-resolution B/W images, all with the same proportions and similar structure). Each such "set" can be processed independently and uninterrupted for the bulk of the process. Many computationally (and possibly memory) intensive calculations are performed on these; some are suitable for implementation using BLAS, but there is also more complex upsampling, interpolation, filtering, back-projection and projection, and so forth. In the MATLAB implementation I am using as a basis, this is implemented through a parfor loop that calls a few subroutines and uses some MEX functions written in C. The output of each iteration is, again, a few million floats. If I recall correctly (running a realistic trial of the scripts is very messy at the moment, which is another thing I'm tasked with fixing, so I can't easily check), the computations are intensive enough that each iteration of the loop can be expected to take a few minutes.

The Conundrum

My most likely course of action will be to turn this entire section of the code into a big "kernel" written in C. I have subfunctions of it written in mixed C/Python already, and those already have way too much Python overhead compared to the time the actual computations need - so I want to replace all of that, and the remainder of all this easily parallelized code, with C. Thus, I have two methods I can use to parallelize the code:

1. I have Python create subprocesses, each of which triggers serial C code separately with its section of the data.
2. I have Python start a single C process to which I hand all the data, and the C code uses OpenMP to spawn threads that parallelize the work.
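
A minimal sketch of the first method, to make the comparison concrete. It assumes a hypothetical `process_set` placeholder standing in for the serial C kernel (in practice this would be a ctypes call into the `.so` file); `run_parallel`, `n_workers` and the toy data are likewise made up for illustration.

```python
from multiprocessing import Pool


def process_set(data):
    """Hypothetical stand-in for the serial C kernel.

    In the real code this would be a ctypes call into the .so file;
    here it just doubles every value so the sketch runs on its own.
    """
    return [2.0 * x for x in data]


def run_parallel(sets, n_workers=4):
    """First method: a pool of worker processes, each running the serial kernel."""
    # Each input set is pickled and sent to a worker, and each result is
    # pickled and sent back to the parent process.
    with Pool(processes=n_workers) as pool:
        return pool.map(process_set, sets)


if __name__ == "__main__":
    # Tiny made-up input: 8 "sets" of 5 floats each.
    sets = [[float(i + j) for j in range(5)] for i in range(8)]
    results = run_parallel(sets)
    print(len(results), results[0])
```

With this layout the data for every set crosses a process boundary twice. For a few million floats per set that is probably small next to a few minutes of compute per iteration, but I have not measured it. In the second method the equivalent Python side would be a single ctypes call, with the loop over sets living in C.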

I'm familiar with both Python and C multiprocessing, but I have never tried to run multiprocessing C scripts through Python. My question is then, which of these is preferable from a performance standpoint, and are there any aspects I should be considering which I haven't considered here?

Thank you in advance!

I don't know much about Python, but in pure C you could produce a library or DLL in C and then link that one to the main executable. Should be somewhat faster than launching a process in run-time. At any rate, the launch overhead is only executed once so maybe not a big deal? — Lundin, Jun 16 '20 at 10:22
That is how it is implemented at present - I use the CDLL function of Python Ctypes to load a .so file as a library and then call functions from it in the Python code. Much of the overhead is actually from conversions to and from C data formats in Python, which I suspect I could handle in a more efficient manner - but it's just easier to handle it with more C. So the way I imagine it is that I would have a C master function called, say, "ForkingProcess" which forks off all the processes and has them call a function that calls all the necessary kernel functions — LC Nielsen, Jun 16 '20 at 10:32
I don't use Matlab and am not too well qualified to offer advice, but would like to make the observation that, if you are considering moving between a single, powerful machine and a cluster, Redis can function extremely effectively as a lightning-fast, in-memory *"data structure"* server, sharing Numpy arrays, images, lists, sets, queues between multiple machines across network and being accessible/interoperable from C/C++, Python, and bash commandline. Just a thought, may not be advisable/relevant to you. — Mark Setchell, Jun 16 '20 at 10:33
So maybe do this the other way around: make the "core engine" in C and let C call upon Python when you need to interface with whatever? MATLAB should be able to generate C just fine, yeah? — Lundin, Jun 16 '20 at 10:35
@MarkSetchell Thanks, that could definitely have applications for my project even if not directly related to this problem! — LC Nielsen, Jun 16 '20 at 10:36
@Lundin An interesting thought, but I don't think it will work for this application. The top levels of the application will have to be editable and adjustable by end users with no familiarity with C, and I have to make the structure of it reasonably easy to follow for people who are perhaps only familiar with (and not necessarily very proficient in) Python and MATLAB. If it had only been for my own use, I would probably have done what you suggest and only used Python for things like making plots. — LC Nielsen, Jun 16 '20 at 10:42
Sounds like pretty classic front end vs back end to me... what's the actual bottleneck though? Is it truly the parameter passing/calling convention between languages, or is it GUI/MATLAB fluff? — Lundin, Jun 16 '20 at 10:57
@Lundin There's no GUI. It's hard to say because I can't easily run a true test at this point. But using trial data my profilers suggest it's various fluff and parameter handling (esp parameter typing), yes. In the old MATLAB scripts it seems like MATLAB's Mex C files do not optimize very well. I might be able to just use the Ctypes types from the beginning... but in any event, I'm a scientist, not a software developer, so my priorities are probably a bit unusual from a software development perspective. — LC Nielsen, Jun 16 '20 at 13:08
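
On the conversion overhead raised in the comments above, a minimal sketch of "using the ctypes types from the beginning", using only the standard library. The data values are made up, and no real kernel is called; the point is only that `from_buffer` wraps existing memory rather than copying it.

```python
import ctypes
from array import array

# Made-up data: one "set" of single-precision floats.
values = array("f", [0.5, 1.5, 2.5, 3.5])

# from_buffer() wraps the existing memory instead of copying it, so the
# ctypes array and the Python array share the same storage.
c_view = (ctypes.c_float * len(values)).from_buffer(values)

# A C function whose argtypes start with POINTER(c_float) could be handed
# c_view directly (assumption: the kernel takes a float pointer and a length).
# Writing through the view shows that no copy was made.
c_view[0] = 10.0
print(values[0])
```

While `c_view` is alive the underlying `array` cannot be resized, which seems acceptable here since each set has a fixed size.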
