Overview
I am working on re-writing and optimizing a set of MATLAB scripts for a computationally intensive scientific computing application into a set of Python scripts with the kernels written in C (run using ctypes). The Python wrapping is necessary for ease of end-user application and configuration. My current hardware is a 12-core Ryzen 7 3700X with 64 GB RAM, but it is also intended to be suitable for running on much larger and lower-clocked clusters.
Input/Output
The section of code this question concerns is highly parallelizable. The input is going to be something like 100-200 sets (serially ordered in working memory) of a few million uniformly organized floats (you can imagine them as 100-200 fairly high-resolution B/W images, all with the same proportions and similar structure). Each such "set" can be processed independently and uninterrupted, for the bulk of the process. There are many computationally (and possibly memory) intensive calculations performed on these - some of it suitable for implementation using BLAS but also more complex upsampling, interpolation, filtering, back-projection and projection, and so forth. the MATLAB implementation I am using as a basis, it is implemented through a Parfor loop calling on a few subroutines and using some MEX functions written in C. The output of each iteration is going to be, again, a few million floats. If I recall correctly (running a realistic trial of the scripts is very messy at the moment - another thing I'm tasked with fixing - so I can't easily check), the computations can be intensive enough that each iteration of the loop can be expected to take a few minutes.
The Conundrum
My most likely course of action will be to turn this entire section of the code into a big "kernel" written in C. I have subfunctions of it written in mixed C/Python already, and those already have way too much Python overhead compared to the time the actual computations need - so I want to replace all of that, and the remainder of all this easily parallelized code, with C. Thus, I have two methods I can use to parallelize the code:
- I have Python create subprocesses, each of which triggers serial C code separately with its section of the data.
- I have Python start a single C process to which I hand all the data, having the C process use OpenMP to create subprocesses to parallelize the code.
I'm familiar with both Python and C multiprocessing, but I have never tried to run multiprocessing C scripts through Python. My question is then, which of these is preferable from a performance standpoint, and are there any aspects I should be considering which I haven't considered here?
Thank you in advance!