I have a computationally heavy project that appears to be a good candidate for parallelization. The code uses a large number of GIS contours and takes up roughly 1.5 GB of memory as a single process. There are two levels of logic that could be parallelized: an outer level that splits the project area into smaller (but still rather large) areas, and an inner loop that does a lot of math with pretty short segments.
Attempts to use concurrent.futures on the outer loop failed due to pickle errors. Pathos ran and created multiple processes, but it used a huge amount of memory and was actually slower. I'm assuming the slowdown was due to dill serializing and recreating very large objects.
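My outer-loop attempt looked roughly like the sketch below (simplified, with made-up names like `ProjectArea` and `process_subarea`): the worker is a method on the big object holding all the contours, so submitting it forces pickle to drag the entire object along (and bound methods don't pickle at all on 2.7), which I assume is the source of both the pickle failures and the memory blow-up with dill.

    import concurrent.futures

    class ProjectArea(object):
        def __init__(self, contours):
            self.contours = contours          # large GIS dataset, ~1.5 GB total

        def process_subarea(self, bounds):
            # heavy shapely/GEOS work on the contours inside `bounds`
            pass

    def run(project, subarea_bounds):
        with concurrent.futures.ProcessPoolExecutor() as pool:
            # pickling project.process_subarea pulls in the whole ProjectArea,
            # so this either raises a PicklingError or copies a huge object per task
            futures = [pool.submit(project.process_subarea, b) for b in subarea_bounds]
            return [f.result() for f in futures]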
I haven't attempted to parallelize the inner loop yet, but I should be able to break the code and contours into relatively small objects (~10 KB), not counting external modules (shapely). While there are a plethora of parallel processing options for Python, I haven't found a good discussion of the best way to handle objects and manage memory. What is a good package/method to efficiently split off a small object into a new process from a much larger process?
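For the inner loop, what I have in mind is something like the following sketch (hypothetical names; `process_segment` stands in for the real math), where each task only carries a small, self-contained chunk of geometry rather than the whole project:

    from multiprocessing import Pool
    from shapely import wkb

    def process_segment(segment_wkb):
        # rebuild the small geometry in the worker and do the heavy math on it
        segment = wkb.loads(segment_wkb)
        return segment.buffer(1.0).area   # placeholder for the real computation

    def run_inner_loop(segments):
        # serialize each small piece explicitly (WKB), so the per-task payload stays tiny
        payloads = [seg.wkb for seg in segments]
        pool = Pool()
        try:
            results = pool.map(process_segment, payloads, chunksize=100)
        finally:
            pool.close()
            pool.join()
        return results

I don't know whether plain multiprocessing like this is the right tool, or whether something else handles the fork/serialization overhead better for lots of small tasks.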
I think it would be preferable to use both levels of parallelization, though I'm not sure and need to do some profiling. The outer loop can be made slightly more memory efficient, but it may not be realistic to split it into multiple processes. The inner loops should be easy to break into small, memory-efficient pieces; I'm just not sure of the best package to use.
Edit: The vast majority of the processing is within the shapely package, which is a Python front end for GEOS, which is written in C (I think). I'm currently using Python 2.7 but am willing to switch to 3 if it is beneficial.