
I have 100-1000 timeseries paths and a fairly expensive simulation that I'd like to parallelize. However, the library I'm using hangs on rare occasions, and I'd like to make the run robust to those hangs. This is the current setup:

with Pool() as pool:
    res = pool.map_async(simulation_that_occasionally_hangs, paths)
    all_costs = res.get()

I know get() has a timeout parameter, but if I understand correctly that timeout applies to the whole batch of 1000 paths, not to individual simulations. What I'd like is to check whether any single simulation is taking longer than 5 minutes (a normal path takes 4 seconds), and if so just stop that path and continue to get() the rest.
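For reference, here's a minimal standard-library sketch of the per-result behavior I mean: Pool.imap returns an iterator whose next() accepts a timeout, so a slow path can be detected, although the stuck worker is not terminated (square is a hypothetical fast stand-in for the real simulation, and the timeout value is illustrative):

```python
from multiprocessing import Pool, TimeoutError

def square(x):
    # hypothetical fast stand-in for the real simulation
    return x * x

def collect_with_timeout(paths, per_task_timeout=300):
    results = []
    with Pool() as pool:
        it = pool.imap(square, paths)
        while True:
            try:
                # per-result timeout: raises TimeoutError if the next
                # result is not ready within per_task_timeout seconds
                results.append(it.next(timeout=per_task_timeout))
            except TimeoutError:
                results.append(None)  # mark the slow path; worker is NOT killed
            except StopIteration:
                break
    return results
```

The limitation is that the hung worker keeps occupying a pool slot, which is why this alone doesn't solve my problem.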

EDIT:

Testing the timeout in pebble:

from concurrent.futures import TimeoutError
from pebble import ProcessPool

def fibonacci(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return fibonacci(n - 1) + fibonacci(n - 2)


def main():
    with ProcessPool() as pool:
        future = pool.map(fibonacci, range(40), timeout=10)
        iterator = future.result()

        all = []
        while True:
            try:
                all.append(next(iterator))
            except StopIteration:
                break
            except TimeoutError as e:
                print(f'function took longer than {e.args[1]} seconds')

        print(all)

Errors:

RuntimeError: I/O operations still in flight while destroying Overlapped object, the process may crash
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\anaconda3\lib\multiprocessing\spawn.py", line 99, in spawn_main
    new_handle = reduction.steal_handle(parent_pid, pipe_handle)
  File "C:\anaconda3\lib\multiprocessing\reduction.py", line 87, in steal_handle
    _winapi.DUPLICATE_SAME_ACCESS | _winapi.DUPLICATE_CLOSE_SOURCE)
PermissionError: [WinError 5] Access is denied
rhaskett

2 Answers


Probably the easiest way is to run each heavy simulation in a separate subprocess, with the parent process watching it. Specifically:

import multiprocessing

def risky_simulation(path):
    ...

def safe_simulation(path):
    p = multiprocessing.Process(target=risky_simulation, args=(path,))
    p.start()
    p.join(timeout)  # Your timeout here
    if p.is_alive():     # only kill the process if it actually hung
        p.terminate()    # or p.kill() on Python 3.7+
        p.join()
    # Here read and return the output of the simulation.
    # Can be from a file, or using some communication object
    # between processes, from the `multiprocessing` module

with Pool() as pool:
    res = pool.map_async(safe_simulation, paths)
    all_costs = res.get()

Notes:

  1. If the simulation may hang, you want to run it in a separate process (i.e. the Process object should not be a thread), since depending on how it hangs, it may hold the GIL.
  2. This solution only uses the pool for the immediate sub-processes; the computations themselves are off-loaded to new processes. We could also make the computations share a pool, but that would result in uglier code, so I skipped it.
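To make the "read and return the output" part concrete, here is a minimal sketch using a multiprocessing.Queue to pass the result back from the watched subprocess. The body of risky_simulation is a trivial stand-in for the real simulation, and the default timeout is illustrative:

```python
import multiprocessing
import queue

def risky_simulation(path, out):
    # hypothetical stand-in: the real (possibly hanging) simulation goes here
    out.put(path * 2)

def safe_simulation(path, timeout=300):
    out = multiprocessing.Queue()
    p = multiprocessing.Process(target=risky_simulation, args=(path, out))
    p.start()
    p.join(timeout)
    if p.is_alive():      # the simulation hung past the timeout
        p.terminate()
        p.join()
    try:
        return out.get(timeout=1)  # result, if the child produced one
    except queue.Empty:
        return None                # sentinel for a hung or failed path
```

Returning a sentinel such as None keeps the outer pool.map_async call simple: hung paths show up as None in all_costs and can be filtered out afterwards.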
Barak Itkin

The pebble library has been designed to address these kinds of issues. It transparently handles job timeouts and failures such as crashes in C libraries.

You can check the documentation examples to see how to use it. It has a similar interface to concurrent.futures.

noxdafox
  • Looks like the answer, but even with their documentation (using their fibonacci example for instance) I'm having trouble integrating their timeout with pulling all the successful runs into something like `n = list(future.result())`. Any suggestions on how to modify the fibonacci example to get a list of results but without the 'future.cancel()'? – rhaskett Oct 12 '18 at 23:46
  • Use the first example. In there it shows how to pull out all succeeding results while logging the failures. – noxdafox Oct 13 '18 at 09:18
  • I added code based on the first example above. You may have to do more than `range(40)` if your computer is faster. Two issues. `all` only holds the final value and I'm getting a `RuntimeError: I/O operations still in flight while destroying Overlapped object`. – rhaskett Oct 15 '18 at 16:59
  • ok fixed it to get all the values in `all` but I still get the runtime error and a PermissionError as well. I think I'll open another question in SE. – rhaskett Oct 15 '18 at 17:07
  • Without the traceback information there's little which can be done. – noxdafox Oct 15 '18 at 17:11
  • Updated with trace. – rhaskett Oct 15 '18 at 17:18
  • pebble version 4.3.9, Python 3.6.6 from conda – rhaskett Oct 15 '18 at 17:28
  • That looks like some issue between Python multiprocessing and Windows. I assume you are using Windows 10? Maybe open an issue on the project's github. – noxdafox Oct 15 '18 at 18:19