While using pp to parallelize a fairly complex machine learning problem, I'm finding myself relying extensively on third-party libraries of varying quality. One in particular crashes on a fair number of edge cases when used intensively across varying datasets. I will eventually have to solve these crashes, but in the short term it is too much to try to fix both my bugs and theirs - and this library really is the best one available.
My question is: Is there an established pattern to be used to allow for graceful failure of local worker processes in pp?
The options as I see them are:
- Don't use ANY local worker processes; use only REMOTE workers, and then rely on the socket timeout to detect a dead worker.
- Shell all work out to a secondary Python script that I wrap and execute as a separate process, then use the exit code to check for crashes. This would probably have to be combined with a timeout as well, to guard against non-segfault failures such as hangs.
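For what it's worth, the second option can be sketched with nothing but the standard library. This is not part of pp's API, just an illustration of the wrap-and-check-exit-code idea (the `run_isolated` helper and the inline `-c` snippets are hypothetical stand-ins for the real worker script); a segfault in a C extension then kills only the child interpreter, and shows up as a negative return code on POSIX:

```python
import subprocess
import sys

def run_isolated(code, timeout=60):
    """Run `code` in a child Python interpreter.

    Returns (ok, output): ok is False if the child crashed,
    exited non-zero, or exceeded `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Guards the non-segfault failure case: a hung worker.
        return False, "timed out"
    if proc.returncode != 0:
        # On POSIX, death by signal (e.g. SIGSEGV) yields a
        # negative return code; either way, treat it as failure.
        return False, proc.stderr or "exit code %d" % proc.returncode
    return True, proc.stdout

ok, out = run_isolated("print('done')")          # clean run
crashed_ok, _ = run_isolated("import os; os.abort()")  # simulated hard crash
```

The pp job itself would then just call something like `run_isolated` and report the `(ok, output)` pair back, so a crashing library call can never take down the worker.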
Am I missing something here? I've been looking at pp.py and as far as I can tell there is no exit detection on the worker processes.