While using pp to parallelize a fairly complex machine learning problem, I'm finding myself relying extensively on third-party libraries of varying quality. One in particular crashes on a fair number of edge cases when used intensively across varying datasets. I will eventually have to solve these crashes, but in the short term it is too much to try to fix both my bugs and theirs - and this library really is the best one available.
My question is: Is there an established pattern to be used to allow for graceful failure of local worker processes in pp?
The options as I see them are:
- Don't use ANY local worker processes; use only REMOTE workers, and then rely on the socket timeout to detect a dead worker.
- Shell all work out to a secondary Python script that I wrap and execute as a separate process, then use the exit code to check for crashes. This would probably have to be combined with a timeout as well, to guard against non-segfault failures such as hangs.
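For what it's worth, the second option can be sketched with nothing but the standard library. This is not part of pp's API, just an illustration of the wrap-and-check-exit-code idea (the `run_isolated` helper and the inline `-c` snippets are hypothetical stand-ins for the real worker script); a segfault in a C extension then kills only the child interpreter, and shows up as a negative return code on POSIX:

```python
import subprocess
import sys

def run_isolated(code, timeout=60):
    """Run `code` in a child Python interpreter.

    Returns (ok, output): ok is False if the child crashed,
    exited non-zero, or exceeded `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Guards the non-segfault failure case: a hung worker.
        return False, "timed out"
    if proc.returncode != 0:
        # On POSIX, death by signal (e.g. SIGSEGV) yields a
        # negative return code; either way, treat it as failure.
        return False, proc.stderr or "exit code %d" % proc.returncode
    return True, proc.stdout

ok, out = run_isolated("print('done')")          # clean run
crashed_ok, _ = run_isolated("import os; os.abort()")  # simulated hard crash
```

The pp job itself would then just call something like `run_isolated` and report the `(ok, output)` pair back, so a crashing library call can never take down the worker.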
Am I missing something here? I've been looking at pp.py and as far as I can tell there is no exit detection on the worker processes.