
I have a list of elements that I want to modify using multiprocessing. The issue is that for some particular inputs (unobservable prior to attempting), part of my function stalls. I've illustrated this conceptually with the code below, where the function sometimes_stalling_processing() will occasionally stall indefinitely.

To put this into context, I'm processing a bunch of links with a web scraper, and some of these links stall even when using the timeout in the requests module. I've attempted different approaches (e.g. using eventlet), but have come to the conclusion that it's perhaps easier to handle it at the multiprocessing level.

from multiprocessing import Pool

def stable_processing(obs):
    ...
    return processed_obs

def sometimes_stalling_processing(obs):
    ...
    return processed_obs

def extract_info(obs):
    new_obs = stable_processing(obs)
    try:
        new_obs = sometimes_stalling_processing(obs)
    except MyTimedOutError: # error doesn't exist, just here for conceptual purposes
        pass
    return new_obs

pool = Pool(processes=n_threads)
processed_dataset = pool.map(extract_info, dataset)  # blocks forever if any single call stalls
pool.close()
pool.join()

This question (How can I abort a task in a multiprocessing.Pool after a timeout?) seems very similar, but I've been unable to adapt it to work with map instead of apply. I've also tried the eventlet package, but that didn't work. Note that I'm using Python 2.7.
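
For reference, the per-observation variant I've been experimenting with looks roughly like this (the 10-second timeout and the None placeholder are arbitrary choices on my part):

from multiprocessing import Pool, TimeoutError

pool = Pool(processes=n_threads)
# one task per observation, so each result can be waited on separately
async_results = [pool.apply_async(extract_info, (obs,)) for obs in dataset]

processed_dataset = []
for res in async_results:
    try:
        processed_dataset.append(res.get(timeout=10))
    except TimeoutError:
        # get() stops waiting, but the stalled worker keeps running,
        # so the pool still hangs on join() -- this is exactly the problem
        processed_dataset.append(None)

pool.close()
pool.join()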

How do I make pool.map() time out on individual observations and kill sometimes_stalling_processing?


1 Answer


You can take a look at the pebble library.

from pebble import ProcessPool
from concurrent.futures import TimeoutError

def sometimes_stalling_processing(obs):
    ...
    return processed_obs

with ProcessPool() as pool:
    # schedule every element of dataset, allowing each one 10 seconds
    future = pool.map(sometimes_stalling_processing, dataset, timeout=10)

    iterator = future.result()

    # results come back in submission order; an element that exceeds the
    # timeout raises TimeoutError here and its worker process is terminated
    while True:
        try:
            result = next(iterator)
        except StopIteration:
            break
        except TimeoutError as error:
            print("function took longer than %d seconds" % error.args[1])

More examples can be found in the documentation.

  • Looks pretty amazing! Does it also work with Python 2.7? – pir Jun 09 '17 at 15:51
  • Is there no need for pool.close() or pool.join() with `pebble`? – pir Jun 09 '17 at 15:55
  • Yes, it is designed to work on both Python 2 and 3. The context manager calls `close` and `join` for you. Otherwise, you indeed need to close the pool and join it. – noxdafox Jun 09 '17 at 17:31
  • Okay, sounds perfect. However, how do I identify the specific input that made the function time out? It could either be an index into the `dataset` or just the `dataset[index]` itself. – pir Jun 09 '17 at 17:33
  • The results are returned in the same order they are submitted. Therefore, you can simply increment an index at every iteration and use it to address the problematic elements (sketched after this thread). – noxdafox Jun 09 '17 at 17:35
  • Awesome. This works beautifully. Did you make the library? – pir Jun 09 '17 at 17:36
  • Yes, the `pools` were added because there was no implementation supporting problematic tasks such as hanging or crashing (segfaulting) ones. The solution you linked above, for example, would not work if your `sometimes_stalling_processing` function called a lower-level C API. If it hung in a C loop, the other thread would never regain control, and hence would never be able to stop the process execution. – noxdafox Jun 09 '17 at 17:38
  • Makes sense. You definitely deserve a bounty for this great work. You'll get it as soon as Stack Overflow allows me to give it to you. – pir Jun 09 '17 at 17:42
  • I'm experiencing issues with web-scraping code based on the example above not timing out correctly. For instance, I run with 100 threads, but after about an hour I can see that no new links are being scraped; the whole thing has simply stalled. Have you experienced any similar issues? Could it be related to some of the other libraries used for the scraping perhaps not being multi-threading compliant? – pir Aug 09 '17 at 18:00
  • If you are running the scraping code within a Pool of processes (and not threads), I doubt it's capable of interfering with Pebble. Are you sure you are not seeing the timeouts? If you are scraping too fast, the websites might be backpressuring you to avoid DoS. I'd anyway ask you to open an issue on the GitHub project; Stack Overflow is not well suited for this kind of conversation. Please include in the issue a minimal example which reproduces the problem. – noxdafox Aug 11 '17 at 08:53
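
A rough sketch of the index-tracking idea from the comment above (the timeout value and the None placeholder are arbitrary; pebble hands results back in submission order, so each next() call lines up with the corresponding element of dataset):

from pebble import ProcessPool
from concurrent.futures import TimeoutError

with ProcessPool() as pool:
    future = pool.map(sometimes_stalling_processing, dataset, timeout=10)
    iterator = future.result()

    results = []
    for index, obs in enumerate(dataset):
        try:
            results.append(next(iterator))
        except StopIteration:
            break
        except TimeoutError:
            # the element at this index is the one that stalled
            print("observation %d timed out: %r" % (index, obs))
            results.append(None)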