
I've got some Python code that farms out expensive jobs using ThreadPoolExecutor, and I'd like to keep track of which of them have completed so that if I have to restart this system, I don't have to redo the stuff that already finished. In a single-threaded context, I could just mark what I've done in a shelf. Here's a naive port of that idea to a multithreaded environment:

from concurrent.futures import ThreadPoolExecutor
import subprocess
import shelve


def do_thing(done, x):
    # Don't let the command run in the background; we want to be able to tell when it's done
    _ = subprocess.check_output(["some_expensive_command", x])
    done[x] = True


futs = []
with shelve.open("done") as done:
    with ThreadPoolExecutor(max_workers=18) as executor:
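        # things_to_do is assumed to be defined elsewhere as an iterable of job arguments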
        for x in things_to_do:
            if done.get(x, False):
                continue
            futs.append(executor.submit(do_thing, done, x))
            # Can't run `done[x] = True` here--have to wait until do_thing finishes
        for future in futs:
            future.result()

    # Don't want to wait until here to mark stuff done, as the whole system might be killed at some point
    # before we get through all of things_to_do

Can I get away with this? The documentation for shelve doesn't contain any guarantees about thread safety, so I'm thinking no.

So what is the simple way to handle this? I thought that perhaps sticking done[x] = True in future.add_done_callback would do it, but that callback will often run in the same thread as the future itself. Perhaps there is a locking mechanism that plays nicely with ThreadPoolExecutor? That seems cleaner to me than writing a loop that sleeps and then checks for completed futures.
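For what it's worth, concurrent.futures.as_completed looks like it would let the main thread do the marking as each future finishes, without a sleep-and-check loop. Here's a sketch of that shape, assuming a variant of do_thing that doesn't touch the shelf (things_to_do here is a placeholder for my real job list):

from concurrent.futures import ThreadPoolExecutor, as_completed
import shelve
import subprocess


def do_thing(x):
    # Variant of the worker above that doesn't write to the shelf;
    # all shelf access stays in the main thread.
    _ = subprocess.check_output(["some_expensive_command", x])


things_to_do = ["a", "b", "c"]  # placeholder for the real job list

with shelve.open("done") as done:
    with ThreadPoolExecutor(max_workers=18) as executor:
        # Map each future back to its job argument so we know what to mark.
        futs = {executor.submit(do_thing, x): x
                for x in things_to_do if not done.get(x, False)}
        for future in as_completed(futs):
            future.result()            # re-raises any exception from the worker
            done[futs[future]] = True  # mark completion in the main thread only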

kuzzooroo

1 Answer


While you're still inside the outermost with context manager, the done shelf behaves like a normal Python object: its contents aren't guaranteed to be flushed to disk until the context manager closes and runs its __exit__ method. It is therefore just as thread-safe as any other Python object, thanks to the GIL (as long as you're using CPython).

Specifically, the item assignment done[x] = True is thread-safe: it is performed atomically. Note that this guarantee covers the single operation only; it doesn't make every use of the object safe (you couldn't, say, iterate over it in one thread while another thread modifies it).

It's important to note that while the shelf's __exit__ method will run after a Ctrl-C (KeyboardInterrupt), it won't run if the Python process is killed abruptly, and in that case the shelf won't be saved to disk.
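If you do stay with shelve, flushing after each write may narrow that window. Shelf.sync() exists for this, though how much it actually guarantees depends on the underlying dbm module; a rough sketch (the lock name is mine):

import subprocess
import threading

write_lock = threading.Lock()


def do_thing(done, x):
    _ = subprocess.check_output(["some_expensive_command", x])
    with write_lock:  # serialize shelf writes across worker threads
        done[x] = True
        done.sync()   # ask the backend to flush; guarantees vary by dbm module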

To protect against this kind of failure, I would suggest using a lightweight, file-based, thread-safe database like sqlite3.
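A rough sketch of what that could look like (the table and helper names here are illustrative, not prescribed by sqlite3):

import sqlite3
import threading

# One shared connection, serialized by a lock; sqlite3 connections are not
# safe for unsynchronized use from multiple threads by default.
db_lock = threading.Lock()
conn = sqlite3.connect("done.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS done (key TEXT PRIMARY KEY)")
conn.commit()


def mark_done(key):
    with db_lock:
        conn.execute("INSERT OR IGNORE INTO done (key) VALUES (?)", (key,))
        conn.commit()  # commit immediately so a crash loses at most the job in flight


def is_done(key):
    with db_lock:
        return conn.execute("SELECT 1 FROM done WHERE key = ?",
                            (key,)).fetchone() is not None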

Julien
  • The context manager's `__exit__` gets called even in the case of, for example, a KeyboardInterrupt exception, so my state does seem to get persisted. As for thread safety, are you saying that all Python objects are thread-safe because of the GIL? – kuzzooroo Sep 30 '16 at 23:18
  • Not all Python objects are thread-safe in their totality, but (at least in CPython) you're using basic atomic operations (assignment/reassignment) which do not involve I/O (the write to disk happens later, in `__exit__`), so you will be safe. – Julien Oct 01 '16 at 00:34
  • I do want to add though that your code is a bit incorrect: you should not be calling `do_thing` but rather passing it as the first argument. Also, you should be storing the return value of `executor.submit` into a list (conventionally called `futs`). Then, within the `ThreadPoolExecutor` context, loop through the list calling each object's `result()` method. This will block the interpreter from continuing until all the tasks are completed. – Julien Oct 01 '16 at 00:40
  • Thank you. I've made the fix of pass `do_thing` as an argument to `executor.submit` instead of calling it. Do I need to call each future's `result()` method if I am using a context manager? The [documentation for `Executor.shutdown()`](https://docs.python.org/dev/library/concurrent.futures.html#concurrent.futures.Executor.shutdown) says, "You can avoid having to call this method explicitly if you use the `with` statement, which will shutdown the `Executor` (waiting as if `Executor.shutdown()` were called with `wait` set to `True`)." – kuzzooroo Oct 02 '16 at 17:28
  • You should be calling each future's `result()` method no matter what, because otherwise it will swallow all exceptions. If there was an exception when running the submitted function, it will raise when you get the result from its corresponding future object. Otherwise, it will fail silently and you'll never know anything went wrong. – Julien Oct 02 '16 at 23:36
  • OK, edited the code so I'm now calling `result` on each future – kuzzooroo Oct 03 '16 at 00:42
  • I think the key point that's in the comments but not yet in the answer is that `done[x] = True` specifically is thread-safe. The answer right now just says it's safe because it's a Python object but I think it's a bit too broad, e.g., you couldn't iterate through a `dict` in one thread and modify its keys in another. – kuzzooroo Oct 03 '16 at 02:41