
I have code like this:

def generator():
    while True:
        x = slow_calculation()  # placeholder for the slow step
        yield x

I would like to move the slow calculation to separate process(es).

I'm working in Python 3.6, so I have concurrent.futures.ProcessPoolExecutor. It's just not obvious how to parallelize a generator using that.

The difference from a regular concurrent scenario using map is that there is nothing to map over here (the generator runs forever), and we don't want all the results at once: we want to queue them up and wait until the queue is not full before computing more results.

I don't have to use concurrent.futures; multiprocessing is fine too. It poses a similar problem: it's not obvious how to drive it from inside a generator.
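To make the goal concrete, here is a rough sketch of the shape I'm after: one worker process feeding a bounded multiprocessing.Queue, with the generator pulling from it (slow_calculation and the seeding are placeholders):

```python
import multiprocessing as mp

def slow_calculation(seed):
    # placeholder for the real slow computation
    return seed * seed

def worker(out_queue):
    seed = 0
    while True:
        out_queue.put(slow_calculation(seed))  # blocks while the queue is full
        seed += 1

def generator(maxsize=4):
    queue = mp.Queue(maxsize=maxsize)  # bounded: the worker waits when it's full
    mp.Process(target=worker, args=(queue,), daemon=True).start()
    while True:
        yield queue.get()
```

The daemon flag makes sure the worker dies with the main process; with a single producer the results come back in order.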

Slight twist: each value returned by the generator is a large numpy array (10 megabytes or so). How do I transfer that without pickling and unpickling? I've seen the docs for multiprocessing.Array, but it's not totally obvious how to transfer a numpy array using it.
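From skimming the docs, I think the idea is to allocate the buffer with multiprocessing.Array and re-wrap it with np.frombuffer on each side, something like the sketch below, but I'm not sure this is the intended way (make_shared_array and fill are my own placeholder names):

```python
import multiprocessing as mp
import numpy as np

def make_shared_array(shape, dtype=np.float64):
    # lock=False gives a raw shared buffer; add locking if writers can overlap
    n = int(np.prod(shape))
    raw = mp.Array('d', n, lock=False)  # 'd' = C double, matches float64
    return raw, np.frombuffer(raw, dtype=dtype).reshape(shape)

def fill(raw, shape):
    # the child re-wraps the same shared buffer; the 10 MB payload is never pickled
    view = np.frombuffer(raw, dtype=np.float64).reshape(shape)
    view[:] = 7.0
```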

Alex I

1 Answer


In this type of situation I usually use the joblib library. It is a parallel computation framework based on multiprocessing, and it supports memmapping precisely for cases where you have to handle large numpy arrays. I believe it is worth checking out.

Maybe joblib's documentation is not explicit enough on this point, since it only shows examples with for loops, but it does indeed work with generators. An example that would achieve what you want is the following:


from joblib import Parallel, delayed

def my_long_running_job(x):
    # do something with x
    return x

# you can customize the number of jobs
results = Parallel(n_jobs=4)(delayed(my_long_running_job)(x) for x in generator())

Edit: I don't know what kind of processing you want to do, but if it releases the GIL you could also consider using threads. That way you won't have the problem of transferring large numpy arrays between processes, and you still benefit from true parallelism.
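For instance, if the job spends its time in NumPy or another C extension that releases the GIL, a minimal sketch with the standard library's ThreadPoolExecutor would be (my_long_running_job is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def my_long_running_job(x):
    # if the real work releases the GIL (NumPy, I/O, C extensions),
    # these calls genuinely run in parallel
    return x * 2

# map preserves input order, so this behaves like the serial version
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(my_long_running_job, range(8)))
```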

Anis
  • Other than saying "Use library X", this does not really seem to address the actual question asked. If you had questions or needed clarification, as noted in your edit, you should've posted a comment. If you provide an explanation on how, exactly, a generator can be used with it as OP needs (the examples I saw in your link were not generators) and OP has no issues adding an external library dependency, *then* I'll remove my -1. – code_dredd Mar 25 '17 at 23:18
  • I disagree; I did more than that. I gave OP two ways of solving his problem, covering the two aspects he mentions: the actual parallelizing, and dealing with large numpy arrays as return values. More information isn't really relevant, since I touched on the only relevant unknown in my answer: whether the GIL is released. I find your judgement a bit harsh. So yeah, I gave him a pointer to a library, but I actually did more than that. – Anis Mar 25 '17 at 23:22
  • The response that you've posted *here* in SO does 3 things: 1) points OP to a non-standard library (don't know whether that is (not) an issue for OP); 2) says nothing about how to use the library with a *generator* or at all (not even a simple example); 3) goes on to talk about *concurrency* w/ threads in a question explicitly about *parallelism*, which is different. This post amounts to a more verbose link-only answer, which is generally frowned upon. So no, I don't think this post actually addresses the question or helps OP solve the problem. How is this different from just posting the URL? – code_dredd Mar 25 '17 at 23:29
  • 1) OP said multiprocessing is OK, and I explicitly say that joblib is only a wrapper around multiprocessing. 2) The example is straightforward from the doc; there is really only one usage and it works with generators, but OK, I can show an example. 3) I suggest to OP that he can achieve parallelism with threads if his job releases the GIL, in which case it would be a solution to the second problem he raises. – Anis Mar 25 '17 at 23:33
  • I could also add an example with memmap, but since it is really dependent on OP's needs, I feel like the right thing to do is to tell him that such a capability exists and point him to the documentation, which will be far better than anything I can possibly do in an SO answer... – Anis Mar 25 '17 at 23:38
  • The GIL lock is an interpreter implementation detail, which was a key reason for creating the `multiprocessing` module, since threads cannot bypass this limitation. (Concurrency != Parallelism) We generally want SO to be a self-contained Q&A page; any answers that just point someone else to some off-site resource that may cease to exist sooner or later are generally considered to *not* be good answers. Again, OP may (not) be able to add library dependencies; the post suggests the valid options are `concurrent` & `multiprocessing` modules, rather than external libs. – code_dredd Mar 25 '17 at 23:40
  • If you need clarification from OP, don't go around guessing in your answer. Just ask OP a question in the comments. – code_dredd Mar 25 '17 at 23:40
  • I understand your reasons. I just felt like my answer provided valuable pointers to OP, with explanations as to why. Since OP seems to be aware of what he is doing, I felt that was sufficient for him. Besides, OP is not the only one concerned: maybe he can't install the lib (he doesn't say so), but some reader may not have that limitation and might be looking for an extremely easy way to do parallelism. And yes, it is parallelism I am talking about, not concurrency. Finally, I also had the impression that SO was not a "do it for me" platform, and in that respect I felt I had said enough. – Anis Mar 25 '17 at 23:48
  • @Anis Thanks for the suggestion! Using joblib is fine, it looks pretty nice from the docs and I like the /dev/shm mechanism. Just one problem... where you have `delayed(my_long_running_job)(x) for x in generator`, actually in my case the long running job is what is *inside* the generator (it runs to produce every x). Basically what you have is "How to use joblib to run slow code to process the output of a generator" and what I need is "How to use something to run slow code to produce the output of a generator" – Alex I Mar 25 '17 at 23:51
  • @Anis On SO not being a "do it for me" platform, It think you're mostly right; this does not look like a college homework assignment to me, which is where I'd think in those terms. Anyway, I think I've also said enough. I think we'll have to agree to disagree. Like I said, if the answer is improved and OP ends up with an answered question, then I'll remove my -1. I'm not one of those people who down-votes and disappears into the ether forever. – code_dredd Mar 25 '17 at 23:54
  • Alright, then you could use parallelization to populate a list to yield from, thanks to joblib. I think that might solve your problem. If you add more detail to your question, hopefully I can refine my answer and definitively choose between multiprocessing + memmap or the threading backend for parallelization – Anis Mar 25 '17 at 23:54
  • @Anis "parallelization to populate a list to yield" - as long as the list is only used as a temporary queue. My code generates around 1GB/second and runs for days, there is no way it can store the entire output - that's kinda the reason for using a generator :) – Alex I Mar 25 '17 at 23:58
  • @ray I agree with the points you raised and on the ways my answer could be improved. I just wasn't convinced that its flaws qualified it for being offtrack :) – Anis Mar 25 '17 at 23:59
  • @ray Ok, then I have plenty of other suggestions :) For instance, you could proceed the way the deep-learning library Keras does, by polling a generator in parallel! Or add some control over the list size to bound the data flow. – Anis Mar 26 '17 at 00:01
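Edit: to sketch the chunked approach discussed in the comments: pull a bounded chunk from the source generator, farm it out, and yield the results in order, so only one chunk is ever buffered. Shown here with the standard library's ProcessPoolExecutor; joblib's Parallel can be substituted the same way (slow_job and parallel_generator are placeholder names):

```python
import itertools
from concurrent.futures import ProcessPoolExecutor

def slow_job(x):
    # placeholder for the slow per-item computation
    return x * x

def parallel_generator(source, chunk_size=4, workers=2):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        while True:
            chunk = list(itertools.islice(source, chunk_size))
            if not chunk:
                return
            # map preserves input order; only chunk_size results are held at once
            yield from pool.map(slow_job, chunk)
```

Since only chunk_size items are in flight at any moment, memory stays bounded even if the source runs for days.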