
I have code like this:

    import multiprocessing
    from itertools import product,imap,ifilter

    def test(it):
        for x in it:
            print x     
        return None


    mp_pool = multiprocessing.Pool(multiprocessing.cpu_count())
    it = imap(lambda x: ifilter(lambda y: x+y > 10, xrange(10)), xrange(10))
    result = mp_pool.map(test, it)

I got error message:

     File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib64/python2.7/multiprocessing/pool.py", line 102, in worker
        task = get()
      File "/usr/lib64/python2.7/multiprocessing/queues.py", line 376, in get
        return recv()
        task = get()
      File "/usr/lib64/python2.7/multiprocessing/queues.py", line 376, in get
    TypeError: ifilter expected 2 arguments, got 0
        return recv()

Can multiprocessing not use a function that takes an iterator argument? Thank you!

javin158
  • [This](https://stackoverflow.com/questions/44498644/multiprocessing-pool-with-an-iterator) thread may be related. – OctaveL Dec 20 '20 at 09:47

1 Answer


Your iterator, `it`, has to produce single values (each value can be "complex", such as a tuple or a list). Right now we have:

>>> it
<itertools.imap object at 0x000000000283DB70>
>>> list(it)
[<itertools.ifilter object at 0x000000000283DC50>, <itertools.ifilter object at 0x000000000283DF98>, <itertools.ifilter object at 0x000000000283DBE0>, <itertools.ifilter object at 0x000000000283DF60>, <itertools.ifilter object at 0x000000000283DB00>, <itertools.ifilter object at 0x000000000283DCC0>, <itertools.ifilter object at 0x000000000283DD30>, <itertools.ifilter object at 0x000000000283DDA0>, <itertools.ifilter object at 0x000000000283DE80>, <itertools.ifilter object at 0x000000000284F080>]

Each iteration of `it` produces another iterator, and that is the cause of your problem: `Pool.map` pickles every item it sends to a worker process, and these `ifilter` objects cannot be rebuilt on the receiving side, which is exactly the `TypeError` in your traceback.
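
You can reproduce the failure without a pool at all. A minimal sketch; the explicit protocol `2` is my assumption of what multiprocessing uses internally, which fits your traceback (the send succeeds, the receive fails):

import pickle
from itertools import ifilter

f = ifilter(lambda y: y > 5, xrange(10))
data = pickle.dumps(f, 2)  # the dump itself can go through...
pickle.loads(data)         # ...but reconstruction fails:
# TypeError: ifilter expected 2 arguments, got 0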

So you have to "iterate your iterators":

import multiprocessing
from itertools import imap, ifilter


def test(t):
    return 't = ' + str(t) # return value rather than printing


if __name__ == '__main__': # required for Windows
    mp_pool = multiprocessing.Pool(multiprocessing.cpu_count())
    it = imap(lambda x: ifilter(lambda y: x+y > 10, xrange(10)), xrange(10))
    for the_iterator in it:
        result = mp_pool.map(test, the_iterator)
        print result
    mp_pool.close() # needed to ensure all processes terminate
    mp_pool.join() # needed to ensure all processes terminate

The results printed, given how you have defined `it`, are:

[]
[]
['t = 9']
['t = 8', 't = 9']
['t = 7', 't = 8', 't = 9']
['t = 6', 't = 7', 't = 8', 't = 9']
['t = 5', 't = 6', 't = 7', 't = 8', 't = 9']
['t = 4', 't = 5', 't = 6', 't = 7', 't = 8', 't = 9']
['t = 3', 't = 4', 't = 5', 't = 6', 't = 7', 't = 8', 't = 9']
['t = 2', 't = 3', 't = 4', 't = 5', 't = 6', 't = 7', 't = 8', 't = 9']
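
As an aside (not part of the original answer): if you do not need the results grouped by outer iterator, you can flatten everything with `itertools.chain.from_iterable` and make a single `map` call in place of the `for the_iterator in it:` loop above:

from itertools import chain

# flatten the iterator-of-iterators into one stream of ints and
# submit the whole stream to the pool in a single call
flat = chain.from_iterable(it)
print mp_pool.map(test, flat)  # one flat list of 't = ...' strings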

But if you want to get the most out of multiprocessing (assuming you have enough processors), then you would use `map_async` so that all of the jobs can be submitted at once:

import multiprocessing
from itertools import imap, ifilter


def test(t):
    return 't = ' + str(t) # return value rather than printing


if __name__ == '__main__': # required for Windows
    mp_pool = multiprocessing.Pool(multiprocessing.cpu_count())
    it = imap(lambda x: ifilter(lambda y: x+y > 10, xrange(10)), xrange(10))
    results = [mp_pool.map_async(test, the_iterator) for the_iterator in it]
    for result in results:
        print result.get()
    mp_pool.close() # needed to ensure all processes terminate
    mp_pool.join() # needed to ensure all processes terminate

Or you might consider using `mp_pool.imap`, which, unlike `mp_pool.map_async`, does not first convert the iterable argument to a list in order to compute an optimal chunksize for submitting jobs (the documentation here is not great); instead it uses a default chunksize of 1, which is usually not desirable for very large iterables:

results = [mp_pool.imap(test, the_iterator) for the_iterator in it]
for result in results:
    print list(result) # to get a comparable printout as when using map_async
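
For a genuinely large iterable you could pass an explicit chunksize as the third argument to `imap` to batch the submissions. A sketch; the value 4 below is an arbitrary illustration, not a recommendation:

# submit tasks in batches of 4 to cut down on inter-process overhead
results = [mp_pool.imap(test, the_iterator, 4) for the_iterator in it]
for result in results:
    print list(result)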

Update: Use multiprocessing to generate lists

import multiprocessing
from itertools import imap, ifilter


def test(t):
    return 't = ' + str(t) # return value rather than printing

def generate_lists(x):
    return list(ifilter(lambda y: x+y > 10, xrange(10)))

if __name__ == '__main__': # required for Windows
    mp_pool = multiprocessing.Pool(multiprocessing.cpu_count())
    lists = mp_pool.imap(generate_lists, xrange(10))
    # lists, returned by mp_pool.imap, is an iterable
    # as each element of lists becomes available it is passed to test:
    results = mp_pool.imap(test, lists)
    # as each result becomes available, print it:
    for result in results:
        print result
    mp_pool.close() # needed to ensure all processes terminate
    mp_pool.join() # needed to ensure all processes terminate

Prints:

t = []
t = []
t = [9]
t = [8, 9]
t = [7, 8, 9]
t = [6, 7, 8, 9]
t = [5, 6, 7, 8, 9]
t = [4, 5, 6, 7, 8, 9]
t = [3, 4, 5, 6, 7, 8, 9]
t = [2, 3, 4, 5, 6, 7, 8, 9]
Booboo
  • Sorry, my example code confused you! In my actual code each iteration of the iterator produces another iterator, and the produced iterator is time-consuming to yield values, so I want to hand it to a process to do the yielding. – javin158 Dec 20 '20 at 15:24
  • I've updated the answer. I am not sure if your iterator, `it`, produces the results you expect. – Booboo Dec 20 '20 at 15:42
  • The difference between my code and your code is that I pass the iterator as the function's argument. In my real code the iterator is time-consuming to yield values, so I want to hand the iterator to a process to do the yielding. – javin158 Dec 21 '20 at 11:54
  • The difference between your code and my code is that your code is illegal. `results = [mp_pool.map_async(test, the_iterator) for the_iterator in it]` (or the next version using `mp_pool.imap`) will parallelize the processing as much as possible (depending on how many CPUs you actually have). If you are saying that the iterator itself is time-consuming, nothing in your code uses multiprocessing to generate the iterator. Are you saying you would like to use multiprocessing to generate the iterator? – Booboo Dec 21 '20 at 12:06
  • I want to know why my code is illegal. I'd like to use multiprocessing to iterate many iterators inside many processes. – javin158 Dec 21 '20 at 14:07
  • Your *current* code is invalid and throws an exception. That's what I meant. – Booboo Dec 21 '20 at 14:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226246/discussion-between-booboo-and-javin158). – Booboo Dec 21 '20 at 14:40
  • I can fix my code like this: `it = imap(lambda x: list(ifilter(lambda y: x+y > 10, xrange(10))), xrange(10))`. Here I replace the iterator with a list for the `test` function. But I really want to iterate the iterator inside the `test` function, not outside. – javin158 Dec 21 '20 at 14:41
  • You could pass the iterator as an argument to `test` using either `mp_pool.apply` or `mp_pool.apply_async` and then iterate it all you want in function `test` (see the sketch after these comments). But you would not be doing any multiprocessing of the iterations. All of the iterating would be done in a single worker process. – Booboo Dec 21 '20 at 14:51
  • My code does the iterating of the iterator in the main process and takes the resulting iterations and processes those in as many parallel worker sub-processes as possible (i.e. function `test`) assuming there is actual work to be done with the result of each iteration. – Booboo Dec 21 '20 at 14:54
  • I've also added another version that uses multiprocessing to generate the lists and as each list becomes available it is passed to `test` for processing. – Booboo Dec 21 '20 at 15:24
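
A hypothetical sketch of the `apply_async` idea from the comments (the helper `consume` is mine, not from the thread). Since passing the `ifilter` object itself would hit the same pickling problem, this version passes the plain parameter `x` and rebuilds the iterator inside the worker, where all of the iterating then happens:

import multiprocessing
from itertools import ifilter


def consume(x):
    # rebuild the (unpicklable) iterator inside the worker; only the
    # int x has to cross the process boundary
    it = ifilter(lambda y: x + y > 10, xrange(10))
    return ['t = ' + str(t) for t in it]


if __name__ == '__main__': # required for Windows
    mp_pool = multiprocessing.Pool(multiprocessing.cpu_count())
    async_results = [mp_pool.apply_async(consume, (x,)) for x in xrange(10)]
    for r in async_results:
        print r.get()
    mp_pool.close()
    mp_pool.join()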