
I have a sequential function that sorts through lists and performs tasks. For example (this is not the actual code, but it is analogous):

def myFunction(items):
    for item in items:
        # unpack the four sublists of this item
        sublist_a, sublist_b, sublist_c, sublist_d = item

        # these loops must run in order, one after the other
        for row in sublist_a:
            pass # (do tasks....)
        for row in sublist_b:
            pass # (do tasks....)
        for row in sublist_c:
            pass # (do tasks....)
        for row in sublist_d:
            pass # (do tasks....)
    print "COMPLETE"

So this is overly simplified, but essentially these lists are quite large, and the order of execution is important (i.e. the `for row in ...` loops must run one after the other), so I would like to split the work between the available cores on my system.

Could someone please suggest a method for doing so?

I have never used the multiprocessing library, but it seems to be the best option for this in Python.

sidewaiise
  • "the order of execution is important", as in "needs to be done sequentially", as in "cannot be split between cores"? – Kijewski Sep 22 '14 at 02:51
  • Some of the work can be split; for example, I think you could split the work for each of the `for` loops, but the loops would need to be executed one after the other. – sidewaiise Sep 22 '14 at 02:52
  • @sidewaiise What are you doing to each `row`? It seems that's the only piece that can actually be parallelized, right? – dano Sep 22 '14 at 02:55
  • @dano Yes, this is true. Each loop simply cleans the list data in different ways. How do I parallelize it, though? Do I need to turn each of these loops into functions that utilize the `multiprocessing` lib? – sidewaiise Sep 22 '14 at 02:59

1 Answer


You are looking for a `multiprocessing.Pool`:

from multiprocessing import Pool

def function_to_process_a(row):
    return row * 42 # or something similar

# define function_to_process_b, _c and _d the same way for the other sublists

# replace 4 by the number of cores that you want to utilize
# (note: using Pool as a context manager requires Python 3.3+)
with Pool(processes=4) as pool:
    # The lists are processed one after another,
    # but the items within each list are processed in parallel.
    processed_sublist_a = pool.map(function_to_process_a, sublist_a)
    processed_sublist_b = pool.map(function_to_process_b, sublist_b)
    processed_sublist_c = pool.map(function_to_process_c, sublist_c)
    processed_sublist_d = pool.map(function_to_process_d, sublist_d)
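
Each `pool.map()` call blocks until its whole list has been processed, so the four lists are still handled strictly in order; only the per-row work within each list runs in parallel.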

Edit: As sidewaiise pointed out in the comments, it is preferable to use this pattern:

from contextlib import closing
from multiprocessing import Pool, cpu_count

with closing(Pool(processes=cpu_count())) as pool:
    pass # do something
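
Putting the two snippets together, here is a minimal, self-contained sketch of how the question's function could look on Python 2.7 (the version visible in the tracebacks below). The `clean_a` to `clean_d` workers are hypothetical placeholders for the real per-row cleaning steps; they must be plain module-level functions, otherwise `pool.map()` raises the `PicklingError` quoted in the comments:

from contextlib import closing
from multiprocessing import Pool, cpu_count

# hypothetical stand-ins for the real per-row cleaning steps;
# they must be module-level (not lambdas or nested functions),
# otherwise pool.map() cannot pickle them
def clean_a(row):
    return row # (do tasks....)

def clean_b(row):
    return row # (do tasks....)

def clean_c(row):
    return row # (do tasks....)

def clean_d(row):
    return row # (do tasks....)

def myFunction(items):
    with closing(Pool(processes=cpu_count())) as pool:
        for item in items:
            sublist_a, sublist_b, sublist_c, sublist_d = item
            # each map() blocks until its sublist is done, so the
            # four loops still run strictly one after the other
            sublist_a = pool.map(clean_a, sublist_a)
            sublist_b = pool.map(clean_b, sublist_b)
            sublist_c = pool.map(clean_c, sublist_c)
            sublist_d = pool.map(clean_d, sublist_d)
    print "COMPLETE"

if __name__ == "__main__":
    # tiny smoke test: one item holding four two-row sublists
    myFunction([([1, 2], [3, 4], [5, 6], [7, 8])])

(Background: `Pool` only became a context manager in Python 3.3, which is why `with Pool(...) as pool:` raises `AttributeError: __exit__` on 2.7; `contextlib.closing` supplies the missing `__exit__` by calling `pool.close()`. It is also no problem that the real workers return `None` and only write to log files; `pool.map()` then simply collects a list of `None` values.)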
Kijewski
  • Actually Kay, just to confirm, are `sublist_a`, `sublist_b`, etc. inputs? E.g. similar to calling `function_to_process_a(sublist_a)`? – sidewaiise Sep 22 '14 at 03:06
  • `sublist_a` should be iterable. `function_to_process_a` gets a single row as argument, just like the ordinary [`map(function, iterable)`](https://docs.python.org/3/library/functions.html#map). – Kijewski Sep 22 '14 at 03:11
  • Ok great. Now I get `AttributeError: __exit__`. Any ideas why this would happen? – sidewaiise Sep 22 '14 at 03:50
  • Actually I tried replacing the `with Pool(processes=2) as pool` statement with `pool = Pool(processes=2)` and the error was different: `PROCESSORS: 2 Exception in thread Thread-2: Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 808, in __bootstrap_inner self.run() File ".../2.7/lib/python2.7/threading.py", line 761, in run PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed` – sidewaiise Sep 22 '14 at 04:19
  • You have to use `with Pool(…) as …:` – Kijewski Sep 22 '14 at 05:00
  • Yes, but I receive the `AttributeError: __exit__` if I use that. – sidewaiise Sep 22 '14 at 05:14
  • From your error message I guess that the elements in the sublists are not [pickle-able](https://docs.python.org/3/library/pickle.html). Are you working with database rows, that use proxies? For `pool.map()` you should only provide base types as argument (dict, list, tuple, str, bytes, int). – Kijewski Sep 22 '14 at 05:26
  • I'm using lists `[]` as the sublists. There are some nested lists though, for example `[var1, [str_a,str_b,str_c,...]]`; the rows in these lists typically come from CSV. I was trying a csv_reader object but converted it to a list also. The error still occurs. – sidewaiise Sep 22 '14 at 05:32
  • Also... (thanks for helping on this one) -> my functions do not return anything. They simply write to log files at the end and return None. Is this a problem for using pool.map? – sidewaiise Sep 22 '14 at 05:41
  • Had to use `from contextlib import closing`, then `with closing(Pool(processes=multiprocessing.cpu_count())) as pool:`. Thanks, your answer basically answered the question. – sidewaiise Sep 22 '14 at 12:37
  • I never realized that one should close the Pool manually. Thanks, now we have taught each other. :) – Kijewski Sep 22 '14 at 13:37