
I have a list of strings, and to every string I apply some changes that you can see in wordify(). Now, to speed this up, I split the list into sublists using chunked() (the number of sublists is the number of CPU cores - 1). That way I get lists that look like [[,,],[,,],[,,],[,,]].
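
For example, with eight strings split into four chunks, the structure would look something like this (made-up values, just to illustrate the shape):

chunks = [
    ["alpha,1", "bravo,2"],
    ["charlie,3", "delta,4"],
    ["echo,5", "foxtrot,6"],
    ["golf,7", "hotel,8"],
]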

What I'm trying to achieve:

I want to run wordify() on each of these sublists simultaneously and get each processed sublist back as a separate list. I want to wait until all processes finish and then join these sublists into one list. The approach below does not work.

import multiprocessing
from multiprocessing import Pool
from contextlib import closing

def readFiles():
    # read every line of the input file into a list of strings
    words = []
    with open("somefile.txt") as f:
        w = f.readlines()
    words = words + w
    return words


def chunked(words, num_cpu):
    # split words into num_cpu roughly equal consecutive sublists
    avg = len(words) / float(num_cpu)
    out = []
    last = 0.0
    while last < len(words):
        out.append(words[int(last):int(last + avg)])
        last += avg
    return out


def wordify(chunk, wl):
    # keep only the text before the first comma of every string in the chunk
    wl.append([line.split(",", 1)[0] for line in chunk])
    return wl


if __name__ == '__main__':
    num_cpu = multiprocessing.cpu_count() - 1
    words = readFiles()
    chunked = chunked(words, num_cpu)
    wordlist = []
    wordify(words, wordlist) # works
    with closing(Pool(processes = num_cpu)) as p:
        p.map(wordify, chunked, wordlist) # fails
doc

1 Answer

You have written your code so that you're just passing a single-argument function to map; it's not smart enough to know that you're hoping it will pass wordlist into the second argument of your function.

TBH partial function application is a bit clunky in Python, but you can use functools.partial:

from functools import partial
# bind wl by keyword so each chunk still arrives as the first argument
p.map(partial(wordify, wl=wordlist), chunked)
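
For completeness, here is a rough sketch of how the whole pipeline could be tied together, collecting the per-chunk results from the return value of `Pool.map` and joining them afterwards (an untested sketch that reuses `readFiles`, `chunked` and `wordify` from the question; the workers only see copies of any list you pass in, so the parent has to rebuild `wordlist` from what `wordify` returns):

from contextlib import closing
from functools import partial
from itertools import chain
from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    num_cpu = cpu_count() - 1
    words = readFiles()
    chunks = chunked(words, num_cpu)   # renamed to avoid shadowing the function
    with closing(Pool(processes=num_cpu)) as p:
        # each worker receives its own copy of the empty list, fills it and
        # returns it; Pool.map gathers the returned lists in chunk order
        results = p.map(partial(wordify, wl=[]), chunks)
    # wordify appends one sublist per call, so flatten twice to rejoin everything
    wordlist = list(chain.from_iterable(chain.from_iterable(results)))

Collecting `Pool.map`'s return value this way avoids any need for shared state between the processes.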
maxymoo
  • First solution gives me: `cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed` Second solution works but it is slower than doing it singlethreaded: `singlethreaded: 0.423000097275`, `multithreaded: 0.605999946594` – doc Aug 01 '16 at 01:05
  • You shouldn't expect a speed boost unless the work being done is substantial. Splitting a line by the first comma is little work, and it's almost certainly more time-consuming just to send the strings back and forth over interprocess pipes. It's not really that the latter is notably expensive, it's more that what you're _doing_ with the strings is very cheap. – Tim Peters Aug 01 '16 at 01:17
  • The amount of data this will be done to will increase substantially. Is there a more efficient way to do this, as I absolutely do not see any need to send strings between processes while the sublists are being modified? Stuff should only be sent back and forth when rejoining the sublists. – doc Aug 01 '16 at 01:24
  • What I'm saying has nothing to do with the amount of data. It's solely about the expense of interprocess communication versus the expense of "useful work" done. _Everything_ you pass as an argument is pickled (serialized into a string format) and sent over an interprocess pipe, and likewise for _everything_ returned by a worker process. The processes do not share memory - all communication of all values between processes occurs by sending pickle strings over pipes. – Tim Peters Aug 01 '16 at 01:28
  • Ok thanks. I have a similar use case where I use a lot of regex which should be heavier on CPU. I'll see if it is more effective there. – doc Aug 01 '16 at 01:36
  • i forgot that lambda doesn't work, i'll remove that from the answer – maxymoo Aug 01 '16 at 01:38
  • If you are looking for a multi-argument `map` and the ability to serialize `lambda` and most other python objects, you might want to check out the fork of `multiprocessing` in `pathos.multiprocessing`. (shameless plug - I'm the author) – Mike McKerns Aug 15 '16 at 14:25
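
If you do try the pathos route mentioned in the last comment, a minimal sketch could look like the following (assuming pathos is installed and its documented `ProcessingPool.map` accepts one iterable per function argument; this reuses `readFiles`, `chunked` and `wordify` from the question and is not code from the original post):

from pathos.multiprocessing import ProcessingPool

if __name__ == '__main__':
    num_cpu = 4
    chunks = chunked(readFiles(), num_cpu)
    pool = ProcessingPool(num_cpu)
    # pathos' map can take one iterable per argument of wordify, so the extra
    # wordlist argument can be passed without functools.partial
    results = pool.map(wordify, chunks, [[] for _ in chunks])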