
The problem

My application extracts a list of zip files in memory and writes the data to a temporary file. It then memory-maps the data in the temp file for use in another function. When I do this in a single process it works fine: reading the data doesn't noticeably affect memory, and maximum RAM usage is around 40 MB. However, when I do the same thing using concurrent.futures, RAM usage goes up to 500 MB.

I have looked at this example and I understand I could submit the jobs in a nicer way to save memory during processing. However, I don't think my issue is related, as I am not running out of memory while processing. What I don't understand is why the memory is still held even after the memory maps are returned, nor what is actually in that memory, since doing the same thing in a single process does not load the data into memory.

Can anyone explain what is actually in the memory and why this is different between single and parallel processing?

P.S. I used memory_profiler to measure the memory usage.

Code

Main code:

import concurrent.futures
import os
import tempfile
import time
from zipfile import ZipFile, ZIP_DEFLATED

import numpy as np

def main():
    datadir = './testdata'
    files = os.listdir(datadir)
    files = [os.path.join(datadir, f) for f in files]
    datalist = download_files(files, multiprocess=False)
    print(len(datalist))
    time.sleep(15)
    del datalist  # See here that memory is freed up
    time.sleep(15)

Other functions:

def download_files(filelist, multiprocess=False):
    datalist = []
    if multiprocess:
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            returned_future = [executor.submit(extract_file, f) for f in filelist]
        for future in returned_future:
            datalist.append(future.result())
    else:
        for f in filelist:
            datalist.append(extract_file(f))
    return datalist

def extract_file(input_zip):
    buffer = next(iter(extract_zip(input_zip).values()))
    with tempfile.NamedTemporaryFile() as temp_logfile:
        temp_logfile.write(buffer)
        del buffer
        # The temp file is deleted when this block exits; the existing mapping stays valid on Unix
        data = np.memmap(temp_logfile, dtype='float32', shape=(2000000, 4), mode='r')
    return data

def extract_zip(input_zip):
    with ZipFile(input_zip, 'r') as input_zip:
        return {name: input_zip.read(name) for name in input_zip.namelist()}

Helper code for data

I can't share my actual data, but here's some simple code to create files that demonstrate the issue:

for i in range(1, 16):
    outdir = './testdata'
    outfile = 'file_{}.dat'.format(i)
    fp = np.memmap(os.path.join(outdir, outfile), dtype='float32', mode='w+', shape=(2000000, 4))
    fp[:] = np.random.rand(*fp.shape)
    del fp
    with ZipFile(outdir + '/' + outfile[:-4] + '.zip', mode='w', compression=ZIP_DEFLATED) as z:
        z.write(outdir + '/' + outfile, outfile)
S.B.G
    I'm not sure you can pass a `np.memmap` through `pickle` this way. Can you verify that `future.result()._mmap` is a map of the same file as `data._mmap` in the child tasks? – abarnert Aug 14 '18 at 15:35
  • `future.result()` type is an np.memmap, and I can see that it is returning the correct data. However, `future.result()._mmap` is showing as `None`, whereas `data._mmap` in the child function is showing as an `mmap.mmap` object. What do you mean by saying I can't pass the `np.memmap` through `pickle`? – S.B.G Aug 14 '18 at 15:56
  • Multiprocessing (including via `concurrent.futures`) passes data between processes by pickling it, passing the pickle over an IPC pipe or queue, then unpickling on the other side. If you try this with an `mmap.mmap` you get an exception. If you try it with a `np.memmap` I'm not sure what happens, but my suspicion is that it pickles the array with all of its data, sends that, and unpickles it, so instead of getting an array mapping the same file, you end up with a copy of the array in newly-allocated memory. Which would explain exactly what you're seeing. – abarnert Aug 14 '18 at 17:15
  • I don’t know what it means when `data._mmap` is None (I’ll look it up when I get a chance), but that doesn’t sound promising—it seems at least possible that means your `memmap` really is a copy rather than a `mmap` under the covers. If so, the simplest solution is probably to just pass the filename back as the result, and have the main process memmap that filename. – abarnert Aug 14 '18 at 17:18
  • You're right, passing `mmap.mmap` causes `BrokenProcessPool`. I think you're right about `np.memmap` too: it seems to be passing a copy of the data. It's difficult to verify, since the returned type still says `np.memmap`, but the memory usage is pretty much the size of the data. – S.B.G Aug 15 '18 at 09:22
  • @abarnert the solution to this was to rewrite it so that the process just passes the file location back, as you said. However, this means I cannot make my temp file auto-delete, which could cause clean-up issues. I think the better solution may be to restructure so that each process runs the full analysis on a set of files and then returns the result. That is a bit of a rewrite though, so I think this issue can be closed. Do you want to post your answer? – S.B.G Aug 30 '18 at 08:40
  • OK, posted. Meanwhile, if you want to do a bit of research into this: it might be a bug (worth filing against numpy) that `memmap` objects can be pickled even when they have an underlying `mmap` that can't, or it might be an intentional design decision (although even then, it may be a docs bug that the `memmap` docs don't mention it). – abarnert Aug 30 '18 at 17:41

1 Answer

The problem is that you're trying to pass an np.memmap between processes, and that doesn't work.

The simplest solution is to instead pass the filename, and have the child process memmap the same file.


When you pass an argument to a child process or pool method via multiprocessing, or return a value from one (including doing so indirectly via a ProcessPoolExecutor), it works by calling pickle.dumps on the value, passing the pickle across processes (the details vary, but it doesn't matter whether it's a Pipe or a Queue or something else), and then unpickling the result on the other side.
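
Roughly speaking, every argument and every result makes a trip like this (a minimal sketch of the effect on your objects, not of the executor's actual plumbing):

import pickle

def cross_process_roundtrip(obj):
    # Approximately what happens to every argument and return value that
    # crosses the process boundary: serialize in one process, deserialize
    # in the other.
    return pickle.loads(pickle.dumps(obj))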

A memmap is basically just an mmap object with an ndarray allocated in the mmapped memory.
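
You can see both pieces on one of the test files from your question (this assumes the `./testdata/file_1.dat` file created by the helper code above exists):

import numpy as np

data = np.memmap('./testdata/file_1.dat', dtype='float32', mode='r', shape=(2000000, 4))
print(isinstance(data, np.ndarray))  # True: np.memmap is an ndarray subclass
print(type(data._mmap))              # <class 'mmap.mmap'>: the underlying map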

And Python doesn't know how to pickle an mmap object. (If you try it directly you get a TypeError or PicklingError; when it happens inside a ProcessPoolExecutor it surfaces as a BrokenProcessPool error.)
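
A quick way to see that, sketched against a throwaway temp file (the exact exception type and message vary with the Python version):

import mmap
import pickle
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b'\x00' * mmap.PAGESIZE)
    f.flush()
    m = mmap.mmap(f.fileno(), 0)
    try:
        pickle.dumps(m)
    except Exception as e:
        print(type(e).__name__, e)  # e.g. TypeError: cannot pickle 'mmap.mmap' object
    m.close()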

A np.memmap can be pickled, because it's just a subclass of np.ndarray—but pickling and unpickling it actually copies the data and gives you a plain in-memory array. (If you look at data._mmap, it's None.) It would probably be nicer if it gave you an error instead of silently copying all of your data (the pickle-replacement library dill does exactly that: TypeError: can't pickle mmap.mmap objects), but it doesn't.
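
You can reproduce the silent copy without any subprocesses at all; a plain pickle round trip (again assuming the test data from the question exists) shows the mapping is gone afterwards:

import pickle
import numpy as np

data = np.memmap('./testdata/file_1.dat', dtype='float32', mode='r', shape=(2000000, 4))
copy = pickle.loads(pickle.dumps(data))  # what crossing the process boundary effectively does

print(type(copy))            # still reports <class 'numpy.memmap'>...
print(copy._mmap)            # ...but there's no mapping behind it: None
print((copy == data).all())  # the values survive, as a full in-memory copy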


It's not impossible to pass the underlying file descriptor between processes—the details are different on every platform, but all of the major platforms have a way to do that. And you could then use the passed fd to build an mmap on the receiving side, then build a memmap out of that. And you could probably even wrap this up in a subclass of np.memmap. But I suspect if that weren't somewhat difficult, someone would have already done it, and in fact it would probably already be part of dill, if not numpy itself.

Another alternative is to explicitly use the shared memory features of multiprocessing, and allocate the array in shared memory instead of an mmap.
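
Here's a minimal sketch of that approach with a fixed-size float32 array like the one in your question (the function and variable names are just illustrative):

import ctypes
import multiprocessing

import numpy as np

def fill(shared_arr, shape):
    # View the shared buffer as a NumPy array; nothing is pickled or copied,
    # both processes see the same memory.
    arr = np.frombuffer(shared_arr.get_obj(), dtype='float32').reshape(shape)
    arr[:] = np.random.rand(*shape)

if __name__ == '__main__':
    shape = (2000000, 4)
    shared_arr = multiprocessing.Array(ctypes.c_float, shape[0] * shape[1])
    p = multiprocessing.Process(target=fill, args=(shared_arr, shape))
    p.start()
    p.join()
    result = np.frombuffer(shared_arr.get_obj(), dtype='float32').reshape(shape)
    print(result[0])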

But the simplest solution is, as I said at the top, to just pass the filename instead of the object, and let each side memmap the same file. This does, unfortunately, mean you can't just use a delete-on-close NamedTemporaryFile (although the way you were using it was already non-portable and wouldn't have worked on Windows the same way it does on Unix), but changing that is still probably less work than the other alternatives.
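
One possible rework along those lines, as a sketch only: it reuses your `extract_zip` and hard-coded shape, and it makes the parent responsible for deleting the temp file once the memmap is no longer needed (the `load_and_cleanup` helper is just an illustration, not part of your original code):

import os
import tempfile

import numpy as np

def extract_file(input_zip):
    # Runs in the worker process: write the decompressed data to a temp file
    # that is NOT deleted automatically, and return only its path (a small
    # string that pickles cheaply).
    buffer = next(iter(extract_zip(input_zip).values()))
    fd, path = tempfile.mkstemp(suffix='.dat')
    with os.fdopen(fd, 'wb') as temp_logfile:
        temp_logfile.write(buffer)
    return path

def load_and_cleanup(path):
    # Runs in the parent process: memmap the file the worker wrote, use it,
    # then remove the file once the mapping has been dropped.
    data = np.memmap(path, dtype='float32', shape=(2000000, 4), mode='r')
    result = data.sum()  # placeholder for the real analysis
    del data             # on CPython this closes the mapping immediately
    os.remove(path)
    return result

If the shape differs per file, the worker can return the shape alongside the path.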

abarnert