Use case: processing enormous images. I use memory-mapped temporary files when the intermediate dataset exceeds physical memory. I have no need to keep the intermediate results on disk once I'm done with them. When I delete them, numpy seems to flush all their contents to disk first and only then remove the file from the file system. The flushing taxes the I/O resources and the file system, which, to my understanding, is logically unnecessary given that the file is deleted right afterwards.

Is it possible to close a memmap'd temporary file without flushing its contents?

Jesse Meyer

1 Answer

You need to open your memory map as copy-on-write, using the 'c' mode. From the numpy.memmap documentation:

mode : {'r+', 'r', 'w+', 'c'}, optional

The file is opened in this mode:

'r'     Open existing file for reading only.
'r+'    Open existing file for reading and writing.
'w+'    Create or overwrite existing file for reading and writing.
'c'     Copy-on-write: assignments affect data in memory, but changes 
        are not saved to disk. The file on disk is read-only.

Default is 'r+'.

So the default allows reading and writing, but altering a memory-mapped file in this manner will indeed cause all changes to be written back. Flushing can happen at any time, but a flush will certainly take place when you close it.
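As a minimal sketch of that write-back behaviour, the snippet below maps a small scratch file in 'r+' mode, modifies it, and reads the file back. The function name `writeback_demo`, the file name `scratch.dat`, and the 1024-byte size are made up for illustration; the explicit `flush()` call only makes the timing of the write-back deterministic.

```python
import os
import tempfile

import numpy as np


def writeback_demo():
    """Return the sum of the bytes on disk after writing through an 'r+' memmap."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical scratch file, initially all zeros.
        path = os.path.join(tmpdir, "scratch.dat")
        np.zeros(1024, dtype=np.uint8).tofile(path)

        arr = np.memmap(path, dtype=np.uint8, mode="r+")
        arr[:] = 7    # modify the mapped pages
        arr.flush()   # force the dirty pages back to the file
        del arr       # drop the mapping before reading the file again

        return int(np.fromfile(path, dtype=np.uint8).sum())


print(writeback_demo())  # non-zero: the changes reached the file
```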

When you use 'c' as the mode, writing to a page causes that page to be copied (transparently), and the copied pages are discarded again when you close the file.
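The copy-on-write behaviour can be checked with a small experiment. This is only a sketch under assumptions: the function name `cow_demo`, the file name `scratch.dat`, and the array size are hypothetical, and the mapping is made over an existing file, because a copy-on-write map needs data to read to begin with.

```python
import os
import tempfile

import numpy as np


def cow_demo():
    """Return (sum seen through the 'c' memmap, sum of the bytes left on disk)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical backing file, initially all zeros.
        path = os.path.join(tmpdir, "scratch.dat")
        np.zeros(1024, dtype=np.uint8).tofile(path)

        arr = np.memmap(path, dtype=np.uint8, mode="c")
        arr[:] = 255                     # changes land in private, copied pages
        in_memory_sum = int(arr.sum())   # the process still sees its own changes
        del arr                          # discard the mapping and the copied pages

        on_disk_sum = int(np.fromfile(path, dtype=np.uint8).sum())
    return in_memory_sum, on_disk_sum


in_memory, on_disk = cow_demo()
print(in_memory, on_disk)  # the file on disk keeps its original zeros
```

The changes stay visible inside the process for as long as the mapping lives, while the file itself is never modified.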

Note that when you write to enough pages, the OS will have to swap memory pages to disk. This is no different from any other process using more memory than is available. When you close the mmapped file, any such copied pages (swapped to disk or still in memory) are discarded again.

Martijn Pieters
  • Thanks Martijn. I had noticed 'c' mode, but resisted trying it because I was unsure what 'saved' meant. I do not want paged out data to be lost, because I may need to retrieve it again. Does 'c' mode protect paged out data? In other words, I do not want persistent data between inter-processes (save to disk), but I do want persistent data intra-process (temporary stores). – Jesse Meyer Jul 06 '18 at 14:31
  • @JesseMeyer: when you try to access the page that was swapped out, the OS will swap it back in. The changes are not lost, you don't go back to the original file contents. – Martijn Pieters Jul 06 '18 at 14:39
  • @JesseMeyer: copy-on-write gives you per-process private memory pages to store your changes, so the data is persistent intra-process. – Martijn Pieters Jul 06 '18 at 14:40
  • I've been working on making this change to test if it works before I accept this as the answer and I've run into problems that are not directly part of this question. Once I iron them out, I will report back. – Jesse Meyer Jul 06 '18 at 16:34
  • So have you flushed the named temp file contents to disk? – Martijn Pieters Jul 06 '18 at 18:08
  • I'm wanting to avoid precisely that, if that means flushing hundreds of GIGs of data to disk, just to wipe them away. – Jesse Meyer Jul 06 '18 at 22:33
  • I'm not following then. A copy-on-write mmap needs data to read to begin with. If you are filling an empty array backed by a writable mmapped file on disk, the OS expects you want to write the data to disk eventually; it's not going to be a memory extension that way. Just use regular arrays and rely on normal memory management to swap in and out. – Martijn Pieters Jul 07 '18 at 13:03
  • If you feel you get more control over memory by using a temp file backed mmapped array, then clear the array contents before discarding it. – Martijn Pieters Jul 07 '18 at 13:05
  • Thanks. I'll try clearing out the memory before discarding. Using regular arrays was causing out-of-memory problems where, presumably, the underlying allocator was just giving up once physical RAM was used up instead of relying on the kernel to swap. I've accepted your response as the answer. – Jesse Meyer Jul 07 '18 at 23:55