
I'm working with a bunch of large numpy arrays, and as these have started to chew up too much memory lately, I wanted to replace them with numpy.memmap instances. The problem is that now and then I have to resize the arrays, and I'd prefer to do that in place. This worked quite well with ordinary arrays, but trying it on a memmap raises a complaint that the data might be shared, and even disabling the refcheck does not help:

a = np.arange(10)
a.resize(20)
a
>>> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

a = np.memmap('bla.bin', dtype=int)
a
>>> memmap([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

a.resize(20, refcheck=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-f1546111a7a1> in <module>()
----> 1 a.resize(20, refcheck=False)

ValueError: cannot resize this array: it does not own its data

Resizing the underlying mmap buffer works perfectly fine. The problem is how to reflect these changes in the array object. I've seen this workaround, but unfortunately it doesn't resize the array in place. There is also some numpy documentation about resizing mmaps, but it clearly isn't working, at least with version 1.8.0. Any other ideas how to override the built-in resizing checks?
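
For illustration, this is roughly what resizing the buffer itself looks like (a sketch assuming numpy 1.8, where the buffer is exposed as a._mmap, and mmap.resize takes a size in bytes):

a = np.memmap('bla.bin', dtype=int)
a._mmap.resize(20 * a.itemsize)  # the file grows on disk
a.shape
>>> (10,)

The buffer grows, but the array still reports the old shape.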

Michael

2 Answers


The issue is that the OWNDATA flag is False when you create your array. You can change that by requiring the flag to be True (the 'O' requirement) when you create the array:

>>> a = np.require(np.memmap('bla.bin', dtype=int), requirements=['O'])
>>> a.shape
(10,)
>>> a.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> a.resize(20, refcheck=False)
>>> a.shape
(20,)

The only caveat is that np.require may make a copy of the data to ensure the requirements are met.
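
A quick way to see that copy (a sketch assuming a fresh, zero-filled bla.bin; this mirrors what the comments below report) is that writes to the required array never reach the file:

>>> a = np.require(np.memmap('bla.bin', dtype=int), requirements=['O'])
>>> a[0] = 42
>>> np.memmap('bla.bin', dtype=int)[0]  # re-open the file: the write stayed in memory
0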

Edit to address saving:

If you want to save the resized array to disk, you can save the memmap as a .npy formatted file and re-open it as a numpy.memmap when you need it again:

>>> a[9] = 1
>>> np.save('bla.npy',a)
>>> b = np.lib.format.open_memmap('bla.npy', dtype=int, mode='r+')
>>> b
memmap([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Edit to offer another method:

You may get close to what you're looking for by resizing the base mmap (a.base or a._mmap, which stores the buffer as raw bytes) and "reloading" the memmap:

>>> a = np.memmap('bla.bin', dtype=int)
>>> a
memmap([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> a[3] = 7
>>> a
memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0])
>>> a.flush()
>>> a = np.memmap('bla.bin', dtype=int)
>>> a
memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0])
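>>> # grow the underlying buffer to 20 elements of 8 bytes each (mmap.resize takes bytes)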
>>> a.base.resize(20*8)
>>> a.flush()
>>> a = np.memmap('bla.bin', dtype=int)
>>> a
memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
wwwslinger
    Interesting. Unfortunately, for me it looks like it always creates a copy in memory. If I try writing to the array, flushing, deleting, and reopening the array, it's empty again as before. So I guess the data is never really written to the disk. – Michael Jan 10 '14 at 07:56
  • I added an example of how you could save it and re-open later as a memmap – wwwslinger Jan 13 '14 at 02:35
  • @wwwslinger The problem with your answer is that if `a` is too big to fit in core memory (why else would you use a memory-mapped array?), then creating another copy of it in core is clearly going to cause some problems. You'd be better off creating a new memory-mapped array with the correct size from scratch, then filling it in chunks with the contents of `a`. – ali_m Jan 13 '14 at 04:21
  • ali_m is right. Saving is not the main issue; it's only a symptom that the data is no longer referenced but copied when using `np.require`. I'd like to accept the answer, but unfortunately it still doesn't fix the problem: `%memit a.resize(int(100e6), refcheck=False)` >>> peak memory: 819.27 MiB, increment: 763.00 MiB – Michael Jan 13 '14 at 20:00
  • For the last edit: I was aware of that, but again, the last step is no longer an in-place change. This might seem like a minor issue, and in such a small example it surely is, but I'm using this in a larger context where the reference to the array gets passed around wildly, so I'd have to wrap it again, which I'd like to avoid. – Michael Jan 15 '14 at 12:39
  • Did you check the memory changes for the last edit? I used large arrays (100e6) and it performed as if in place. You would have to create your own "resize" function to resize the base and reload the mmap into the memmap, yes (see the sketch after these comments). – wwwslinger Jan 16 '14 at 06:03
  • Great answer. To change the file on disk, only the `.base.resize()` method worked for me. Otherwise, loading with `np.memmap('bla.bin', dtype='int32', mode='r')` still remembered the old size. – Ataxias Feb 12 '21 at 04:29
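
A minimal sketch of the "resize the base and reload" helper described in the comments above (the function name is hypothetical; it assumes a 1-D memmap opened with mode='r+' and uses numpy's documented filename attribute):

import numpy as np

def resize_memmap(a, new_len):
    # flush pending writes, grow the underlying mmap (mmap.resize takes bytes),
    # then re-open a mapping over the grown file with the new shape
    a.flush()
    a.base.resize(new_len * a.itemsize)
    return np.memmap(a.filename, dtype=a.dtype, mode='r+', shape=(new_len,))

a = np.memmap('bla.bin', dtype=int, mode='r+')
a = resize_memmap(a, 20)  # the caller must rebind the name, so this is still not truly in place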

If I'm not mistaken, this achieves essentially what @wwwslinger's second solution does, but without having to manually specify the size of the new memmap in bytes:

In [1]: a = np.memmap('bla.bin', mode='w+', dtype=int, shape=(10,))

In [2]: a[3] = 7

In [3]: a
Out[3]: memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0])

In [4]: a.flush()

# this will append to the original file as much as is necessary to satisfy
# the new shape requirement, given the specified dtype
In [5]: new_a = np.memmap('bla.bin', mode='r+', dtype=int, shape=(20,))

In [6]: new_a
Out[6]: memmap([0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [7]: a[-1] = 10

In [8]: a
Out[8]: memmap([ 0,  0,  0,  7,  0,  0,  0,  0,  0, 10])

In [9]: a.flush()

In [11]: new_a
Out[11]: 
memmap([ 0,  0,  0,  7,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0])

This works well when the new array needs to be bigger than the old one, but I don't think this approach will allow the memory-mapped file to be automatically truncated when the new array is smaller.

Manually resizing the base, as in @wwwslinger's answer, seems to allow the file to be truncated, but it doesn't reduce the size of the array.

For example:

# this creates a memory mapped file of 10 * 8 = 80 bytes
In [1]: a = np.memmap('bla.bin', mode='w+', dtype=int, shape=(10,))

In [2]: a[:] = range(1, 11)

In [3]: a.flush()

In [4]: a
Out[4]: memmap([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

# now truncate the file to 40 bytes
In [5]: a.base.resize(5*8)

In [6]: a.flush()

# the array still has the same shape, but the truncated part is all zeros
In [7]: a
Out[7]: memmap([1, 2, 3, 4, 5, 0, 0, 0, 0, 0])

In [8]: b = np.memmap('bla.bin', mode='r+', dtype=int, shape=(5,))

# you still need to create a new np.memmap to change the size of the array
In [9]: b
Out[9]: memmap([1, 2, 3, 4, 5])
ali_m
  • This is a similar approach to the one in the workaround I had posted. I would prefer an in-place solution, as it would save me from encapsulating the object even further. Anyway, this is probably what I'll have to live with in the end. – Michael Jan 14 '14 at 10:36
  • @Michael If you haven't already, you should probably report this issue to the numpy maintainers. At the very least, the docstring for the `np.memmap` class should be updated to reflect the fact that it isn't currently possible to resize memory-mapped arrays in place. – ali_m Jan 14 '14 at 10:43
  • I haven't, but as it looks like there is no easy solution to this, I will. – Michael Jan 14 '14 at 11:00