
Consider this example:

import numpy as np
a = np.array(1)
np.save("a.npy", a)

a = np.load("a.npy", mmap_mode='r')
print(type(a))

b = a + 2
print(type(b))

which outputs

<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>

So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?

I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.

Here is an extended example showing my problem:

import numpy as np

# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))

# I want to print a single value using f and memmaps


def f(value):
    print(value[1])


# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)

# this is slow: computing b + 1 reads all of b and creates a new in-memory array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
bers

2 Answers


This is just how Python works. By default, numpy operations return a new array, so b never exists as a memmap: it is created as a plain in-memory array when + is called on a.

There are a couple of ways to work around this. The simplest is to do all operations in place:

a += 1

This requires opening the memory-mapped array for reading and writing:

a = np.load("a.npy", mmap_mode='r+')
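
Putting the in-place route together (a minimal sketch; the small test array and the file name here are illustrative assumptions):

import numpy as np

np.save("a.npy", np.arange(10.0))     # small stand-in for the real data

a = np.load("a.npy", mmap_mode='r+')  # writable memory map
a += 1                                # updates the mapped file in place
a.flush()                             # make sure the changes reach disk
print(a[0])                           # 1.0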

Of course, this is no good if you don't want to overwrite your original array.
In that case you need to specify that b should be memory-mapped:

b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)

Assignment can then be done using the out keyword provided by numpy ufuncs.

np.add(a, 2, out=b)
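
A minimal end-to-end sketch of this approach (the small test array is an illustrative assumption; note that np.memmap maps raw bytes, so "b.npy" will not actually be in the .npy format despite its name):

import numpy as np

np.save("a.npy", np.arange(10.0))   # small stand-in for the real data

a = np.load("a.npy", mmap_mode='r')            # read-only source
b = np.memmap("b.npy", mode='w+', dtype=a.dtype, shape=a.shape)
np.add(a, 2, out=b)                            # result is written through to b.npy
b.flush()
print(b[0])                                    # 2.0

As the comments below note, this still reads all of a and writes all of b; what it avoids is holding the full result in memory at once.
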
user2699
  • This looks like an interesting solution, thank you. But do I understand this correctly that when you do `np.add(a, 2, out=b)`, `a` and `b` are read and written, respectively, in their entirety? This is what I would like to avoid: I actually have enough memory available, but I would like to increase the speed of my application by only lazily opening the data and later accessing only what is needed (without knowing beforehand what that might be). From that perspective, your solution looks even worse than mine (although I have to admit, I have not yet tested it with large data). – bers Sep 04 '18 at 07:54
  • You seem to have a misunderstanding of the purpose of `memmap`. If you're indexing `a` the code you've posted in your question will only load the items indexed, which seems to be what you're asking for. – user2699 Sep 04 '18 at 12:35
  • "If you're indexing `a` ..." - correct, but this is not what I am aiming at. I want to index `b` later, taking advantage of `b` as a (simple derivative of) a `memmap`, without having to read all of `a` from disk and then store all of `b` to disk. – bers Sep 05 '18 at 11:03
  • Hmm, can you include a more complete example of what you're trying to do? It seems like applying the index you're using on `b` to `a` might work, or is it more complex than that? – user2699 Sep 05 '18 at 13:20
  • Of course I could do that, yes, but that would require changing the design of the problem. Let's say I have two huuuge arrays stored on disk. I have some function `f`, taking one argument, that I want to apply to these two. However, one of the two arrays needs to be offset (by `+1`) before passed to `f`. Let's assume I have no way of changing the function `f`. – bers Sep 05 '18 at 14:23
  • I added another example. – bers Sep 05 '18 at 14:37

Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it will almost certainly fail in novel and unexpected ways and will require substantial work to make it usable. For a very specific case it may be easier than redesigning your code to solve the problem in a better way. I'd recommend reading the subclassing examples in the numpy docs to help understand how it works.

import numpy as np  
class Defered(np.ndarray):
      """
      An array class that defers calculations applied to it, only
      calculating them when an index is requested
      """
      def __new__(cls, arr):
            arr = np.asanyarray(arr).view(cls)
            arr.toApply = []
            return arr

      def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            ## Convert all arguments to ndarray, otherwise arguments
            # of type Defered will cause infinite recursion
            # also store self as None, to be replaced later on
            newinputs = []
            for i in inputs:
                  if i is self:
                        newinputs.append(None)
                  elif isinstance(i, np.ndarray):
                        newinputs.append(i.view(np.ndarray))
                  else:
                        newinputs.append(i)

            ## Store function to apply and necessary arguments
            self.toApply.append((ufunc, method, newinputs, kwargs))
            return self

      def __getitem__(self, idx):
            ## Get index and convert to regular array
            sub = self.view(np.ndarray).__getitem__(idx)

            ## Apply stored actions
            for ufunc, method, inputs, kwargs in self.toApply:
                  inputs = [i if i is not None else sub for i in inputs]
                  sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)

            return sub

This will fail if modifications are made to it that don't use numpy's universal functions. For instance, percentile and median aren't based on ufuncs and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or apply an index that covers a substantial part of it, the entire array will effectively be loaded.
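
For completeness, a usage sketch (assuming the class above has been defined and the "memmap.npy" file from the question exists):

b = Defered(np.load("memmap.npy", mmap_mode='r'))
b = b + 1      # the addition is only recorded, nothing is computed yet
print(b[1])    # only now is the requested element read from disk and offset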

user2699
  • This works perfectly, thank you! Right along the lines of what I had expected to work ("subclassing `ndarray`"), but I would not have been able to do that myself. Other beginners may use this by `b = Defered(np.load("memmap.npy", mmap_mode='r'))`. – bers Sep 06 '18 at 08:16