
I want to load a .npy file from Google Storage (gs://project/file.npy) into my Google ML job as training data. Since the file is more than 10 GB, I want to use the mmap_mode option of numpy.load() so I don't run out of memory.

Background: I use Keras with fit_generator and a Keras Sequence to load batches of data from the .npy file stored on Google Storage.

To access Google Storage I'm using BytesIO, since not every library can access Google Storage directly. This code works fine without mmap_mode='r':

from tensorflow.python.lib.io import file_io
from io import BytesIO
import numpy as np

filename = 'gs://project/file'

# Read the whole file from Google Storage into memory, then load it as an array.
x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode=True))
x = np.load(x_file)

If I activate mmap_mode, I get this error:

TypeError: expected str, bytes or os.PathLike object, not BytesIO

I don't understand why it no longer accepts the BytesIO object.

Code including mmap_mode:

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode=True))
x = np.load(x_file, mmap_mode='r')

Trace:

File "[...]/numpy/lib/npyio.py", line 444, in load return format.open_memmap(file, mode=mmap_mode) File "[...]/numpy/lib/format.py", line 829, in open_memmap fp = open(os_fspath(filename), 'rb') File "[...]/numpy/compat/py3k.py", line 237, in os_fspath "not " + path_type.name) TypeError: expected str, bytes or os.PathLike object, not BytesIO

DΦC__WTF
  • Look at the docs (or code) of `np.lib.npyio.format.open_memmap`. It says `The name of the file on disk. This may *not* be a file-like object`. After dealing with the `save/load` header, this code uses `np.memmap`, so it is limited to what that can handle. – hpaulj Dec 30 '19 at 18:04
  • In memmap mode, the file is accessed 'randomly'. In ordinary load, access is sequential - one byte after another without any backtracking or seeking. – hpaulj Dec 30 '19 at 18:11
  • Isn't the biggest issue that TensorFlow's file_io.read_file_to_string() reads the whole file? So this won't work with memmap anyway, right? – DΦC__WTF Dec 30 '19 at 20:51
  • @hpaulj do you have any idea which lib I could use instead? I also don't quite follow your comment about random vs. sequential access – DΦC__WTF Dec 30 '19 at 20:52
  • Would it be possible to split your dataset so you can upload and process it stepwise? – Happy-Monad Dec 31 '19 at 13:13
  • Also, could you check this [thread](https://stackoverflow.com/questions/14248333/google-cloud-storage-seeking-within-files) to see if the idea there fits your needs? – Happy-Monad Dec 31 '19 at 13:21
  • Yes, I can split the file even more; I currently have 5x 10GB files. – DΦC__WTF Dec 31 '19 at 14:52
  • How easy would it be to manually read parts of the .npy file using the range header mentioned in your thread? The data structure in the .npy file is [10000:50:50:50:1] and the 10000 dimension can be split into smaller files. – DΦC__WTF Dec 31 '19 at 14:55
  • If you can split the file even more, that would be the best option from my perspective. If you can avoid running out of memory this way, the problem gets solved immediately. To be fair, I don't know how to apply the range header to your use case, but I thought it was a good idea to share the information. – Happy-Monad Jan 03 '20 at 15:50
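On the range-read idea from the last comments: a .npy file is a small header followed by the raw array bytes, so in principle a single batch can be fetched with seeked reads instead of downloading the whole file. A rough sketch, assuming the array is C-ordered, the file uses .npy format version 1.0, and the names filename, batch_size and read_batch are only illustrative:

from tensorflow.python.lib.io import file_io
import numpy as np

filename = 'gs://project/file.npy'   # example path
batch_size = 32                      # example batch size

f = file_io.FileIO(filename, 'rb')   # GFile-style object, supports seek()/read()
np.lib.format.read_magic(f)          # advance past the magic string and version bytes
shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
data_start = f.tell()                # offset where the raw array data begins

# Bytes per sample along the first axis, e.g. shape == (10000, 50, 50, 50, 1).
sample_bytes = int(np.prod(shape[1:])) * dtype.itemsize

def read_batch(i):
    """Fetch samples [i*batch_size, (i+1)*batch_size) without reading the whole file."""
    f.seek(data_start + i * batch_size * sample_bytes)
    raw = f.read(batch_size * sample_bytes)
    return np.frombuffer(raw, dtype=dtype).reshape((-1,) + tuple(shape[1:]))

Each __getitem__ of the Keras Sequence could then call read_batch(i), so only one batch is in memory at a time.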

1 Answer


You can convert the BytesIO object to bytes using b.getvalue():

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode=True))
x = np.load(x_file.getvalue(), mmap_mode='r')

Juancki