
I want to load a .npy file from Google Storage (gs://project/file.npy) into my Google ML job as training data. Since the file is more than 10 GB, I want to use the mmap_mode option of numpy.load() so I don't run out of memory.

Background: I use Keras with fit_generator and a Keras Sequence to load batches of data from the .npy file stored on Google Storage.

To access Google Storage I'm using BytesIO, since not every library can access Google Storage directly. This code works fine without mmap_mode='r':

from tensorflow.python.lib.io import file_io
from io import BytesIO
import numpy as np

filename = 'gs://project/file'

# Read the whole file from Google Storage into memory, then load it as an array.
x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode=True))
x = np.load(x_file)

If I activate mmap_mode, I get this error:

TypeError: expected str, bytes or os.PathLike object, not BytesIO

I don't understand why it no longer accepts the BytesIO object.

Code including mmap_mode:

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode=True))
x = np.load(x_file, mmap_mode='r')

Trace:

File "[...]/numpy/lib/npyio.py", line 444, in load return format.open_memmap(file, mode=mmap_mode) File "[...]/numpy/lib/format.py", line 829, in open_memmap fp = open(os_fspath(filename), 'rb') File "[...]/numpy/compat/py3k.py", line 237, in os_fspath "not " + path_type.name) TypeError: expected str, bytes or os.PathLike object, not BytesIO

DΦC__WTF
  • Look at the docs (or code) of `np.lib.npyio.format.open_memmap`. It says `The name of the file on disk. This may *not* be a file-like object`. After dealing with the `save/load` header, this code uses `np.memmap`, so it is limited to what that can handle. – hpaulj Dec 30 '19 at 18:04
  • In memmap mode, the file is accessed 'randomly'. In ordinary load, access is sequential - one byte after another without any backtracking or seeking. – hpaulj Dec 30 '19 at 18:11
  • Isn't the biggest issue that TensorFlow's file_io.read_file_to_string() reads the whole file? So this won't work with memmap anyway, right? – DΦC__WTF Dec 30 '19 at 20:51
  • @hpaulj do you have any idea which lib I could use instead? I also don't quite follow your comment about random vs. sequential access – DΦC__WTF Dec 30 '19 at 20:52
  • Would it be possible to split your dataset so you can upload and process it stepwise? – Happy-Monad Dec 31 '19 at 13:13
  • Also, could you check this [thread](https://stackoverflow.com/questions/14248333/google-cloud-storage-seeking-within-files) to see if the idea there fits your needs? – Happy-Monad Dec 31 '19 at 13:21
  • Yes, I can split the file even more; I currently have 5x 10GB files. – DΦC__WTF Dec 31 '19 at 14:52
  • How easy would it be to manually read parts of the .npy file using the range header mentioned in your thread? The data structure in the .npy file is [10000:50:50:50:1] and the 10000 dimension can be split into smaller files. – DΦC__WTF Dec 31 '19 at 14:55
  • If you can split the file even more, that would be the best option from my perspective. If you can avoid running out of memory this way, the problem gets solved immediately. To be fair, I don't know how to apply the range header to your use case, but I thought it was a good idea to share the information. – Happy-Monad Jan 03 '20 at 15:50
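On the range-read idea from the last comments: a .npy file is a small header followed by the raw array bytes, so in principle a single batch can be fetched with seeked reads instead of downloading the whole file. A rough sketch, assuming the array is C-ordered, the file uses .npy format version 1.0, and the names filename, batch_size and read_batch are only illustrative:

from tensorflow.python.lib.io import file_io
import numpy as np

filename = 'gs://project/file.npy'   # example path
batch_size = 32                      # example batch size

f = file_io.FileIO(filename, 'rb')   # GFile-style object, supports seek()/read()
np.lib.format.read_magic(f)          # advance past the magic string and version bytes
shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
data_start = f.tell()                # offset where the raw array data begins

# Bytes per sample along the first axis, e.g. shape == (10000, 50, 50, 50, 1).
sample_bytes = int(np.prod(shape[1:])) * dtype.itemsize

def read_batch(i):
    """Fetch samples [i*batch_size, (i+1)*batch_size) without reading the whole file."""
    f.seek(data_start + i * batch_size * sample_bytes)
    raw = f.read(batch_size * sample_bytes)
    return np.frombuffer(raw, dtype=dtype).reshape((-1,) + tuple(shape[1:]))

Each __getitem__ of the Keras Sequence could then call read_batch(i), so only one batch is in memory at a time.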

1 Answer


You can convert the BytesIO object to bytes using b.getvalue():

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode=True))
x = np.load(x_file.getvalue(), mmap_mode='r')

Juancki