
I have a very large dataset stored as a single .npy file containing around 1.5 million elements, each a 150x150x3 image. The output has 51 columns (51 outputs). Since the dataset can't fit into memory, how do I load it and use it to fit the model? An efficient way would be to use TFRecords and tf.data, but I couldn't work out how to do this. I would appreciate any help. Thank you.

Amin Marshal
  • What does _I couldn’t understand how to do this_ mean? Can you share your attempts? – AMC Dec 05 '19 at 20:47
  • @AlexanderCécile yeah sure, the idea is to convert the large dataset into a TensorFlow compatible format, TFRecord, and then use the tf.data API to read this tfrecord file to feed it to the neural network. I tried various approaches but failed to do it – Amin Marshal Dec 05 '19 at 21:11

1 Answer


One way is to read your NPY file fragment by fragment as you feed the neural network, rather than loading it into memory all at once. You can use numpy.load as normal and set the mmap_mode keyword (for example, mmap_mode='r') so that the array is kept on disk and only the parts you access are read into memory (more details here):

numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.
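As a rough illustration of the memory-mapping approach, here is a minimal sketch that opens an .npy file with mmap_mode='r' and copies one batch at a time into RAM. The file name, the small stand-in array, the batch size, and the iter_batches helper are all assumptions made for the example, not part of the question's setup.

```python
import os
import tempfile

import numpy as np

# Hypothetical small stand-in for the real file: 20 images of 150x150x3.
path = os.path.join(tempfile.mkdtemp(), "images.npy")
np.save(path, np.zeros((20, 150, 150, 3), dtype=np.uint8))

# mmap_mode="r" maps the file read-only; the pixel data stays on disk.
images = np.load(path, mmap_mode="r")


def iter_batches(array, batch_size):
    """Yield consecutive batches, copying only one batch into RAM at a time."""
    for start in range(0, array.shape[0], batch_size):
        # np.array(...) forces a real in-memory copy of just this slice.
        yield np.array(array[start:start + batch_size])


# Hypothetical batch size of 8; the last batch is smaller.
batch_shapes = [batch.shape for batch in iter_batches(images, 8)]
print(batch_shapes)
```

A generator like this could be passed to Keras directly or wrapped with tf.data.Dataset.from_generator; which is preferable depends on how much shuffling and prefetching the training loop needs.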

If you want to know how to create a TFRecords file from a NumPy array and then read it back using the Dataset API, this link provides a good answer.
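For reference, a minimal sketch of the TFRecords route might look like the following. It is not the code from the linked answer. It assumes the images and the 51 targets live in two separate .npy files (x.npy and y.npy are hypothetical names), uses a tiny random stand-in dataset, and stores each image as raw bytes so that nothing has to be flattened by hand. Writing one record at a time from a memory-mapped array keeps RAM use low.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical stand-in data: 10 images with 51 float targets each.
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "x.npy")
y_path = os.path.join(tmp_dir, "y.npy")
np.save(x_path, np.random.randint(0, 256, (10, 150, 150, 3), dtype=np.uint8))
np.save(y_path, np.random.rand(10, 51).astype(np.float32))

# Memory-map both arrays so only the current row is read from disk.
x = np.load(x_path, mmap_mode="r")
y = np.load(y_path, mmap_mode="r")

record_path = os.path.join(tmp_dir, "data.tfrecord")
with tf.io.TFRecordWriter(record_path) as writer:
    for i in range(x.shape[0]):
        feature = {
            # Raw image bytes; the shape is restored when parsing.
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[x[i].tobytes()])),
            "target": tf.train.Feature(
                float_list=tf.train.FloatList(value=y[i].tolist())),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())


def parse(serialized):
    """Decode one serialized Example back into an (image, target) pair."""
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([51], tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    image = tf.io.decode_raw(parsed["image"], tf.uint8)
    return tf.reshape(image, (150, 150, 3)), parsed["target"]


# Batch size of 4 is an arbitrary choice for this small example.
dataset = (tf.data.TFRecordDataset(record_path)
           .map(parse)
           .batch(4)
           .prefetch(tf.data.AUTOTUNE))

first_images, first_targets = next(iter(dataset))
print(first_images.shape, first_targets.shape)
```

The resulting dataset can be passed to model.fit. For 1.5 million images it would probably make sense to split the output across several TFRecord files, but that is a design choice outside this sketch.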

aminrd
  • Thanks a lot, I will try both methods – Amin Marshal Dec 05 '19 at 21:11
  • I have some questions about the link you provided for TFRecords. Why was X flattened? My NumPy arrays are image arrays and I have 51 outputs for y. Do I also need to flatten them? Moreover, when I try this code, RAM usage goes as high as 90% (I have 32 GB of RAM) and the program crashes. Can you identify the problem? – Amin Marshal Dec 06 '19 at 17:36