Handling a large NumPy array in TensorFlow with regression output (51 outputs)

Question

I have a very large dataset stored as a single .npy file that contains around 1.5 million elements, each a 150x150x3 image. The output has 51 columns (51 outputs). Since the dataset can't fit into memory, how do I load it and use it to fit the model? I understand that an efficient way is to use TFRecords and tf.data, but I couldn't work out how to do this. I would appreciate the help. Thank you.

What does _I couldn’t understand how to do this_ mean? Can you share your attempts? — AMC, Dec 05 '19 at 20:47
@AlexanderCécile yeah sure, the idea is to convert the large dataset into a TensorFlow compatible format, TFRecord, and then use the tf.data API to read this tfrecord file to feed it to the neural network. I tried various approaches but failed to do it — Amin Marshal, Dec 05 '19 at 21:11

aminrd · Accepted Answer · 2019-12-05T22:08:32.830

3

One way is to load your .npy file fragment by fragment (to feed your neural network) rather than loading it into memory all at once. You can call numpy.load as usual and set the mmap_mode keyword so that the array stays on disk and only the parts you access are read into memory (see the NumPy documentation for details).

numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.
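As a minimal sketch of the memory-mapped approach described above, the generator below reads one batch at a time from disk. The file names, the helper name iter_batches, and the assumption that the 51 targets live in a second .npy file are all hypothetical, since the question does not say how the targets are stored. The demo uses 10 tiny random samples in place of the real 1.5 million.

```python
import os
import tempfile

import numpy as np


def iter_batches(x_path, y_path, batch_size):
    """Yield (images, targets) batches without loading the whole files."""
    # mmap_mode="r" keeps the arrays on disk; data is read only when sliced.
    x = np.load(x_path, mmap_mode="r")
    y = np.load(y_path, mmap_mode="r")
    for start in range(0, x.shape[0], batch_size):
        stop = start + batch_size
        # np.array(...) copies just this slice into RAM as float32.
        yield (np.array(x[start:stop], dtype=np.float32),
               np.array(y[start:stop], dtype=np.float32))


# Stand-in data (assumption: images and targets are in separate .npy files).
tmp_dir = tempfile.mkdtemp()
x_path = os.path.join(tmp_dir, "images.npy")
y_path = os.path.join(tmp_dir, "targets.npy")
np.save(x_path, np.random.randint(0, 256, size=(10, 150, 150, 3), dtype=np.uint8))
np.save(y_path, np.random.rand(10, 51).astype(np.float32))

batch_shapes = [(xb.shape, yb.shape) for xb, yb in iter_batches(x_path, y_path, 4)]
print(batch_shapes)
```

A generator like this can, in principle, be wrapped with tf.data.Dataset.from_generator so that Keras pulls batches from it during fit; reading in file order also keeps disk access sequential, at the cost of no shuffling.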

If you want to know how to create a TFRecords file from a NumPy array and then read it back using the Dataset API, this link provides a good answer.
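For readers without access to the linked answer, here is one possible sketch of the TFRecord route; it is not necessarily the method used there. It assumes uint8 images of shape 150x150x3, float32 targets with 51 values, and TensorFlow 2.4 or later (for tf.data.AUTOTUNE). The helper names serialize_example, write_tfrecord, parse_example, and make_dataset are hypothetical.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

IMAGE_SHAPE = (150, 150, 3)  # assumed image shape, from the question
NUM_OUTPUTS = 51             # assumed number of regression targets


def serialize_example(image, target):
    """Pack one uint8 image and its float targets into a serialized Example."""
    feature = {
        # Raw bytes keep the image compact; the shape is restored on read.
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "target": tf.train.Feature(
            float_list=tf.train.FloatList(value=target.tolist())),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_tfrecord(images, targets, path):
    """Write samples one at a time, so `images` may be a memory-mapped array."""
    with tf.io.TFRecordWriter(path) as writer:
        for image, target in zip(images, targets):
            writer.write(serialize_example(image, target))


def parse_example(record):
    """Decode one serialized Example into (float image in [0, 1], targets)."""
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([NUM_OUTPUTS], tf.float32),
    }
    parsed = tf.io.parse_single_example(record, spec)
    image = tf.reshape(tf.io.decode_raw(parsed["image"], tf.uint8), IMAGE_SHAPE)
    return tf.cast(image, tf.float32) / 255.0, parsed["target"]


def make_dataset(path, batch_size):
    """Stream batches from the TFRecord file instead of holding it in RAM."""
    return (tf.data.TFRecordDataset(path)
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))


# Tiny random stand-in data for demonstration only.
images = np.random.randint(0, 256, size=(6, *IMAGE_SHAPE), dtype=np.uint8)
targets = np.random.rand(6, NUM_OUTPUTS).astype(np.float32)
record_path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")
write_tfrecord(images, targets, record_path)

x_batch, y_batch = next(iter(make_dataset(record_path, batch_size=4)))
print(x_batch.shape, y_batch.shape)
```

In this sketch the image is stored as raw bytes and reshaped when parsed, so no manual flattening is needed, and the 51 targets are stored as a float list. Because the writer handles one sample at a time, the source array can be opened with mmap_mode="r" rather than loaded fully. The resulting dataset can be passed directly to model.fit.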

edited Dec 05 '19 at 22:08

answered Dec 05 '19 at 20:55


Thanks a lot, I will try both methods – Amin Marshal Dec 05 '19 at 21:11

I have some questions about the link you provided for TFRecords. Why was X flattened? My NumPy arrays are image arrays and I have 51 outputs for y. Do I also need to flatten them? Moreover, when I try this code, RAM usage goes as high as 90% (I have 32 GB of RAM) and the program crashes. Can you identify the problem? – Amin Marshal Dec 06 '19 at 17:36
