I have a very large dataset stored in a single .npy file: around 1.5 million elements, each a 150x150x3 image. The output has 51 columns (51 outputs). Since the dataset can't fit into memory, how do I load it and use it to fit the model? An efficient way would be to use TFRecords and tf.data, but I couldn't figure out how to do this. I would appreciate the help. Thank you.
- What does _I couldn’t understand how to do this_ mean? Can you share your attempts? – AMC Dec 05 '19 at 20:47
- @AlexanderCécile Yeah sure, the idea is to convert the large dataset into a TensorFlow-compatible format, TFRecord, and then use the tf.data API to read this TFRecord file and feed it to the neural network. I tried various approaches but couldn't get any of them to work. – Amin Marshal Dec 05 '19 at 21:11
1 Answer
One way is to load your NPY file fragment by fragment (to feed your neural network with) rather than loading it into memory all at once. You can use numpy.load as usual and specify the mmap_mode keyword so that the array is kept on disk, and only the necessary bits are loaded into memory upon access (more details here):
numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.
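For example, here is a minimal sketch of feeding batches from a memory-mapped array through tf.data. The file names (images.npy, labels.npy), the batch size, and the assumption that the 51-column labels fit in RAM are all illustrative, not part of the original question:

```python
import numpy as np
import tensorflow as tf

# Illustrative file names -- adjust to your actual paths.
X = np.load("images.npy", mmap_mode="r")   # (N, 150, 150, 3), stays on disk
y = np.load("labels.npy")                  # (N, 51), assumed small enough for RAM

def batches(batch_size=32):
    # Slicing a memmap only reads the requested rows from disk.
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        yield (np.asarray(X[start:stop], dtype=np.float32),
               y[start:stop].astype(np.float32))

dataset = tf.data.Dataset.from_generator(
    batches,
    output_signature=(
        tf.TensorSpec(shape=(None, 150, 150, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 51), dtype=tf.float32),
    ),
).prefetch(tf.data.AUTOTUNE)

# model.fit(dataset, epochs=...)
```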
If you want to know how to create a tfrecords file from a numpy array, and then read the tfrecords using the Dataset API, this link provides a good answer.
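Below is a rough sketch of that approach: writing each (image, label) pair to a TFRecord file and reading it back with tf.data. The file names, dtypes, and feature keys ("image", "label") are assumptions for illustration, not taken from the linked answer:

```python
import numpy as np
import tensorflow as tf

# Sketch only: file names, feature keys, and the uint8 image dtype are assumptions.
X = np.load("images.npy", mmap_mode="r")   # (N, 150, 150, 3)
y = np.load("labels.npy")                  # (N, 51)

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Write one Example per sample; the memmap keeps RAM usage low while iterating.
with tf.io.TFRecordWriter("data.tfrecord") as writer:
    for i in range(len(X)):
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": _bytes_feature(np.asarray(X[i], dtype=np.uint8).tobytes()),
            "label": _bytes_feature(np.asarray(y[i], dtype=np.float32).tobytes()),
        }))
        writer.write(example.SerializeToString())

# Read the records back and rebuild the original shapes.
def parse(record):
    parsed = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.string),
    })
    image = tf.reshape(tf.io.decode_raw(parsed["image"], tf.uint8), (150, 150, 3))
    label = tf.reshape(tf.io.decode_raw(parsed["label"], tf.float32), (51,))
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (tf.data.TFRecordDataset("data.tfrecord")
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, epochs=...)
```

In this sketch, flattening is only a serialization detail: the bytes are reshaped back to (150, 150, 3) and (51,) in the parse function. Writing the records element by element from the memmap, rather than converting the whole array first, is what keeps memory usage bounded.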

– aminrd
- I have some questions about the link you provided for TFRecords. Why was X flattened? My numpy arrays are image arrays and I have 51 outputs for y. Do I also need to flatten them? Moreover, when I try this code, RAM usage goes as high as 90% (I have 32 GB of RAM) and the program crashes. Can you identify the problem? – Amin Marshal Dec 06 '19 at 17:36