I generated about 500 sharded numpy data files, each of which contains about 10000 data samples (e.g., an image and its label), for example:
file-000001.npy
file-000002.npy
file-000003.npy
...
file-000500.npy
Each `.npy` file contains a numpy dictionary whose keys and shapes are `{'image': 10000x3x512x64 (dtype=np.float32), 'label': 10000x100 (dtype=np.float32)}`. Please note that some of these numpy files contain fewer than 10000 samples, say 8111, etc.
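For reference, here is roughly how one shard is read back, assuming each dictionary was stored with `np.save` (a pickled dict needs `allow_pickle=True` and `.item()` to unpack):

```python
import numpy as np

# Minimal sketch: load one shard back into a plain Python dict.
shard = np.load("file-000001.npy", allow_pickle=True).item()
images = shard["image"]   # shape (N, 3, 512, 64), dtype float32, N <= 10000
labels = shard["label"]   # shape (N, 100), dtype float32
print(images.shape, labels.shape)
```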
During training, for each epoch, we need to iterate over all 500x10000 samples. This data cannot be loaded into memory at once due to capacity limits, so a common solution is a data prefetching queue.
My thought is as follows: (1) first record all the filenames and the number of data samples in each file; (2) for each batch, compute the batch indices, then determine which data files need to be loaded into memory to read the samples that compose the batch.
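Here is a minimal sketch of that bookkeeping, assuming the shards are scanned once up front to get the per-file counts (`filenames`, `counts`, and `locate` are just illustrative names):

```python
import numpy as np

filenames = [f"file-{i:06d}.npy" for i in range(1, 501)]

# Step (1): record the number of samples in every file (one scan over all shards).
counts = []
for fn in filenames:
    shard = np.load(fn, allow_pickle=True).item()
    counts.append(len(shard["label"]))

offsets = np.cumsum([0] + counts)   # offsets[i] = global index of the first sample in shard i
total = int(offsets[-1])            # total number of samples across all shards

# Step (2): map a global sample index to (file index, index within that file).
def locate(global_idx):
    file_idx = int(np.searchsorted(offsets, global_idx, side="right")) - 1
    return file_idx, int(global_idx - offsets[file_idx])

print(total, locate(123456))
```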
During step (2), if we set the batch size to 256, in the worst case we would need to open 256 different files and read just one sample from each of them to compose the batch. This would be slow and impractical.
With the queue approach, data loading would run on background threads, and all loaded batches would be saved in the queue (whose capacity could be a large number depending on the available memory). The background threads would keep reading data to fill the queue whenever it has free space.
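Here is a minimal sketch of such a prefetching queue using Python's `threading` and `queue` modules, assuming (to keep it simple) that each batch is composed within a single shard, so a batch never touches more than one file:

```python
import queue
import threading
import numpy as np

filenames = [f"file-{i:06d}.npy" for i in range(1, 501)]
batch_queue = queue.Queue(maxsize=32)   # capacity bounded by available memory

def producer(batch_size=256):
    # Background thread: keep filling the queue; put() blocks while the queue is full.
    for fn in filenames:
        shard = np.load(fn, allow_pickle=True).item()
        images, labels = shard["image"], shard["label"]
        for start in range(0, len(labels), batch_size):
            batch_queue.put((images[start:start + batch_size],
                             labels[start:start + batch_size]))
    batch_queue.put(None)               # sentinel: end of epoch

threading.Thread(target=producer, daemon=True).start()

# Training loop (consumer): get() blocks only when the queue is empty.
while True:
    batch = batch_queue.get()
    if batch is None:
        break
    images, labels = batch
    # ... run one training step on (images, labels) ...
```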
Is this hard to implement? I've searched on Google, and it seems there are some more advanced solutions, such as caching techniques or `mmap`, but I'm not familiar with them. Are there any simple examples of this?
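From what I can tell, memory-mapping with numpy would look roughly like the sketch below, though I'm not sure about it. Note that `np.load(..., mmap_mode='r')` only works for plain arrays, not pickled dicts, so this assumes the images and labels were saved as separate `.npy` arrays (the filename here is hypothetical):

```python
import numpy as np

# Memory-map the array: nothing is read from disk until you slice it.
images = np.load("file-000001-image.npy", mmap_mode="r")
batch = np.asarray(images[100:356])   # copies only these 256 samples into RAM
```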