
I generated about 500 sharded numpy data files, each of which contains about 10000 data samples (e.g., an image and its label), for example:

    file-000001.npy
    file-000002.npy
    file-000003.npy
    ...
    file-000500.npy

Each .npy file contains a dictionary of numpy arrays whose keys and shapes are {'image': 10000x3x512x64 (dtype=np.float32), 'label': 10000x100 (dtype=np.float32)}. Please note that some of these files contain fewer than 10000 samples, say 8111.
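
For reference (assuming each shard was written with `np.save` on a plain Python dict, which stores it as a pickled 0-d object array), a single shard can be loaded back like this; the filename is just an example:

    import numpy as np

    # allow_pickle is required because the file stores a pickled Python dict,
    # and .item() unwraps the 0-d object array back into that dict.
    shard = np.load('file-000001.npy', allow_pickle=True).item()
    images = shard['image']   # shape (N, 3, 512, 64), N <= 10000
    labels = shard['label']   # shape (N, 100)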

During training, each epoch needs to iterate over all 500x10000 samples. The data cannot all be loaded into memory at once due to capacity limits, so a common solution is a data prefetching queue.

My idea is as follows: (1) first record all the filenames and the number of data samples in each file, then (2) for each batch, compute the batch indices and determine which data files need to be loaded into memory to read the samples that compose the batch.
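
Here is a minimal sketch of what I have in mind for steps (1) and (2); `build_index` and `sample_location` are just illustrative helper names, and the sample counts are taken from the `'label'` array of each shard:

    import glob
    import numpy as np

    def build_index(pattern='file-*.npy'):
        """Step (1): record every filename and its sample count."""
        files = sorted(glob.glob(pattern))
        counts = []
        for f in files:
            shard = np.load(f, allow_pickle=True).item()
            counts.append(len(shard['label']))
        # cumulative[i] = total number of samples in files[0..i-1]
        cumulative = np.concatenate([[0], np.cumsum(counts)])
        return files, cumulative

    def sample_location(global_idx, files, cumulative):
        """Step (2): map a global sample index to (filename, local index)."""
        file_idx = int(np.searchsorted(cumulative, global_idx, side='right')) - 1
        return files[file_idx], int(global_idx - cumulative[file_idx])

    # e.g. files, cumulative = build_index()
    #      sample_location(123456, files, cumulative)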

During step (2), if we set the batch size to 256, it is possible that we would need to open 256 different files and read just one sample from each of them to compose the batch. This might be slow and impractical.

With the queue-based approach, the data loading would run on background threads, and all loaded batches would be stored in the queue (whose capacity could be large, depending on the available memory). The background threads would keep reading data to fill the queue whenever it has free space.
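
A simple version of such a queue can be built with Python's standard `threading` and `queue` modules. The sketch below is generic: it assumes some `make_batches` generator (hypothetical) that yields ready `(image, label)` numpy batches, for example by grouping batch indices by file as in the sketch above:

    import queue
    import threading

    class BatchPrefetcher:
        """Run a batch generator on a background thread and keep up to
        `capacity` ready batches buffered in a bounded queue."""

        def __init__(self, batch_generator, capacity=32):
            self._queue = queue.Queue(maxsize=capacity)
            self._sentinel = object()          # marks the end of the generator
            self._thread = threading.Thread(
                target=self._fill, args=(batch_generator,), daemon=True)
            self._thread.start()

        def _fill(self, batch_generator):
            for batch in batch_generator:
                self._queue.put(batch)         # blocks while the queue is full
            self._queue.put(self._sentinel)

        def __iter__(self):
            while True:
                batch = self._queue.get()
                if batch is self._sentinel:
                    return
                yield batch

    # Usage (make_batches is a hypothetical generator of (images, labels)):
    # for images, labels in BatchPrefetcher(make_batches(), capacity=32):
    #     train_step(images, labels)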

Is this hard to implement? I've searched on Google, and it seems there are some more advanced solutions such as caching techniques or mmap, but I'm not familiar with them. Are there any simple examples of this?
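
As far as I understand, numpy's own mmap support is `np.load(..., mmap_mode='r')`, which maps the file instead of reading it all into memory, so slicing only pulls the requested rows from disk. It only works on plain arrays, not on the pickled dicts above, so the shards would have to be re-saved as separate image/label arrays (the filename below is hypothetical):

    import numpy as np

    # Memory-map the array; nothing is read until it is sliced.
    images = np.load('file-000001-image.npy', mmap_mode='r')
    batch = np.array(images[100:356])   # copies only these 256 samples into RAM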

mining
  • Check similar example: https://stackoverflow.com/questions/45427637/is-there-a-more-simple-way-to-handle-batch-inputs-from-tfrecords/45428167#45428167 – Vijay Mariappan Aug 14 '17 at 17:08
  • Hi @vijaym, thanks for the kind comment! I've always found the queue runner to behave strangely (e.g., it returns a different number of samples each time, and it's unclear how the random shuffling is actually done inside it), and since I'm not familiar with the internal mechanisms, it is difficult for me to control the data flow or debug. If I could write a data prefetching queue myself, I could control the data fetching with more confidence. – mining Aug 14 '17 at 17:34
  • @vijaym, another thought is that TensorFlow only supports TFRecords well, and when training the networks we always need to first run `image_val, label_val=sess.run([image_op, label_op])`, where `image_val` and `label_val` are numpy arrays; I don't know why we need to do this. Since we always end up loading the batch data as numpy arrays anyway, why not use pure Python functions and queues? Yes, the `Dataset` APIs are good encapsulations of these things, but I think we could make things simpler. – mining Aug 14 '17 at 17:39
  • @vijaym, just like this: https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/roi_data_layer/layer.py#L161 – mining Aug 14 '17 at 17:40
  • Whatever you want to do can be achieved using the dataset API. Check: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/programmers_guide/datasets.md. And doing `image_val, label_val=sess.run([image_op, label_op])` is the inefficient way, check: https://stackoverflow.com/questions/44862754/tensorflow-using-an-input-pipeline-csv-as-a-dictionary-for-training/44862968#44862968. – Vijay Mariappan Aug 14 '17 at 18:02
  • @vijaym, got it, thanks a lot for your kind suggestions! – mining Aug 14 '17 at 18:26

0 Answers