0

I would like to feed a network many training images that are sampled from a dataset according to certain sampling rules. I have two options:

  1. Use the sampling logic to generate a list of images offline, then convert the .lst file to a .rec file and use a sequential DataIter to access it.

  2. Write my own child class of DataIter that can sample the images online. As a result, the class needs to support random access, maybe by inheriting from MXIndexedRecordIO. I will need to create a .rec file for the original dataset.
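A minimal sketch of option 1, assuming the tab-separated .lst layout (index, label, image path) used by MXNet's im2rec tool. The names `sample_indices` and `write_lst` are hypothetical, and uniform sampling with replacement only stands in for whatever the real sampling rules are:

```python
import random


def sample_indices(num_images, num_samples, seed=0):
    """Placeholder sampling rule: draw indices uniformly with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(num_images) for _ in range(num_samples)]


def write_lst(path, image_paths, labels, sampled):
    """Write one tab-separated line per sample: index, label, image path."""
    with open(path, "w") as f:
        for out_idx, src_idx in enumerate(sampled):
            # The output index is the position in the sampled sequence,
            # so the packed file preserves the sampling order.
            f.write("%d\t%f\t%s\n" % (out_idx, labels[src_idx], image_paths[src_idx]))
```

The resulting list can then be packed into a .rec file, which a sequential iterator reads in exactly the sampled order.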

My intuition tells me that sequential access will be faster than random access for a .rec file. But I don't know whether the difference is big enough to be worth the additional time I would spend writing and testing my own iterator class. Could anyone give me a hint on this?

J.Doe
  • If you are using a random access iterator, you probably want to shuffle the input to avoid repeating the same data. Once you do that, you are back to sequential access, which gives you options for optimization. – Guy Jul 15 '17 at 23:23
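One way to read the comment above, as a sketch only: draw a permutation of the record indices so that no sample repeats, then use that permutation as the order in which records are written or read. `epoch_order` is a hypothetical helper, not part of MXNet:

```python
import random


def epoch_order(num_records, epoch, seed=0):
    """Return a permutation of record indices for the given epoch."""
    order = list(range(num_records))
    # Seeding with seed + epoch makes each epoch's order reproducible.
    random.Random(seed + epoch).shuffle(order)
    return order
```

Each index appears exactly once in the returned list, so writing the records in that order gives a file that can be read front to back.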

2 Answers

0

Since this is a question about performance, I guess it depends on how fast your network can process images, which in turn depends on the hardware you are training on. If the network is the bottleneck, the difference in read speed will matter much less.
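A rough way to measure this on your own machine is sketched below. It does not use the .rec format: it writes a toy file of length-prefixed records, which is an assumption made purely for illustration, and times reading them in order versus in shuffled order. All function names are hypothetical.

```python
import os
import random
import struct
import tempfile
import time


def write_records(path, payloads):
    """Write length-prefixed records and return each record's byte offset."""
    offsets = []
    with open(path, "wb") as f:
        for payload in payloads:
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)
    return offsets


def read_at(f, offset):
    """Seek to a record's offset and return its payload."""
    f.seek(offset)
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)


def time_reads(path, offsets):
    """Return the seconds spent reading every record in the given order."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for offset in offsets:
            read_at(f, offset)
    return time.perf_counter() - start


def compare(num_records=1000, record_size=20_000, seed=0):
    """Time in-order versus shuffled reads of the same toy file."""
    path = os.path.join(tempfile.mkdtemp(), "toy.bin")
    offsets = write_records(path, [os.urandom(record_size)] * num_records)
    shuffled = list(offsets)
    random.Random(seed).shuffle(shuffled)
    return time_reads(path, offsets), time_reads(path, shuffled)
```

Calling `compare()` returns the two timings in seconds. A freshly written file usually sits in the operating system's page cache, so this probably understates the gap you would see on a cold disk with a dataset larger than memory. Compare both numbers against the time your network needs per batch.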

0

In your case you are better off prepacking the images using MXRecordIO. It will give you a performance boost and also make the way you handle the dataset more consistent.

It stores the images in a .rec file as a list, where order matters.

You can then use mxnet.image.ImageIter to iterate over the .rec file in order.

http://mxnet.io/api/python/io.html#mxnet.image.ImageIter
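A minimal sketch of that last step, assuming MXNet is installed and the .rec file already exists. The helper name `make_rec_iter`, the batch size, and the data shape are placeholder assumptions, so adjust them to your dataset:

```python
def make_rec_iter(rec_path, batch_size=32, data_shape=(3, 224, 224)):
    """Build an iterator that reads a packed .rec file front to back."""
    import mxnet as mx  # imported lazily; requires an MXNet installation

    return mx.image.ImageIter(
        batch_size=batch_size,
        data_shape=data_shape,  # (channels, height, width); assumed values
        path_imgrec=rec_path,   # path to your packed .rec file
        shuffle=False,          # keep the order in which the list was built
    )
```

With `shuffle=False` the batches follow the order of the records in the file, which matters when that order encodes your sampling.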

Stanley Kirdey