I/O performance difference for sequential vs random access with MXNet data iterators?

Question

I would like to feed a network many training images that are sampled from a dataset according to certain sampling rules. I have two choices:

1. Use the sampling logic to generate a list of images offline, then convert the .lst file to a .rec file and use a sequential DataIter to access it.
2. Write my own child class of DataIter that samples the images online. The class would then need to support random access, maybe by inheriting from MXIndexedRecordIO. I would need to create a .rec file for the original dataset.

My intuition tells me that sequential access will be faster than random access for a .rec file. But I don't know if the difference is big enough to be worth the additional time I would spend writing and testing my own iterator class. Could anyone give me a hint on this?
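
For reference, option 2 might look roughly like the sketch below. It assumes only that the reader object exposes a `keys` list and a `read_idx(key)` method, as `mxnet.recordio.MXIndexedRecordIO` does. `OnlineSampler`, `pick_key` and the `FakeReader` stand-in are hypothetical names used for illustration, and a real iterator would still have to decode the images and assemble batches.

```python
import random


class OnlineSampler:
    """Draws raw records at random from an indexed reader.

    `reader` is anything with a `keys` list and a `read_idx(key)` method;
    mxnet.recordio.MXIndexedRecordIO provides both. `pick_key` is a
    hypothetical placeholder for the real sampling rules.
    """

    def __init__(self, reader, pick_key=None, seed=0):
        self.reader = reader
        self.rng = random.Random(seed)
        # Default rule (an assumption): uniform sampling with replacement.
        self.pick_key = pick_key or (lambda keys, rng: rng.choice(keys))

    def next_batch(self, batch_size):
        """Return `batch_size` raw records chosen by the sampling rule."""
        keys = [self.pick_key(self.reader.keys, self.rng) for _ in range(batch_size)]
        return [self.reader.read_idx(k) for k in keys]


class FakeReader:
    """In-memory stand-in so the sketch runs without MXNet installed."""

    def __init__(self, records):
        self.records = dict(records)
        self.keys = list(self.records)

    def read_idx(self, key):
        return self.records[key]


reader = FakeReader({i: b"image-%d" % i for i in range(10)})
sampler = OnlineSampler(reader, seed=42)
batch = sampler.next_batch(4)  # four records, each read by key
```

With MXNet, the reader would presumably be `mx.recordio.MXIndexedRecordIO(idx_path, rec_path, 'r')`, and each returned record could be split into header and image bytes with `mx.recordio.unpack`.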

If you are using a random access iterator, you probably want to shuffle the input to avoid repeating the same data. If you do that, you are back to sequential access, which gives you options for optimization. - Guy, Jul 15 '17 at 23:23

score 0 · Answer 1 · answered Jul 18 '17 at 17:05

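Since this is a question about performance, I guess it depends on how fast your network can process images, which in turn depends on what hardware you are running your training on. If the network rather than the disk is the bottleneck, the extra cost of random access may not show up in training time at all.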
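
One way to get a feel for this on your own hardware is to time both access patterns directly. The sketch below is only an illustration: it uses a toy length-prefixed record file, not MXNet's actual RecordIO format, and the helper names (`write_records`, `read_at`, `time_reads`) are made up for the example.

```python
import os
import random
import struct
import tempfile
import time


def write_records(path, payloads):
    """Write length-prefixed records; return the byte offset of each one."""
    offsets = []
    with open(path, "wb") as f:
        for payload in payloads:
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)
    return offsets


def read_at(f, offset):
    """Seek to `offset` and read the single record stored there."""
    f.seek(offset)
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)


def time_reads(path, offsets):
    """Return (seconds, total_bytes) for reading records in the given order."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        for offset in offsets:
            total += len(read_at(f, offset))
    return time.perf_counter() - start, total


payloads = [os.urandom(50_000) for _ in range(200)]  # about 10 MB of fake "images"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.rec")
    offsets = write_records(path, payloads)
    shuffled = random.Random(0).sample(offsets, len(offsets))  # random visiting order
    seq_time, seq_bytes = time_reads(path, offsets)
    rnd_time, rnd_bytes = time_reads(path, shuffled)

print(f"sequential: {seq_time:.4f}s  random: {rnd_time:.4f}s")
```

A file this small will sit in the OS page cache, so the two timings will likely be close. A meaningful comparison needs your real .rec file, ideally larger than RAM or read with a cold cache, and the result should be compared against the time the network spends on one batch.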

madan jampani

score 0 · Accepted Answer · answered Jul 22 '17 at 03:50

In your case you are better off prepacking the images using MXRecordIO. It will give you a performance boost and also introduce consistency in how you handle the dataset.

It stores the images in a .rec file as an ordered list, so the order in which you pack them is the order in which they are read back.

You can then use mxnet.image.ImageIter to iterate over the .rec file in order.

http://mxnet.io/api/python/io.html#mxnet.image.ImageIter
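
A rough sketch of that workflow, under some assumptions: the sampling rule here is plain uniform sampling with replacement (a placeholder for your own rules), `write_sampled_lst` and `make_sequential_iter` are hypothetical helper names, and the .lst file still has to be converted to a .rec file with MXNet's im2rec.py tool before the second function is of any use.

```python
import os
import random
import tempfile


def write_sampled_lst(image_paths, labels, num_samples, lst_path, seed=0):
    """Sample images offline and write them, in sampled order, to a .lst file.

    Each line is `index<TAB>label<TAB>path`, the layout im2rec.py reads.
    Uniform sampling with replacement stands in for the real sampling rules.
    """
    rng = random.Random(seed)
    picks = [rng.randrange(len(image_paths)) for _ in range(num_samples)]
    with open(lst_path, "w") as f:
        for index, pick in enumerate(picks):
            f.write(f"{index}\t{labels[pick]}\t{image_paths[pick]}\n")
    return picks


def make_sequential_iter(rec_path, batch_size=32, data_shape=(3, 224, 224)):
    """Build an ImageIter that walks the packed .rec file in stored order."""
    import mxnet as mx  # imported lazily so the .lst step runs without MXNet

    return mx.image.ImageIter(
        batch_size=batch_size,
        data_shape=data_shape,
        path_imgrec=rec_path,
        shuffle=False,  # keep the order produced by the offline sampling
    )


# Demo of the offline step only, with made-up file names and labels.
with tempfile.TemporaryDirectory() as tmp:
    lst_path = os.path.join(tmp, "train.lst")
    picks = write_sampled_lst(["a.jpg", "b.jpg", "c.jpg"], [0, 1, 2], 5, lst_path)
    with open(lst_path) as f:
        lines = f.read().splitlines()
```

Because the sampling is baked into the file, changing the sampling rules means regenerating the .lst and repacking the .rec, which is the main trade-off against sampling online.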
