My mxnet script is likely limited by i/o of data loading into the GPU, and I am trying to speed this up by prefetching. The trouble is I can't figure out how to prefetch with a custom data iterator.
My first hypothesis/hope was that it would be enough to set the values of self.preprocess_threads and self.prefetch_buffer, as I had seen here for iterators such as mxnet.io.ImageRecordUInt8Iter
. However, when I did this I saw no performance change relative to the script before I had set these variables, so clearly setting these did not work.
Then I noticed, the existence of a class mx.io.PrefetchingIter
in addition to the base class for which I had implemented a child class mx.io.DataIter
. I found this documentation, but I have not been able to find any examples, and I am a little confused about what needs to happen where/when. However, I am not clear on how to use this. For example. I see that in addition to next()
it has an iter_next()
method, which simply says "move to the next batch". What does this mean exactly? What does it mean to "move" to the next batch without producing it? I found the source code for this class, and based on a brief reading, it seems as though it takes multiple iterators and creates one thread per iterator. This likely would not work for my current design, as I really want multiple threads used to prefetch from the same iterator.
Here is what I am trying to do via a custom data iterator
- I maintain a global
multiprocessing.Queue
on which I pop data as it becomes available - I produce that data by running (via
multiprocessing
) a command line script that executes a c++ binary which produces anumpy
file - I open the
numpy
file and load its contents into memory, process them, and put the processed bits on the globalmultiprocessing.Queue
- My custom iterator pulls off this queue and also kicks off more jobs to produce more data when the queue is empty.
Here is my code:
def launchJobForDate(date_str):
### this is a function that gets called via multiprocessing
### to produce new data by calling a c++ binary
### whenever data queue is empty so that we need to produce more data
try:
f = "testdata/data%s.npy"%date_str
if not os.path.isfile(f):
cmd = CMD % ( date_str, JSON_FILE, date_str, date_str, date_str)
while True:
try:
output = subprocess.check_output(cmd, shell=True)
break
except:
pass
while True:
try:
d = np.load(f)
break
except:
pass
data_queue.put((d, date_str))
except Exception as ex:
print("launchJobForDate: ERROR ", ex)
class ProduceDataIter(mx.io.DataIter):
@staticmethod
def processData(d, time_steps, num_inputs):
try:
...processes data...
return [z for z in zip(bigX, bigY, bigEvalY, dates)]
except Exception as ex:
print("processData: ERROR ", ex)
def __init__(self, num_mgrs, end_date_str):
## iter stuff
self.preprocess_threads = 4
self.prefetch_buffer = 1
## set up internal data to preserve state
## and make a list of dates for which to run binary
@property
def provide_data(self):
return [mx.io.DataDesc(name='seq_var',
shape=(args_batch_size * GPU_COUNT,
self.time_steps,
self.num_inputs),
layout='NTC')]
@property
def provide_label(self):
return [mx.io.DataDesc(name='bd_return',
shape=(args_batch_size * GPU_COUNT)),
mx.io.DataDesc(name='bd_return',
shape=(args_batch_size * GPU_COUNT, num_y_cols)),
mx.io.DataDesc(name='date',
shape=(args_batch_size * GPU_COUNT))]
def __next__(self):
try:
z = self.z.pop(0)
data = z[0:1]
label = z[1:]
return mx.io.DataBatch(data, label)
except Exception as ex:
### if self.z (a list) has no elements to pop we need
### to get more data off the queue, process it, and put it
### on self.x so it's ready for calls to __next__()
while True:
try:
d = data_queue.get_nowait()
processedData = ProduceDataIter.processData(d,
self.time_steps,
self.num_inputs)
self.z.extend(processedData)
counter_queue.put(counter_queue.get() - 1)
z = self.z.pop(0)
data = z[0:1]
label = z[1:]
return mx.io.DataBatch(data, label)
except queue.Empty:
...this is where new jobs to produce new data and put them
...on the queue would happen if nothing is left on the queue
I have then tried making one of these iterators as well as a prefetch iterator like so:
mgr = ProcessMgr(2, end_date_str)
mgrOuter = mx.io.PrefetchingIter([mgr])
The problem is that mgrOuter
immediately throws a StopIteration
as soon as __next__()
is called the first time, and without invoking mgr.__next__()
as I thought it might.
Finally, I also noticed that gluon
has a DataLoader
object which seems like it might handle prefetching, however in this case it also seems to assume that the underlying data is from a Dataset
which has a finite and unchanging layout (based on the fact that it is implemented in terms of getitem
, which takes an index). So I have not pursued this option as it seem unpromising given the dynamic queue-like nature of the data I am generating as training input.
My questions are:
- How do I need to modify my code above so that there will be prefetching for my custom iterator?
- Where might I find an example or more detailed documentation of how mx.io.PrefetchingIter works?
- Are there other strategies I should be aware of for getting more performance out of my GPUs via a custom iterator? Right now they are only operating at around 50% capacity, and upping (or lowering) the batch size doesn't change this. What other knobs might I be able to turn to increase GPU use efficiency?
Thanks for any feedback and advice.