
My MXNet script is likely limited by the I/O of loading data onto the GPU, and I am trying to speed this up by prefetching. The trouble is that I can't figure out how to prefetch with a custom data iterator.

My first hypothesis/hope was that it would be enough to set the values of self.preprocess_threads and self.prefetch_buffer, as I had seen done for iterators such as mxnet.io.ImageRecordUInt8Iter. However, when I did this I saw no performance change relative to the script before setting these variables, so clearly setting them did not work.

Then I noticed the existence of a class mx.io.PrefetchingIter, in addition to the base class mx.io.DataIter from which I had derived my child class. I found the documentation for it, but no examples, and I am unclear on what needs to happen where and when. For example, I see that in addition to next() it has an iter_next() method, which is documented simply as "move to the next batch". What does that mean exactly? What does it mean to "move" to the next batch without producing it? I also found the source code for this class, and based on a brief reading it seems to take multiple iterators and create one thread per iterator. That likely would not work for my current design, as I really want multiple threads used to prefetch from the same iterator.
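To make sure I'm not misunderstanding the concept, here is my mental model of what a prefetching wrapper should do, as a minimal plain-Python sketch (the class and names here are mine, not MXNet's actual implementation; a single background thread fills a bounded buffer while the consumer drains it):

```python
import queue
import threading

class PrefetchWrapper:
    """Wrap any iterator and fetch its items in a background thread."""
    _DONE = object()  # sentinel marking exhaustion of the wrapped iterator

    def __init__(self, it, buffer_size=2):
        self._q = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(target=self._fill, args=(it,), daemon=True)
        self._thread.start()

    def _fill(self, it):
        for item in it:
            self._q.put(item)      # blocks once the buffer is full
        self._q.put(self._DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._q.get()
        if item is self._DONE:
            raise StopIteration
        return item
```

What I actually want is several such fill threads sharing one source iterator, which is the part I can't see how to get from PrefetchingIter.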

Here is what I am trying to do via a custom data iterator:

  1. I maintain a global multiprocessing.Queue from which I pop data as it becomes available.
  2. I produce that data by running (via multiprocessing) a command-line script that executes a C++ binary, which produces a numpy file.
  3. I open the numpy file, load its contents into memory, process them, and put the processed bits on the global multiprocessing.Queue.
  4. My custom iterator pulls from this queue and also kicks off more jobs to produce more data when the queue is empty.
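In simplified form, that pattern looks like the following (a plain-Python sketch with made-up names, using a thread in place of the multiprocessing job; the join() stands in for blocking until the real job has finished):

```python
import queue
import threading

data_queue = queue.Queue()  # stands in for the global multiprocessing.Queue

def launch_job(job_id):
    # stands in for the c++ binary + np.load + processing step
    data_queue.put([job_id * 10 + k for k in range(3)])

class RefillingIter:
    """Pop items from an internal buffer; refill from the queue, and
    kick off a new producer job whenever the queue runs dry."""
    def __init__(self, num_items):
        self.remaining = num_items
        self.buffer = []
        self.next_job = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.remaining == 0:
            raise StopIteration
        while not self.buffer:
            try:
                self.buffer.extend(data_queue.get_nowait())
            except queue.Empty:
                t = threading.Thread(target=launch_job, args=(self.next_job,))
                self.next_job += 1
                t.start()
                t.join()  # simplification: really the consumer just blocks on the queue
        self.remaining -= 1
        return self.buffer.pop(0)
```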

Here is my code:

def launchJobForDate(date_str):
    ### this gets called via multiprocessing to produce new data
    ### (by calling a c++ binary) whenever the data queue is empty
    try:
        f = "testdata/data%s.npy" % date_str
        if not os.path.isfile(f):
            cmd = CMD % (date_str, JSON_FILE, date_str, date_str, date_str)
            while True:
                try:
                    output = subprocess.check_output(cmd, shell=True)
                    break
                except subprocess.CalledProcessError:
                    pass  # retry until the binary succeeds
        while True:
            try:
                d = np.load(f)
                break
            except (IOError, ValueError):
                pass  # retry until the file is fully written
        data_queue.put((d, date_str))
    except Exception as ex:
        print("launchJobForDate: ERROR ", ex)

class ProduceDataIter(mx.io.DataIter):
    @staticmethod
    def processData(d, time_steps, num_inputs):
        try:
            ...processes data...
            return [z for z in zip(bigX, bigY, bigEvalY, dates)]
        except Exception as ex:
            print("processData: ERROR ", ex)

    def __init__(self, num_mgrs, end_date_str):
        ## iter stuff
        self.preprocess_threads = 4
        self.prefetch_buffer = 1

        ## set up internal data to preserve state
        ## and make a list of dates for which to run binary

    @property
    def provide_data(self):
        return [mx.io.DataDesc(name='seq_var', 
                               shape=(args_batch_size * GPU_COUNT, 
                                      self.time_steps, 
                                      self.num_inputs), 
                               layout='NTC')]

    @property
    def provide_label(self):
        return [mx.io.DataDesc(name='bd_return', 
                                shape=(args_batch_size * GPU_COUNT)),             
                mx.io.DataDesc(name='bd_return', 
                                shape=(args_batch_size * GPU_COUNT, num_y_cols)), 
                mx.io.DataDesc(name='date', 
                               shape=(args_batch_size * GPU_COUNT))]                 


    def __next__(self):
        try:
            z = self.z.pop(0)
            data = z[0:1]
            label = z[1:]
            return mx.io.DataBatch(data, label)
        except IndexError:
            ### if self.z (a list) has no elements to pop we need
            ### to get more data off the queue, process it, and put it
            ### on self.z so it's ready for calls to __next__()
            while True:
                try:
                    d = data_queue.get_nowait()
                    processedData = ProduceDataIter.processData(d,
                                                                self.time_steps,
                                                                self.num_inputs)
                    self.z.extend(processedData)
                    counter_queue.put(counter_queue.get() - 1)

                    z = self.z.pop(0)
                    data = z[0:1]
                    label = z[1:]
                    return mx.io.DataBatch(data, label)

                except queue.Empty:
                    ...this is where new jobs to produce new data and put them
                    ...on the queue would happen if nothing is left on the queue

I have then tried making one of these iterators as well as a prefetch iterator like so:

mgr      = ProduceDataIter(2, end_date_str)
mgrOuter = mx.io.PrefetchingIter([mgr])

The problem is that mgrOuter raises StopIteration as soon as __next__() is called the first time, without ever invoking mgr.__next__() as I thought it would.

Finally, I also noticed that Gluon has a DataLoader object which seems like it might handle prefetching. However, it also seems to assume that the underlying data comes from a Dataset with a finite and unchanging layout (it is implemented in terms of __getitem__, which takes an index). So I have not pursued this option, as it seems unpromising given the dynamic, queue-like nature of the data I am generating as training input.
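For clarity, this is the contract I understand DataLoader to assume, sketched in plain Python with my own illustrative names (not Gluon's actual classes): workers may fetch dataset[i] for arbitrary i, which my queue of freshly generated data cannot offer.

```python
class StaticDataset:
    """A Dataset in the Gluon sense: fixed length, random access by index."""
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def simple_loader(dataset, batch_size):
    # a loader's workers can ask for dataset[i] for any i, in any order --
    # which is exactly what a queue of freshly generated data cannot do
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]
```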

My questions are:

  • How do I need to modify my code above so that there will be prefetching for my custom iterator?
  • Where might I find an example or more detailed documentation of how mx.io.PrefetchingIter works?
  • Are there other strategies I should be aware of for getting more performance out of my GPUs via a custom iterator? Right now they are only operating at around 50% capacity, and upping (or lowering) the batch size doesn't change this. What other knobs might I be able to turn to increase GPU use efficiency?

Thanks for any feedback and advice.

TFdoe
1 Answer


As you already mentioned, Gluon's DataLoader provides prefetching. In your custom DataIterator you are using numpy arrays as input, so you could do the following:

f = "testdata/data%s.npy" % date_str
data = np.load(f)
train = gluon.data.ArrayDataset(mx.nd.array(data))
train_iter = gluon.data.DataLoader(train, shuffle=True, num_workers=4,
                                   batch_size=batch_size, last_batch='rollover')

Since you are creating your data dynamically, you could try recreating the DataLoader in every epoch with a freshly loaded numpy array. If GPU utilization is still low, try increasing batch_size and num_workers. Another factor is the size of your dataset: recreating the DataLoader has a fixed cost, so a larger dataset makes each epoch longer and amortizes that cost, which improves overall throughput.
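Schematically, the per-epoch rebuild looks like this (a plain-Python sketch with made-up names; load_fresh_array stands in for generating and np.load-ing a new file, and the list of batches stands in for ArrayDataset + DataLoader):

```python
def load_fresh_array(epoch):
    # stand-in for producing/loading a new numpy file for this epoch
    return [epoch * 100 + i for i in range(6)]

def run_epochs(num_epochs, batch_size):
    seen = []
    for epoch in range(num_epochs):
        data = load_fresh_array(epoch)  # new data every epoch
        # rebuild the loader from scratch so it sees the fresh array
        batches = [data[i:i + batch_size]
                   for i in range(0, len(data), batch_size)]
        for batch in batches:
            seen.append(batch)          # train_step(batch) in real code
    return seen
```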

NRauschmayr