I would expect DataLoader to load batches concurrently with the main process, refilling the buffer as soon as a batch is consumed from it. However, when I track GPU utilization and the order of loading vs. execution, I see different behavior:
1. loading of the whole buffer (expected)
2. consuming the whole buffer by executing all the batches until the buffer is empty (not expected)
3. loading the whole buffer again without parallel execution (not expected)
4. go to 2.
This obviously results in dips in GPU utilization when in step 3.
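To make the loading vs. execution order visible, here is a minimal sketch of how the time spent waiting for each batch could be logged. The helper `timed_batches` and the stand-in `slow_source` are hypothetical names, not part of PyTorch; the helper only wraps an arbitrary iterable, so I assume it could wrap a DataLoader the same way.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_batches(loader: Iterable[T]) -> Iterator[Tuple[T, float]]:
    """Yield (batch, wait_seconds) for each item of `loader`.

    wait_seconds is how long the consumer blocked until the loader
    handed over the batch (hypothetical helper, not a PyTorch API).
    """
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def slow_source(n: int, delay: float) -> Iterator[int]:
    """Hypothetical stand-in for a loader that needs `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


# Log the wait per batch; with a real loader, the training step would go here.
for step, (batch, wait) in enumerate(timed_batches(slow_source(3, 0.01))):
    print(f"step {step}: waited {wait * 1000:.1f} ms for batch {batch}")
```

If the pattern I describe is real, I would expect near-zero waits while the buffer is being consumed, followed by one long wait each time it is refilled. Note that this only measures the wait on the loader; timing the GPU step itself would additionally need a `torch.cuda.synchronize()` call, since CUDA kernels run asynchronously.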
I set:
num_workers >= 1
pin_memory = True/False (doesn't influence the described behavior)
Has anyone experienced the same? What could be the issue?