Hi, I have a data processing pipeline that I want to optimise by running processing threads on the CPUs while an MXNet prediction model runs on the GPUs at the same time (Python 3.6).
The idea that I would like to apply is the following (suppose that I have N GPUs on my machine):
- A GPU Job Dispatcher reads a sequence of N frames from a video and sends each frame to one GPU.
- Each GPU processes its frame and predicts its content with MXNet.
- Once all N GPUs have finished their predictions, I want to, at the same time:
  - send the prediction outputs to a queue;
  - read and process the next N frames on the GPUs.
- The queue is consumed by a multithreaded process that runs on the CPU.
Here is a visual description of the workflow:
The idea is to take advantage of the idle CPUs while the GPUs are busy processing the frames.
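The overlap described above can be sketched with a plain producer/consumer pattern: the producer (standing in for the GPU stage) keeps pushing results into a queue while a consumer thread (standing in for the CPU stage) drains it concurrently. The names `gpu_stage` and `cpu_stage` are stand-ins for illustration, not part of the actual pipeline:

```python
import queue
import threading
import time

def gpu_stage(batches, out_queue):
    # stand-in for the GPU prediction step: pushes each "prediction"
    # to the queue as soon as the batch is done
    for batch in batches:
        time.sleep(0.01)           # simulated GPU latency
        out_queue.put(batch * 10)  # simulated prediction output
    out_queue.put(None)            # sentinel: no more batches

def cpu_stage(in_queue, results):
    # consumer: blocks on get(), so no busy-waiting is needed
    while True:
        item = in_queue.get()
        if item is None:
            break
        results.append(item)       # simulated CPU post-processing

q = queue.Queue()
results = []
consumer = threading.Thread(target=cpu_stage, args=(q, results))
consumer.start()
gpu_stage([1, 2, 3], q)            # producer runs in the main thread
consumer.join()
print(results)                     # → [10, 20, 30]
```

The key point is that the consumer never blocks the producer: the queue decouples the two stages, so the CPU side can process batch k while the GPU side is already working on batch k+1.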
By using the threading library, I was successful in reading and processing the first N frames, but the GPUs are not able to process the next batch of frames.
Please note, the source code below is simplified to clarify the workflow.
Here is the code of the function that reads the frames, dispatches them to the GPUs, and then sends the outputs to the CPU queue:
def dispatch_jobs(video_capture, detection_workers, number_of_gpu, cpu_queue):
    # detection_workers is a list of N similar MXNet models, each one working on a different GPU
    is_last_frame = False
    while not is_last_frame:
        frames_batch = []
        for i in range(number_of_gpu):
            success, frame = read_frame_from_video(video_capture)
            if not success:
                logging.warning("Can't receive frame. Exiting.")
                is_last_frame = True
                break
            frames_batch.append(frame)
        workers = []
        for detection_worker_id in range(len(frames_batch)):
            frame_image = frames_batch[detection_worker_id]
            thread = Thread(target=detection_workers[detection_worker_id].predict,
                            kwargs={'image': frame_image})
            workers.append(thread)
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        # sending to the CPU queue
        for detection_worker_id in range(len(frames_batch)):
            detector_output = detection_workers[detection_worker_id].output
            cpu_queue.put(detector_output)
    logging.info("While loop is broken... putting -1 in the queue")
    cpu_queue.put(-1)
    return
As explained above, there is a consumer thread that reads the outputs from the cpu_queue and sends them to a multithreaded function (on the CPU). Here is the code of the consumption function:
def consume_cpu_queue(cpu_queue):
    while cpu_queue.empty():
        logging.info("Sleeping 1 second")
        time.sleep(1)
    prediction_output = cpu_queue.get()
    if prediction_output == -1:
        return
    process_output_multithread(prediction_output)
    consume_cpu_queue(cpu_queue)
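As a side note, the poll-sleep-recurse pattern above can be replaced with a single loop around a blocking `queue.Queue.get()`: the call wakes as soon as an item arrives (no 1-second polling latency), and an iterative loop cannot exhaust Python's recursion limit on long videos. A minimal sketch, where `process_output` is a hypothetical stand-in for the CPU handler:

```python
import queue

def consume_cpu_queue(cpu_queue):
    # blocking get() sleeps until an item is available; the loop
    # replaces the tail recursion of the original version
    while True:
        prediction_output = cpu_queue.get()
        if prediction_output == -1:     # sentinel from the dispatcher
            return
        process_output(prediction_output)

# quick demo with a stub handler
seen = []
def process_output(pred_output):
    seen.append(pred_output)            # simulated CPU post-processing

q = queue.Queue()
for item in ("a", "b", -1):
    q.put(item)
consume_cpu_queue(q)
print(seen)  # → ['a', 'b']
```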
def process_output_multithread(pred_output, number_of_process):
    workers = []
    for i in range(number_of_process):
        thread = Thread(target=process, kwargs={'pred_output': pred_output})
        workers.append(thread)
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return
# Here is how the consumer thread is initiated
cpu_consumer_thread = Thread(target=consume_cpu_queue)
# Here is how I run the application
cpu_consumer_thread.start()
dispatch_jobs(video_capture, detection_workers)
cpu_consumer_thread.join()
I have checked this question, but I am not sure whether Numba can solve my issue.
Any suggestion or pointer would be greatly appreciated.