I am implementing fast DNN training with knowledge distillation, as illustrated in the figure below, by running the teacher and student models in parallel.
I checked some popular repos like NervanaSystems/distiller and peterliht/knowledge-distillation-pytorch. They execute the forward passes of the student and teacher models sequentially, i.e., not in parallel on different devices (GPU or CPU).
I am trying to speed up this training process by running the two models simultaneously on multiple devices, e.g., loading one model on the CPU so that it does not interrupt the GPU training of the other model.
What is the proper way to run two models in parallel? Can I use the Python `multiprocessing` library to start two processes, one per model, i.e., load two model instances and run `forward()` in each? I am using MXNet, but this is a general question that applies to all ML frameworks.
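For concreteness, here is a minimal, framework-agnostic sketch of the multiprocessing idea I have in mind. The "teacher" here is a NumPy stub standing in for a real pre-trained model, and the queues carry batches to the teacher process and soft targets back to the main process; everything in it is a placeholder, not the actual training code.

```python
# Minimal sketch: a separate process runs the (frozen) teacher's forward pass
# while the main process would train the student. The teacher here is a stub.
import multiprocessing as mp
import numpy as np

def teacher_worker(batch_q, logits_q):
    """Runs the frozen teacher's forward pass on CPU in its own process."""
    # Placeholder "teacher": a fixed random linear layer standing in for a
    # real pre-trained model loaded with the framework of choice.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((32, 10))
    while True:
        batch = batch_q.get()
        if batch is None:            # poison pill -> shut down
            break
        logits_q.put(batch @ w)      # "forward pass" of the frozen teacher

def main():
    batch_q, logits_q = mp.Queue(maxsize=4), mp.Queue(maxsize=4)
    teacher = mp.Process(target=teacher_worker, args=(batch_q, logits_q))
    teacher.start()

    rng = np.random.default_rng(1)
    for step in range(5):
        batch = rng.standard_normal((8, 32)).astype(np.float32)
        batch_q.put(batch)                 # teacher works on this batch...
        # ...while the student's forward/backward for the same (or previous)
        # batch would run here on the GPU in the main process.
        teacher_logits = logits_q.get()    # soft targets for the distillation loss
        print(step, teacher_logits.shape)

    batch_q.put(None)
    teacher.join()

if __name__ == "__main__":
    main()
```

The intent is that the teacher process computes soft targets for one batch while the main process trains the student on the GPU, but I am not sure this is the right (or most efficient) way to structure it.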
Edit:
My plan is to put a lightweight, pre-trained teacher model on the CPU, where it only runs the forward pass with frozen parameters.
The student model is a large model that will be trained on GPUs (in a distributed fashion).
This is not a model-compression task.
I expect that moving a light task (the teacher's forward pass) to the CPU increases the overlap between the two models and makes the pipeline faster; a rough sketch of this pipeline is at the end of this post.
The idea comes from a workshop paper: Infer2Train: Leveraging Inference for Better Training of Deep Networks.
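Below is a rough sketch of the planned pipeline in MXNet Gluon, not a working implementation: the architectures, temperature, learning rate, and data are placeholders, and it assumes a GPU at index 0. The idea is that the teacher's forward pass is issued on the CPU context with gradients disabled, while the student's forward/backward runs on the GPU context, so MXNet's asynchronous engine may already overlap the two without explicit multiprocessing.

```python
# Sketch of the planned pipeline (placeholder models/data, assumes mx.gpu(0) exists).
import mxnet as mx
from mxnet import autograd, gluon, nd

cpu, gpu = mx.cpu(), mx.gpu(0)

# Frozen, pre-trained teacher on CPU (placeholder architecture).
teacher = gluon.nn.Dense(10)
teacher.initialize(ctx=cpu)                            # in practice: load pre-trained weights
teacher.collect_params().setattr('grad_req', 'null')   # no gradients for the teacher

# Student to be trained on GPU (placeholder architecture).
student = gluon.nn.Dense(10)
student.initialize(ctx=gpu)
trainer = gluon.Trainer(student.collect_params(), 'sgd', {'learning_rate': 0.1})

kl = gluon.loss.KLDivLoss(from_logits=False)   # distillation loss on soft targets
T = 4.0                                        # temperature (assumed value)

for step in range(5):
    data = nd.random.uniform(shape=(8, 32))    # stand-in for a real data batch

    # Issue the teacher's forward pass on the CPU context; the call returns
    # quickly because MXNet operators are queued asynchronously.
    soft_targets = nd.softmax(teacher(data.as_in_context(cpu)) / T)

    # The student's forward/backward on the GPU can be queued while the teacher runs.
    with autograd.record():
        student_logits = student(data.as_in_context(gpu))
        loss = kl(student_logits / T, soft_targets.as_in_context(gpu))
    loss.backward()
    trainer.step(8)

print('final loss:', loss.mean().asscalar())
```

If this single-process approach already overlaps the CPU teacher and GPU student well enough, the multiprocessing setup above may be unnecessary; that is essentially what I am asking.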