I am implementing fast DNN training with knowledge distillation, as illustrated in the figure below, by running the teacher and student models in parallel.
I checked some popular repos like NervanaSystems/distiller and peterliht/knowledge-distillation-pytorch. They execute the forward passes of the student and teacher models sequentially, i.e., not in parallel on different devices (GPU or CPU).
I am trying to speed up this training process by running the two models simultaneously on multiple devices, e.g., loading one model on the CPU so that it does not interrupt the GPU training of the other model.
What is the proper way to run two models in parallel? Can I use the Python `multiprocessing` library to start two processes, one per model, i.e., load two model instances and run `forward()` in each? I am using MXNet, but this is a general question that applies to all ML frameworks.
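For concreteness, here is a minimal, framework-agnostic sketch of the multiprocessing idea I have in mind. The "teacher" here is a NumPy stub standing in for a real pre-trained model, and the queues carry batches to the teacher process and soft targets back to the main process; everything in it is a placeholder, not the actual training code.

```python
# Minimal sketch: a separate process runs the (frozen) teacher's forward pass
# while the main process would train the student. The teacher here is a stub.
import multiprocessing as mp
import numpy as np

def teacher_worker(batch_q, logits_q):
    """Runs the frozen teacher's forward pass on CPU in its own process."""
    # Placeholder "teacher": a fixed random linear layer standing in for a
    # real pre-trained model loaded with the framework of choice.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((32, 10))
    while True:
        batch = batch_q.get()
        if batch is None:            # poison pill -> shut down
            break
        logits_q.put(batch @ w)      # "forward pass" of the frozen teacher

def main():
    batch_q, logits_q = mp.Queue(maxsize=4), mp.Queue(maxsize=4)
    teacher = mp.Process(target=teacher_worker, args=(batch_q, logits_q))
    teacher.start()

    rng = np.random.default_rng(1)
    for step in range(5):
        batch = rng.standard_normal((8, 32)).astype(np.float32)
        batch_q.put(batch)                 # teacher works on this batch...
        # ...while the student's forward/backward for the same (or previous)
        # batch would run here on the GPU in the main process.
        teacher_logits = logits_q.get()    # soft targets for the distillation loss
        print(step, teacher_logits.shape)

    batch_q.put(None)
    teacher.join()

if __name__ == "__main__":
    main()
```

The intent is that the teacher process computes soft targets for one batch while the main process trains the student on the GPU, but I am not sure this is the right (or most efficient) way to structure it.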
Edit:
My plan is to put a lightweight, pre-trained teacher model on the CPU, where it only runs the forward pass with frozen parameters.
The student model is a large model that will be trained on GPUs (in a distributed fashion).
This is not a model-compression task.
I expect that moving a light task (the teacher's forward pass) to the CPU increases the overlap between the two models and makes the pipeline faster; a rough sketch of this pipeline is at the end of this post.
The idea comes from a workshop paper: Infer2Train: Leveraging Inference for Better Training of Deep Networks.
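Below is a rough sketch of the planned pipeline in MXNet Gluon, not a working implementation: the architectures, temperature, learning rate, and data are placeholders, and it assumes a GPU at index 0. The idea is that the teacher's forward pass is issued on the CPU context with gradients disabled, while the student's forward/backward runs on the GPU context, so MXNet's asynchronous engine may already overlap the two without explicit multiprocessing.

```python
# Sketch of the planned pipeline (placeholder models/data, assumes mx.gpu(0) exists).
import mxnet as mx
from mxnet import autograd, gluon, nd

cpu, gpu = mx.cpu(), mx.gpu(0)

# Frozen, pre-trained teacher on CPU (placeholder architecture).
teacher = gluon.nn.Dense(10)
teacher.initialize(ctx=cpu)                            # in practice: load pre-trained weights
teacher.collect_params().setattr('grad_req', 'null')   # no gradients for the teacher

# Student to be trained on GPU (placeholder architecture).
student = gluon.nn.Dense(10)
student.initialize(ctx=gpu)
trainer = gluon.Trainer(student.collect_params(), 'sgd', {'learning_rate': 0.1})

kl = gluon.loss.KLDivLoss(from_logits=False)   # distillation loss on soft targets
T = 4.0                                        # temperature (assumed value)

for step in range(5):
    data = nd.random.uniform(shape=(8, 32))    # stand-in for a real data batch

    # Issue the teacher's forward pass on the CPU context; the call returns
    # quickly because MXNet operators are queued asynchronously.
    soft_targets = nd.softmax(teacher(data.as_in_context(cpu)) / T)

    # The student's forward/backward on the GPU can be queued while the teacher runs.
    with autograd.record():
        student_logits = student(data.as_in_context(gpu))
        loss = kl(student_logits / T, soft_targets.as_in_context(gpu))
    loss.backward()
    trainer.step(8)

print('final loss:', loss.mean().asscalar())
```

If this single-process approach already overlaps the CPU teacher and GPU student well enough, the multiprocessing setup above may be unnecessary; that is essentially what I am asking.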