
I have trained a CNN model on GPU using FastAI (PyTorch backend). I am now trying to use that model for inference on the same machine, but using CPU instead of GPU. Along with that, I am also trying to make use of multiple CPU cores using the multiprocessing module. Here is the issue:

Running the code on a single CPU (without multiprocessing) takes only about 40 seconds to process nearly 50 images.

Running the code on multiple CPUs using torch multiprocessing takes more than 6 minutes to process the same 50 images.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""  # hide the GPU so inference stays on CPU

import torch
from torch.multiprocessing import Pool, set_start_method
from fastai.vision import *
from fastai.text import *

defaults.device = torch.device('cpu')

def process_image_batch(batch):

    learn_cnn  = load_learner(scripts_folder, 'cnn_model.pkl')
    learn_cnn.model.training = False    
    learn_cnn.model = learn_cnn.model.eval()
    # predictions = []
    # for image in batch:
    #     predictions.append(...)  # predicting each image here
    # return predictions

if __name__ == '__main__':
    # image_batches = ..... # retrieving the image batches (It is a list of 5 lists)
    n_processes = 5
    set_start_method('spawn', force=True)
    pool = Pool(n_processes)
    try:
        pool.map(process_image_batch, image_batches)
    except KeyboardInterrupt:
        exit()
    except Exception as e:
        print('Main Pool Error: ', e)
    finally:
        pool.terminate()
        pool.join()

I am not sure what's causing this slowdown in multiprocessing mode. I've read a lot of posts discussing similar issues but couldn't find a proper solution anywhere.

asanoop24
  • Did you figure it out? I am having the same issue and also can't find a proper solution. I did realize that PyTorch somehow tries to use all the available CPUs in every process, and this might be causing these massive slowdowns, but I am not sure. – AnarKi Sep 03 '20 at 16:02
  • @AnarKi Yes. I had to force PyTorch to use only 1 thread per process: `torch.set_num_threads(1)` – asanoop24 Sep 04 '20 at 17:17
  • Using only one thread helped me in a very similar situation as well. Thanks for the info. You might want to post that (and accept) as an answer to your own question. – Matthias Apr 28 '21 at 14:04

2 Answers


I think you have made a naive mistake here: you are loading the model object inside the function that you are parallelizing.

That means that for every single image, you are reloading the model from disk. Depending on the size of your model object, the I/O is going to be more time-consuming than running a forward step.

Please consider loading the model once in the main thread and then making the object available for inference in the parallel function.
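For example, here is a minimal sketch of that idea (not the answerer's exact code), reusing `scripts_folder`, `cnn_model.pkl` and `image_batches` from the question. With the `spawn` start method the loaded learner cannot simply be inherited from the main process, but a `Pool` initializer can load it once per worker and keep it in a module-level global, so every batch handled by that worker reuses the already-loaded model.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""

import torch
from torch.multiprocessing import Pool, set_start_method
from fastai.vision import *

learn_cnn = None  # populated once per worker process by init_worker

def init_worker():
    # Runs exactly once in each worker process: load the model a single
    # time here instead of once per batch (or once per image).
    global learn_cnn
    torch.set_num_threads(1)  # avoid thread oversubscription across workers
    learn_cnn = load_learner(scripts_folder, 'cnn_model.pkl')
    learn_cnn.model.eval()

def process_image_batch(batch):
    # Reuse the learner that init_worker already loaded.
    return [learn_cnn.predict(image) for image in batch]

if __name__ == '__main__':
    set_start_method('spawn', force=True)
    with Pool(processes=5, initializer=init_worker) as pool:
        results = pool.map(process_image_batch, image_batches)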

tejas
  • @tejas ...yes, that was my bad. I am actually using batches instead of single images; I was just trying to keep the code simple here but have updated it now. So the model is getting loaded for each batch instead of each image, but it is still showing poor results. – asanoop24 Sep 30 '19 at 16:48
  • So you could do one naive thing. Let's assume you have 8 cores and 1600 images to infer. Split the data into 8 equal parts, i.e. 200 files each (see the sketch below). Now write a function that loads the model object and runs inference on those 200 files. Finally, using multiprocessing, create 8 worker processes and parallelize the function over the 8 chunks of your 1600 files. This way you only load the model 8 times, once in each process. – tejas Dec 23 '20 at 12:21
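A minimal sketch of that splitting step, assuming a plain Python list of image paths (`all_image_paths` and `split_into_chunks` are made-up names for illustration):

def split_into_chunks(items, n_chunks):
    # Split a list into n_chunks roughly equal parts, e.g. 1600 paths -> 8 x 200.
    chunk_size = (len(items) + n_chunks - 1) // n_chunks
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

image_batches = split_into_chunks(all_image_paths, 8)  # one chunk per worker process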

The solution turned out to be forcing PyTorch to use only one thread per process, as below:

torch.set_num_threads(1)
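For context, a minimal sketch of where that call goes, reusing the `process_image_batch` worker from the question (the prediction loop is elided as in the original):

def process_image_batch(batch):
    # Must run inside each worker process: with the 'spawn' start method,
    # thread settings made in the parent do not carry over to the children.
    torch.set_num_threads(1)
    learn_cnn = load_learner(scripts_folder, 'cnn_model.pkl')
    learn_cnn.model.eval()
    # ... run predictions on the batch as before ...

Without this, each worker can spawn as many intra-op threads as there are cores, so the five workers end up competing for the same CPUs, which matches the slowdown described in the question.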

asanoop24