I have a Python-based machine learning pipeline that runs three algorithms on my data: Random Forest (scikit-learn implementation), Gradient Boosting (XGBoost implementation), and a Recurrent Neural Network (Theano/Keras implementation). The first two run on the CPU and are parallelized using joblib.Parallel(), and the latter runs on the GPU using CUDA. I pass the name of the algorithm I want to use (RF, XGB, or NN) in the variable method to a function, which then runs a parallel implementation of that algorithm:
from joblib import Parallel, delayed

if method in ['RF', 'XGB']:
    # CPU-based models: evaluate the sampled hyper-parameter configurations
    # in parallel worker processes via joblib
    with Parallel(n_jobs=8) as parallel_param:
        ...
        model_scores = parallel_param(
            delayed(model_selection_loop)(p_idx, ..., method, ...)
            for p_idx in range(num_parameter_samples))
        ...
elif method == 'NN':
    # GPU-based model: evaluate the configurations sequentially and let
    # Theano/CUDA parallelize on the GPU
    ...
    model_scores = []
    for p_idx in range(num_parameter_samples):
        model_scores.append(model_selection_loop(p_idx, ..., method, ...))
    ...
else:
    raise ValueError("Unknown algorithm!")
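To illustrate the run order I describe below, I call this dispatch roughly as follows from my IPython session (the wrapper name run_models and its arguments are illustrative placeholders, not my exact code):

rf_scores = run_models(X, y, method='RF')    # CPU, parallelized with joblib
xgb_scores = run_models(X, y, method='XGB')  # CPU, parallelized with joblib
nn_scores = run_models(X, y, method='NN')    # GPU, Theano/CUDA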
model_selection_loop() is a function that performs nested cross-validation to select the best-performing hyper-parameters and to estimate the performance of the selected model on new data, and num_parameter_samples is the number of different hyper-parameter configurations to be sampled from a grid (I use, say, 30 different configurations).
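For context, model_selection_loop is structured roughly like the sketch below; it is simplified to a single cross-validation loop for brevity, and build_model() is a placeholder for the code that constructs the scikit-learn, XGBoost, or Keras estimator for the given method:

from sklearn.model_selection import KFold, cross_val_score

def model_selection_loop(p_idx, X, y, method, param_grid):
    # Sketch only, not my exact code: evaluate the p_idx-th sampled
    # hyper-parameter configuration with cross-validation and return its score.
    params = param_grid[p_idx]
    model = build_model(method, params)  # placeholder: builds the RF / XGB / NN estimator
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv)
    return params, scores.mean()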
If the run order is (1) RF, (2) XGB, (3) NN, everything runs fine: the first two are parallelized on a 4-core CPU and the latter runs on a Tesla K80 GPU, and at the end I get three sets of predictions which I can later merge. However, after running the Theano-based, CUDA-parallelized neural network, if I rerun either RF or XGB, I get the following error:
--> 965 model_scores = parallel_param(delayed(model_selection_loop)(p_idx, ..., method, ...) for p_idx in range(num_parameter_samples))
/home/s/anaconda2/lib/python2.7/site-packages/joblib/parallel.pyc in __call__(self, iterable)
808 # consumption.
809 self._iterating = False
--> 810 self.retrieve()
811 # Make sure that we get a last message telling us we are done
812 elapsed_time = time.time() - self._start_time
/home/s/anaconda2/lib/python2.7/site-packages/joblib/parallel.pyc in retrieve(self)
725 job = self._jobs.pop(0)
726 try:
--> 727 self._output.extend(job.get())
728 except tuple(self.exceptions) as exception:
729 # Stop dispatching any new job in the async callback thread
/home/s/anaconda2/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
GpuArrayException: invalid argument
It seems to me that once the CUDA-based code (the recurrent neural network) has been executed, any call to Parallel() gets rerouted to the GPU instead of the CPU, since the error I get is a GpuArrayException even though I called a scikit-learn Random Forest classifier, which should run on the CPU. In fact, if my first run is ordered (1) NN, (2) RF, (3) XGB, I get the same error after the first pass finishes.
I should mention that I am running these in an IPython session, typing the commands by hand in the terminal (although I am not sure whether that matters).
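For reference, the neural network ends up on the GPU through the usual Theano device flags; the sketch below shows the kind of setup I am assuming (my exact flags may differ slightly):

import os

# Assumed setup, not necessarily my exact flags: device=cuda0 selects the
# libgpuarray/CUDA backend (the one that raises GpuArrayException), and the
# flag must be set before the first "import theano" in the session.
os.environ['THEANO_FLAGS'] = 'device=cuda0,floatX=float32'

import theano  # typically prints the mapped GPU device on import
import keras   # Keras is configured (via keras.json) to use the Theano backend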
Any ideas on why this happens, and how I can route any execution of RF or XGB to the CPU regardless of whether the NN method was previously executed on the GPU? I'd appreciate your help.