
I am trying to learn more about parallelisation to speed up this classification code. For context, I started reading about it less than 24 hours ago. Which multiprocessing technique would be best suited to this problem, and what sort of speed improvement can I expect? Suggestions on how to structure the code would also be highly appreciated. I am currently looking into the ray, joblib and multiprocessing libraries.

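from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

# X, X_train, y_train (DataFrames) and price (list of column names) are defined elsewhere
cal_probs = []
for col in price:
    # Cross-validation strategy
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # Classifier: SVC wrapped in isotonic calibration
    tune_clf = CalibratedClassifierCV(
        SVC(gamma='scale', class_weight='balanced', C=0.01),
        method="isotonic",
        cv=cv,
    ).fit(X_train[[col, 'regime']], y_train[col])

    # Calibrated probabilities
    pred_probs = tune_clf.predict_proba(X[[col, 'regime']])
    cal_probs.append(pred_probs)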
Carla

1 Answer


Multiprocessing is a complicated thing. What is "best" for your use case may be totally different to what is best for a different one.

The best route depends heavily on your environment and data, because serialization and deserialization incur significant overhead. In many cases, moving to multiprocessing can actually slow your code down. You also have to weigh the complexity of the solution, CPU, memory and I/O overheads, and all sorts of other factors.
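To get a rough feel for the serialization cost mentioned above, you could time a pickle round trip of your data, since process-based pools generally pickle the arguments they send to workers. This is only a sketch: `payload` is a made-up stand-in for a training set, and the numbers will vary with your machine and data.

```python
import pickle
import time

# Hypothetical payload standing in for data shipped to a worker process.
payload = [[float(i + j) for j in range(50)] for i in range(20_000)]

start = time.perf_counter()
# Cost paid by the parent process to send the data.
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
# Cost paid by the worker process to receive it.
restored = pickle.loads(blob)
elapsed = time.perf_counter() - start

print(f"pickled size: {len(blob) / 1e6:.1f} MB, round trip: {elapsed * 1000:.1f} ms")
```

If the round trip takes a noticeable fraction of the time a single model fit takes, parallelising across processes is unlikely to pay off.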

For example, if it required ten-thousand lines of highly complex concurrent-proof code to shave off 0.1 seconds, you probably wouldn't consider that speed improvement worth the maintenance overhead, and so the "best" thing to do there would be to leave it as a single-threaded solution. Other use cases may deem this an acceptable cost.

Without this information, any answer would have to make assumptions about the relative importance of each of these factors to you.

I would say the "best" is to use CalibratedClassifierCV's n_jobs parameter as this is very simple and easy to implement - it requires no further dependencies and you don't need to write any parallel code.
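As a minimal sketch of what that looks like, assuming scikit-learn 0.24 or later (where CalibratedClassifierCV accepts n_jobs): the data below is synthetic and the names `X_demo`, `y_demo` and `price_a` are invented for illustration, since the question's data is not shown. With n_jobs set, the clones of the base estimator are fitted in parallel across the cross-validation splits.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; the real X_train / y_train are not shown).
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "price_a": rng.normal(size=n),
    "regime": rng.integers(0, 2, size=n),
})
y_demo = (X_demo["price_a"] + 0.5 * rng.normal(size=n) > 0).astype(int)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
clf = CalibratedClassifierCV(
    SVC(gamma="scale", class_weight="balanced", C=0.01),
    method="isotonic",
    cv=cv,
    n_jobs=2,  # use -1 for all available cores
)
clf.fit(X_demo, y_demo)
probs = clf.predict_proba(X_demo)  # one row per sample, one column per class
```

Whether this is faster than the serial version depends on how long each fit takes relative to the cost of starting workers and shipping data to them, so it is worth timing both on your own data.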

D Hudson
  • Thank you for your reply, D Hudson. The reason I started looking into multiprocessing is that I haven't seen any speed advantage using n_jobs=-1. – Carla Mar 07 '21 at 16:09