What is the right way to run python vaex.ml.catboost.CatBoostModel.fit in parallel for several folds?

Question

Description

I have Python code that calls vaex.ml.catboost.CatBoostModel.fit sequentially for 3 folds. This takes a long time, so I would like to run the fits in parallel.

Problem

I get different results when I run vaex.ml.catboost.CatBoostModel.fit sequentially and in parallel, so I must be doing something wrong. I expect the parallel results to be very close to the sequential ones (the seed is not hardcoded, so there is always some minor fluctuation). Instead, the sequential and parallel versions produce results that are not comparable at all.

Here is the sequential code. It produces the expected result:

estimator = CatBoostModel(
    features=features + features_cat,
    target=target,
    num_boost_round=700,
    prediction_name="catboost_prediction",
    prediction_type=prediction_type,
)
 
for fold in folds:
    logging.info(f"training fold: {fold}")  # 1,2,3
    df_train = df[df.cv_fold != fold]
    df_val = df[df.cv_fold == fold]
    estimator.fit(df=df_train, evals=[df_val], early_stopping_rounds=100, verbose_eval=True)
    cv_scores[cv_fold == fold] = estimator.predict(df_val)

Here is my parallel code:

import concurrent.futures


def train_fold(fold, cv_scores, df, estimator: CatBoostModel):
    logging.info(f"training fold: {fold}")
    df_train = df[df.cv_fold != fold]
    df_val = df[df.cv_fold == fold]
    # Note: every fold fits and predicts with the same estimator object.
    estimator.fit(df=df_train, evals=[df_val], early_stopping_rounds=100, verbose_eval=True)
    result = estimator.predict(df_val)
    return (fold, result)


with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # The same estimator instance is passed to every submitted task.
    future_to_fold = {
        executor.submit(train_fold, fold, cv_scores, df, estimator): fold
        for fold in folds
    }
    for future in concurrent.futures.as_completed(future_to_fold):
        (fold, result) = future.result()
        logging.info(f"completed future for {fold}, result: {result.shape}")
        cv_scores[cv_fold == fold] = result

What is the right way to run python vaex.ml.catboost.CatBoostModel.fit in parallel for several folds?
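
One possible explanation, not verified against vaex internals: both versions reuse a single estimator object, but in the threaded version all folds call fit and predict on it at the same time, so a fold may end up predicting with a model that another fold has just trained. A minimal sketch of the alternative, where each worker builds its own estimator, is shown below. FakeEstimator, make_estimator and cross_validate are hypothetical names used only for illustration; FakeEstimator stands in for a stateful model such as CatBoostModel so that the snippet runs without vaex or CatBoost installed.

```python
import concurrent.futures


class FakeEstimator:
    """Hypothetical stand-in for a stateful model such as CatBoostModel."""

    def __init__(self):
        self.model = None

    def fit(self, train_rows):
        # "Training" here just stores the mean of the training values.
        self.model = sum(train_rows) / len(train_rows)

    def predict(self, val_rows):
        # Predict the stored mean for every validation row.
        return [self.model for _ in val_rows]


def train_fold(fold, rows, fold_ids, make_estimator):
    # Create a fresh estimator inside the worker so no state is shared
    # between folds running at the same time.
    estimator = make_estimator()
    train = [r for r, f in zip(rows, fold_ids) if f != fold]
    val = [r for r, f in zip(rows, fold_ids) if f == fold]
    estimator.fit(train)
    return fold, estimator.predict(val)


def cross_validate(rows, fold_ids, folds, make_estimator, max_workers=3):
    cv_scores = [None] * len(rows)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(train_fold, fold, rows, fold_ids, make_estimator)
            for fold in folds
        ]
        for future in concurrent.futures.as_completed(futures):
            fold, result = future.result()
            # Write each fold's predictions back to that fold's positions.
            positions = [i for i, f in enumerate(fold_ids) if f == fold]
            for i, score in zip(positions, result):
                cv_scores[i] = score
    return cv_scores


rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_ids = [1, 2, 3, 1, 2, 3]
scores = cross_validate(rows, fold_ids, folds=[1, 2, 3], make_estimator=FakeEstimator)
print(scores)
```

In the real code, make_estimator would presumably be a small function that constructs a new CatBoostModel with the same arguments as in the sequential version. Whether threads give any speed-up is a separate question: CatBoost usually already trains with multiple threads inside a single fit, so running folds concurrently may mostly make them compete for the same cores.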
