Description
I have Python code that calls vaex.ml.catboost.CatBoostModel.fit
sequentially for 3 folds.
This takes a lot of time, so I would like to run vaex.ml.catboost.CatBoostModel.fit
in parallel.
Problem
I get different results when I run vaex.ml.catboost.CatBoostModel.fit
sequentially and in parallel. I am definitely doing something wrong. I expect the parallel results to be very close to the sequential results (the seed is not hardcoded, so there is always some minor fluctuation), but the two versions produce completely incomparable results.
Here is the sequential code. It produces the result I consider correct:
estimator = CatBoostModel(
    features=features + features_cat,
    target=target,
    num_boost_round=700,
    prediction_name="catboost_prediction",
    prediction_type=prediction_type
)
for fold in folds:
    logging.info(f"training fold: {fold}")  # 1,2,3
    df_train = df[df.cv_fold != fold]
    df_val = df[df.cv_fold == fold]
    estimator.fit(df=df_train, evals=[df_val], early_stopping_rounds=100, verbose_eval=True)
    cv_scores[cv_fold == fold] = estimator.predict(df_val)
Here is my parallel code:
import concurrent.futures

# Defined before the executor block so the name exists when tasks are submitted.
# `task` is added to the signature to match the arguments passed to executor.submit below.
def train_fold(fold,
               cv_scores,
               df, task, estimator: CatBoostModel):
    logging.info(f"training fold: {fold}")
    df_train = df[df.cv_fold != fold]
    df_val = df[df.cv_fold == fold]
    estimator.fit(df=df_train, evals=[df_val], early_stopping_rounds=100, verbose_eval=True)
    result = estimator.predict(df_val)
    return (fold, result)

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # The same estimator object is passed to every submitted task.
    future_to_result = {executor.submit(train_fold, fold, cv_scores, df, task, estimator): fold
                        for fold in folds}
    for future in concurrent.futures.as_completed(future_to_result):
        res = future_to_result[future]
        (fold, result) = future.result()
        logging.info(f"completed future for {fold}, result: {result.shape}")
        cv_scores[cv_fold == fold] = result
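
One thing I noticed while writing this up, stated as an assumption rather than a confirmed diagnosis: in the parallel version every task receives the same estimator object. If fit stores the trained model on that object, concurrent calls could overwrite each other's state, and predict in one thread might then use a model trained for another fold. The sketch below reproduces only that pattern with a toy class. ToyModel, run, train_fold_toy and share_model are hypothetical names for illustration; nothing here calls vaex or CatBoost, and the barrier exists only to force the overlap that long training would cause.

```python
import concurrent.futures
import threading

FOLDS = [1, 2, 3]


class ToyModel:
    """Hypothetical stateful estimator: fit() stores state on self, predict() reads it."""

    def __init__(self, barrier):
        self.fitted_on = None
        self._barrier = barrier

    def fit(self, fold):
        self.fitted_on = fold
        # Wait until every worker has written its state, so the calls overlap.
        self._barrier.wait(timeout=5)

    def predict(self, fold):
        # True only if the stored state was produced for this fold.
        return self.fitted_on == fold


def train_fold_toy(fold, get_model):
    model = get_model()
    model.fit(fold)
    return fold, model.predict(fold)


def run(share_model):
    barrier = threading.Barrier(len(FOLDS))
    shared = ToyModel(barrier)

    def get_model():
        # Either hand out the one shared object or a fresh object per fold.
        return shared if share_model else ToyModel(barrier)

    with concurrent.futures.ThreadPoolExecutor(max_workers=len(FOLDS)) as executor:
        futures = [executor.submit(train_fold_toy, fold, get_model) for fold in FOLDS]
        return dict(f.result() for f in futures)


shared_result = run(share_model=True)     # only one fold sees its own state
per_fold_result = run(share_model=False)  # every fold sees its own state
print("shared model:  ", shared_result)
print("model per fold:", per_fold_result)
```

If the same thing happens with the real estimator, constructing a separate CatBoostModel inside train_fold for each fold might be worth trying. I have not verified that this explains the difference.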