I am trying to tune the hyperparameters of an XGBoost model using the bayesian-optimization library, and I consistently get a segmentation fault during XGBoost cross-validation, regardless of how large or small my training data is.
I have a dataset with 118 features and about 1.7 million data points, which takes up about 5.6 GB of space. Whenever I run the following code, I get a segmentation fault:
#Import and transform data for bayes opt tuning
import os
import sys
#append path
current_dir = os.getcwd()
sys.path.append(current_dir)
import faulthandler
import sklearn
from sklearn.preprocessing import MinMaxScaler
os.environ['KMP_DUPLICATE_LIB_OK']='True'
import xgboost as xgb
from bayes_opt import BayesianOptimization
import numpy as np
import pandas as pd
faulthandler.enable()
print("Importing data")
transformed_data = pd.read_csv(os.getcwd() + "//saved_dataframes/1000_v3")
transformed_data = transformed_data.sample(frac = 1).reset_index(drop=True)
scaler = MinMaxScaler().set_output(transform="pandas")
full_y = transformed_data[['target']].to_numpy()
transformed_data = transformed_data.drop('target', axis=1)
full_x = scaler.fit_transform(transformed_data)
full_y = full_y.reshape(full_y.shape[0], 1)
del transformed_data
dtrain = xgb.DMatrix(full_x, label=full_y)
del full_x, full_y
# Define the objective function for Bayesian optimization
def xgb_cv(max_depth, learning_rate, subsample, colsample_bytree, colsample_bylevel, min_child_weight, max_delta_step, reg_lambda, reg_alpha, gamma, n_estimators):
    # bayes_opt samples every parameter as a float, so integer-valued ones are cast here
    params = {'objective': 'multi:softprob',
              'num_class': 3,
              'tree_method': 'approx',
              'max_depth': int(max_depth),
              'learning_rate': learning_rate,
              'subsample': subsample,
              'colsample_bytree': colsample_bytree,
              'colsample_bylevel': colsample_bylevel,
              'min_child_weight': min_child_weight,
              'max_delta_step': int(max_delta_step),
              'reg_lambda': reg_lambda,
              'reg_alpha': reg_alpha,
              'gamma': gamma
              }
    cv_result = xgb.cv(params, dtrain, num_boost_round=int(n_estimators), early_stopping_rounds=10, nfold=5, metrics='auc')
    # optimizer.maximize() maximizes the returned value, so the AUC is returned without negation
    return cv_result['test-auc-mean'].iloc[-1]
pbounds = {'learning_rate': (0.001, 1.0),
           'min_child_weight': (0, 10),
           'max_depth': (3, 20),
           'max_delta_step': (0, 20),
           'subsample': (0.25, 1.0),
           'colsample_bytree': (0.1, 1.0),
           'colsample_bylevel': (0.1, 1.0),
           'reg_lambda': (0, 1000.0),
           'reg_alpha': (0, 1000.0),
           'gamma': (0, 20),
           'n_estimators': (50, 400)
           }
# Create a BayesianOptimization object and run the optimization
print('Performing hyperparameter tuning using Bayesian optimization...')
optimizer = BayesianOptimization(f=xgb_cv, pbounds=pbounds, verbose = 10)
optimizer.maximize(init_points=5, n_iter=300)
print(optimizer.max)
I've used faulthandler to trace back the last calls before the segmentation fault, and it outputs one of two traces:
Fatal Python error: Segmentation fault
Thread 0x00007f9cfecf6740 (most recent call first):
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/core.py", line 1918 in update
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 215 in update
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 229 in update
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 538 in cv
File "/home/vincent/stock_algorithm.py", line 63 in xgb_cv
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/target_space.py", line 236 in probe
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 208 in probe
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 310 in maximize
File "/home/vincent/stock_algorithm.py", line 82 in <module>
or
Fatal Python error: Aborted
Thread 0x00007fb44c9e2740 (most recent call first):
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/core.py", line 1989 in eval_set
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 219 in eval
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 233 in <listcomp>
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 233 in eval
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/callback.py", line 232 in after_iteration
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/xgboost/training.py", line 540 in cv
File "/home/vincent/stock_algorithm.py", line 62 in xgb_cv
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/target_space.py", line 236 in probe
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 208 in probe
File "/home/vincent/mambaforge/envs/stock_algorithm_env/lib/python3.11/site-packages/bayes_opt/bayesian_optimization.py", line 310 in maximize
File "/home/vincent/stock_algorithm.py", line 81 in <module>
For the first trace (the segmentation fault), line 1918 in core.py (the last traced call) calls XGBoosterUpdateOneIter in XGBoost's C library.
For the second trace (the abort), line 1989 in core.py calls XGBoosterEvalOneIter in XGBoost's C library.
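Both traces come from Python's built-in faulthandler module, which the script enables with faulthandler.enable(). As a minimal sketch, the same dump could be written to a file instead of stderr so it survives a closed terminal. The log file name is a hypothetical choice, not something from my actual script:

```python
import faulthandler

# Hypothetical log path. The file object must stay open for as long as
# faulthandler is enabled, because the handler writes to its file descriptor.
crash_log = open("faulthandler_crash.log", "w", encoding="utf-8")

# Dump the Python traceback of every thread when a fatal signal
# (SIGSEGV, SIGABRT, SIGBUS, SIGFPE, SIGILL) is received.
faulthandler.enable(file=crash_log, all_threads=True)
```

This only records the Python-level frames. The native frames inside XGBoost's C library are not shown, which is why both traces stop at core.py.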
The computer I am running on has 64 GB of memory. Each time I run this code, I monitor its memory consumption, and the highest usage I have seen is 22.3 GB. I have tried reducing the amount of data to 50%, 33%, 25%, 10%, 5% and 1% of the full set, to no avail.
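For anyone wanting to check the memory figure from inside the process rather than with an external monitor, here is a minimal sketch using the standard-library resource module. It assumes Linux, where ru_maxrss is reported in KiB, and the helper name peak_rss_gib is hypothetical:

```python
import resource


def peak_rss_gib() -> float:
    """Return the peak resident set size of this process in GiB.

    Assumes Linux, where ru_maxrss is expressed in KiB.
    """
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kib / (1024 ** 2)


print(f"Peak memory so far: {peak_rss_gib():.2f} GiB")
```

Calling it at the end of each objective evaluation would show whether usage climbs from one iteration to the next.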
Sometimes the code completes a few (2-4) iterations of Bayesian optimization, but it always ends in a segmentation fault.
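Since the crash does not always hit on the same iteration, recording the sampled parameters before each evaluation would show which parameter set was active when the process died. A minimal sketch follows. The helper names log_params and last_logged_params and the log file name are hypothetical and not part of bayes_opt or xgboost:

```python
import json
import os


def log_params(path, params):
    """Append one parameter set as a JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(params, sort_keys=True) + "\n")
        fh.flush()
        # fsync so the line survives even if the process is killed right after
        os.fsync(fh.fileno())


def last_logged_params(path):
    """Return the most recently logged parameter set, or None if nothing was logged."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        lines = [line for line in fh if line.strip()]
    return json.loads(lines[-1]) if lines else None
```

The idea would be to call log_params("xgb_cv_params.jsonl", {...}) as the first statement of xgb_cv, passing the arguments it received, and to read the last line back after a crash.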
I used mamba to set up my environment, and these are the versions of the relevant packages I am using:
python3 = 3.11.4, xgboost = 1.7.4, scikit-learn = 1.3.0, numpy = 1.25.0, pandas = 2.0.3
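A minimal sketch of how such a version list can be produced from inside the environment, using only the standard library. The helper name installed_versions is hypothetical, and the package names are the distribution names as I would expect them on PyPI:

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version string."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


print(installed_versions(
    ["xgboost", "bayesian-optimization", "scikit-learn", "numpy", "pandas"]
))
```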
The local machine I am using to run this code has Ubuntu 22.04, 64 GB of memory, and a 13th-gen Intel processor.