
I am using XGBoost to train on ~1 million rows and ~15 features from the Kaggle Rossmann Store Sales competition, and it seems very slow. It took 30 minutes to train the model with no parameter tuning. If I run GridSearchCV with 3 folds and 6 learning-rate values, it takes more than 10 hours to return. As this is my first time using XGBoost, I don't know whether this is normal. I can't imagine how many days it would take to tune all the parameters of an XGBoost model. Please help.

The model parameters: XGBRegressor(learning_rate=0.1, max_depth=5, n_estimators=1165, subsample=0.8, colsample_bytree=0.8, seed=27). I use n_estimators=1165 because it was returned by xgboost.train as the best iteration. I also changed nthread from 1 to 4, and it didn't improve performance at all. A sketch of the setup is below.
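
For context, a minimal sketch of the setup described above. Only the XGBRegressor parameters come from the question; the file path, target column, and the grid of six learning-rate values are assumptions.

```python
# Rough reconstruction of the setup in the question; the CSV path, target
# column, and the learning-rate grid are placeholders, not the actual values.
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

data = pd.read_csv("rossmann_train_features.csv")  # ~1M rows, ~15 features (assumed)
y = data.pop("Sales")                               # assumed target column
X = data

model = XGBRegressor(
    learning_rate=0.1,
    max_depth=5,
    n_estimators=1165,   # best iteration reported by xgboost.train
    subsample=0.8,
    colsample_bytree=0.8,
    seed=27,
    nthread=4,
)

# 3-fold grid search over 6 learning-rate values, as described above
param_grid = {"learning_rate": [0.01, 0.03, 0.05, 0.1, 0.2, 0.3]}
search = GridSearchCV(model, param_grid, cv=3, scoring="neg_mean_squared_error")
# search.fit(X, y)   # this is the step that takes 10+ hours
```

With roughly 30 minutes per fit, 3 folds x 6 values means 18 fits, so 10+ hours is consistent with the single-model training time.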

My computer configuration is: CPU: Intel i7-6500U (2 cores, 4 threads); Memory: 8 GB; OS: Windows 10.

Justin
  • Training models can take a ton of time with certain data sets, regardless of the machine learning model. For XGBoost, training time will vary depending on your hyperparameters, so your training time doesn't seem unreasonable to me. The long training times are why some people have worked hard on developing optimization techniques for 'smarter' hyperparameter tuning (vs. using a grid search). – rmilletich Jan 10 '18 at 21:24
  • I am using these hyperparameters: XGBRegressor(learning_rate=0.1, max_depth=5, n_estimators=1165, subsample=0.8, colsample_bytree=0.8, seed=27). I use n_estimators=1165 because it was returned by xgboost.train as the best iteration. I also changed nthread from 1 to 4 and it doesn't improve performance at all. How does multithreading work for xgboost? – Justin Jan 10 '18 at 21:30
  • To clarify here, hyperparameters are just another name for the model's parameters. Just about every model has some parameters (often called hyperparameters) that the user can tune or adjust. With my previous comment, I was saying that certain hyperparameters can make training take longer. For example, in the case of XGBoost, having more trees that are grown deeper will slow down the training process vs. having fewer, shallower trees (such as 'stumps'). – rmilletich Jan 10 '18 at 21:35
  • "Increasing the number of threads doesn't improve performance" is a bit unclear. Do you mean predictive performance of the model or the model's runtime? From my understanding, XGBoost uses OpenMP for parallel processing: https://en.wikipedia.org/wiki/OpenMP. Check out this page for more detailed info on multithreading in XGBoost: https://machinelearningmastery.com/best-tune-multithreading-support-xgboost-python/ – rmilletich Jan 10 '18 at 21:38
  • I mean the model training time is not improved after changing nthread from 1 to 4 – Justin Jan 10 '18 at 21:40
  • First thought: if training time is not decreasing, check your install of XGBoost to ensure multithreading is installed/enabled. Some time ago I installed XGBoost on Windows without multithreading support, so I would start there. – rmilletich Jan 10 '18 at 21:47
  • I really can't tell whether the version I installed is built with multithreading or not. Can you give me a hint? XGBoost v0.6 was installed with conda – Justin Jan 10 '18 at 22:02

1 Answer


From what I've gathered, XGB takes a lot of time to run, and your data is large; even on small, simple datasets it takes me a long time. You can try switching the tree-growing policy by setting tree_method to hist. With a GPU it should be set to gpu_hist, which runs much faster, but I think it would still take a while. A sketch is below.
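
A minimal sketch of that suggestion, assuming an XGBoost version recent enough to support the tree_method parameter; all other parameters are copied from the question.

```python
# Switching the tree construction algorithm to the histogram-based method.
# "gpu_hist" additionally requires a CUDA-enabled XGBoost build.
from xgboost import XGBRegressor

fast_model = XGBRegressor(
    tree_method="hist",   # use "gpu_hist" if a compatible GPU is available
    learning_rate=0.1,
    max_depth=5,
    n_estimators=1165,
    subsample=0.8,
    colsample_bytree=0.8,
    nthread=4,
)
# fast_model.fit(X, y)   # X, y as in the question
```

The histogram method bins continuous features instead of enumerating every candidate split point, which is usually much faster on large datasets.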

Explanation and more about XGB

XGBoost (Extreme Gradient Boosting) is known to be a powerful machine learning algorithm that can achieve very high performance on a wide range of problems. However, this power comes at the cost of computational complexity and training time.

XGBoost uses gradient boosting, which is an iterative method that trains a sequence of models, each one learning to correct the mistakes of the previous model. This process can be computationally intensive, especially when working with large datasets or when searching for optimal hyperparameters using grid search.

Additionally, XGBoost has many hyperparameters that can be tuned to achieve optimal performance on a specific problem. Tuning these hyperparameters can be time-consuming, as it requires training and evaluating many different models.
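
As a rough illustration of tuning that is cheaper than an exhaustive grid, a randomized search evaluates only a sample of the combinations. The parameter ranges below are illustrative assumptions, not recommendations for this dataset.

```python
# Randomized search: tries n_iter sampled combinations instead of all 45.
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    "max_depth": [3, 5, 7],
    "subsample": [0.6, 0.8, 1.0],
}
search = RandomizedSearchCV(
    XGBRegressor(n_estimators=300, tree_method="hist"),
    param_distributions,
    n_iter=10,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
# search.fit(X, y)
```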

When adding XGBoost to a voting classifier, the time required to train and evaluate the model can increase significantly due to the complexity of the algorithm and the number of hyperparameters that need to be tuned. This is especially true when compared to simpler models such as Random Forest or Extra Trees, which have fewer hyperparameters and are generally faster to train.
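
For completeness, a sketch of what such an ensemble might look like. Since the Rossmann task is regression, scikit-learn's VotingRegressor is used here rather than a voting classifier, and the estimator settings are placeholders.

```python
# Voting ensemble combining XGBoost with two faster tree ensembles
# (VotingRegressor requires scikit-learn >= 0.21).
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, ExtraTreesRegressor
from xgboost import XGBRegressor

ensemble = VotingRegressor([
    ("xgb", XGBRegressor(n_estimators=300, tree_method="hist")),
    ("rf", RandomForestRegressor(n_estimators=200, n_jobs=-1)),
    ("et", ExtraTreesRegressor(n_estimators=200, n_jobs=-1)),
])
# ensemble.fit(X, y)   # overall training time is dominated by the XGBoost member
```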

M.Ayman