I'm trying to run a binary classification on a large dataset (5 million rows x 450 features) using the XGBoost Spark library on AWS EMR.
I've tried many different combinations of settings, such as:
- number of XGBoost workers, nthread, spark.task.cpus, spark.executor.instances, spark.executor.cores (a sketch of how I wire these together is below).
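For concreteness, this is roughly how I'm wiring those settings together, using the xgboost4j-spark Scala API (the values below are just one illustrative combination, not the exact ones from my runs):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.sql.SparkSession

// Illustrative values only -- one of many combinations I've tried.
val spark = SparkSession.builder()
  .appName("xgb-binary-classification")
  .config("spark.executor.instances", "10")  // one executor per worker node
  .config("spark.executor.cores", "8")
  .config("spark.task.cpus", "8")            // so each executor runs one XGBoost task at a time
  .getOrCreate()

// num_workers should not exceed the number of concurrent task slots
// (executor.instances * executor.cores / task.cpus), and nthread is
// usually matched to spark.task.cpus.
val xgb = new XGBoostClassifier(Map[String, Any](
  "objective"   -> "binary:logistic",
  "num_round"   -> 1000,
  "num_workers" -> 10,
  "nthread"     -> 8
))
  .setFeaturesCol("features")
  .setLabelCol("label")
```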
Even though training time changes somewhat between configurations, when I look at the cluster load in Ganglia it is always low. I've been trying to maximize resource usage to get a faster classification, since I'm running 1000 boosting rounds on XGBoost, but no matter which parameters I set I end up with roughly the same low utilization.
Here's the EMR setup I'm using:
- Master node: 1 m4.xlarge
- Worker nodes: 10 m4.2xlarge
- Total vCores on workers: 160
Some of the different parameter combinations I've tried are shown here: Different Spark and XGBoost configs I've tried
I'm running 1000 boosting rounds, with 4-fold cross-validation and some hyperparameter tuning (36 possible combinations). Each boosting iteration takes around 1 second, so the full training (36 combinations x 4 folds x 1000 rounds ≈ 144,000 iterations) will take around 40 hours.
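The tuning itself is a plain Spark CrossValidator over a parameter grid, roughly like this (the grid below is illustrative; it just happens to also produce 36 combinations, and my real parameters and values differ):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Illustrative grid: 4 * 3 * 3 = 36 combinations.
// `xgb` is the XGBoostClassifier from the snippet above,
// `trainDF` my 5M x 450 training DataFrame.
val paramGrid = new ParamGridBuilder()
  .addGrid(xgb.maxDepth, Array(4, 6, 8, 10))
  .addGrid(xgb.eta, Array(0.05, 0.1, 0.3))
  .addGrid(xgb.subsample, Array(0.7, 0.85, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(xgb)
  .setEvaluator(new BinaryClassificationEvaluator())  // defaults to areaUnderROC
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(4)

val cvModel = cv.fit(trainDF)
```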
And the cluster load is really low, as shown here: Cluster usage
Any tips on how I can make better use of my cluster resources and get faster training? Is there something I'm missing when setting the number of XGBoost workers, Spark executors, or other configs? Or is there nothing more to tune, and this cluster setup is simply overkill for this workload?