I'm trying to run a binary classification on a large dataset (5 million rows x 450 features) using the XGBoost Spark library on AWS EMR.
I've tried many different combinations of settings, such as:
- number of XGBoost workers, nthread, spark.task.cpus, spark.executor.instances, spark.executor.cores (a sketch of how I wire these together is below).
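For concreteness, this is roughly how I'm wiring those settings together, using the xgboost4j-spark Scala API (the values below are just one illustrative combination, not the exact ones from my runs):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import org.apache.spark.sql.SparkSession

// Illustrative values only -- one of many combinations I've tried.
val spark = SparkSession.builder()
  .appName("xgb-binary-classification")
  .config("spark.executor.instances", "10")  // one executor per worker node
  .config("spark.executor.cores", "8")
  .config("spark.task.cpus", "8")            // so each executor runs one XGBoost task at a time
  .getOrCreate()

// num_workers should not exceed the number of concurrent task slots
// (executor.instances * executor.cores / task.cpus), and nthread is
// usually matched to spark.task.cpus.
val xgb = new XGBoostClassifier(Map[String, Any](
  "objective"   -> "binary:logistic",
  "num_round"   -> 1000,
  "num_workers" -> 10,
  "nthread"     -> 8
))
  .setFeaturesCol("features")
  .setLabelCol("label")
```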
Even though training time changes somewhat between configurations, when I look at the cluster load in Ganglia it is always low. I've been trying to maximize resource usage to get a faster classification, since I'm running 1000 boosting rounds on XGBoost, but no matter which parameters I set I end up with roughly the same low utilization.
Here's the EMR setup I'm using:
- Master node: 1 m4.xlarge
- Worker nodes: 10 m4.2xlarge
- Total vCores on workers: 160
Some of the different parameter combinations I've tried are shown here: Different Spark and XGBoost configs I've tried
I'm running 1000 boosting rounds, with 4-fold cross-validation and some hyperparameter tuning (36 possible combinations). Each boosting iteration takes around 1 second, so the full training (36 combinations x 4 folds x 1000 rounds ≈ 144,000 iterations) will take around 40 hours.
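The tuning itself is a plain Spark CrossValidator over a parameter grid, roughly like this (the grid below is illustrative; it just happens to also produce 36 combinations, and my real parameters and values differ):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Illustrative grid: 4 * 3 * 3 = 36 combinations.
// `xgb` is the XGBoostClassifier from the snippet above,
// `trainDF` my 5M x 450 training DataFrame.
val paramGrid = new ParamGridBuilder()
  .addGrid(xgb.maxDepth, Array(4, 6, 8, 10))
  .addGrid(xgb.eta, Array(0.05, 0.1, 0.3))
  .addGrid(xgb.subsample, Array(0.7, 0.85, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(xgb)
  .setEvaluator(new BinaryClassificationEvaluator())  // defaults to areaUnderROC
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(4)

val cvModel = cv.fit(trainDF)
```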
And the cluster load is really low, as shown here: Cluster usage
Any tips on how I can make better use of my cluster resources and get faster training? Is there something I'm missing when setting the number of XGBoost workers, Spark executors, or other configs? Or is there nothing more to tune, and this cluster setup is simply overkill for this workload?