I am trying to train a model with XGBoost on data stored in Hive. The data is too large to convert to a pandas df, so I have to use XGBoost with a Spark DataFrame. When creating an XGBoostEstimator, an error occurs:
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored
I have no experience with XGBoost for Spark; I have tried a few tutorials online, but none worked.
I tried to convert to a pandas df, but the data is too large and I always get an OutOfMemoryException
from the Java wrapper (I also looked this up, and the suggested solution of raising the executor memory did not work for me).
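What I tried looks roughly like the following. As far as I understand, toPandas() collects the entire dataset onto the driver, which may be why raising only the executor memory did not help (table name is from my pipeline, the rest is a sketch):

train_df = spark.table('dna.offline_features_train_full')
# toPandas() pulls every row through the driver into local Python memory,
# so the driver, not the executors, is the likely memory bottleneck
train_pdf = train_df.toPandas()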
The latest tutorial I was following is:
After giving up on the XGBoost module, I started using sparkxgb.
import os
from pyspark.sql import SparkSession

def create_spark_session(username=None, app_name="pipeline"):
    if username is not None:
        os.environ['HADOOP_USER_NAME'] = username
    return SparkSession \
        .builder \
        .master("yarn") \
        .appName(app_name) \
        .config(...) \
        .config(...) \
        .getOrCreate()

spark = create_spark_session('shai', 'dna_pipeline')
# make the sparkxgb Python wrapper importable on the driver and executors
spark.sparkContext.addPyFile('resources/sparkxgb.zip')
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

def train():
    train_df = spark.table('dna.offline_features_train_full')
    test_df = spark.table('dna.offline_features_test_full')

    # imported here, after addPyFile, so the zip is already on the path
    from sparkxgb import XGBoostEstimator

    vectorAssembler = VectorAssembler() \
        .setInputCols(train_df.columns) \
        .setOutputCol("features")

    # This is where the program fails
    xgboost = XGBoostEstimator(
        featuresCol="features",
        labelCol="label",
        predictionCol="prediction"
    )
    pipeline = Pipeline().setStages([xgboost])
    pipeline.fit(train_df)
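For completeness, if the constructor worked, I would expect to use the fitted pipeline like any other Spark ML model. A sketch of what I'm aiming for (I assume vectorAssembler also has to be a pipeline stage so the "features" column actually exists, and that "label" should probably be excluded from the assembler's inputs):

feature_cols = [c for c in train_df.columns if c != 'label']
vectorAssembler = VectorAssembler() \
    .setInputCols(feature_cols) \
    .setOutputCol("features")
pipeline = Pipeline().setStages([vectorAssembler, xgboost])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)  # should add the "prediction" column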
The full exception is:
Traceback (most recent call last):
File "/home/elad/DNA/dna/dna/run.py", line 283, in <module>
main()
File "/home/elad/DNA/dna/dna/run.py", line 247, in main
offline_model = train_model(True, home_dir=config['home_dir'], hdfs_client=client)
File "/home/elad/DNA/dna/dna/run.py", line 222, in train_model
model = train(offline_mode=offline, spark=spark)
File "/home/elad/DNA/dna/dna/model/xgboost_train.py", line 285, in train
predictionCol="prediction"
File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/__init__.py", line 105, in wrapper
return func(self, **kwargs)
File "/tmp/spark-7781039b-6821-42be-96e0-ca4005107318/userFiles-70b3d1de-a78c-4fac-b252-2f99a6761b32/sparkxgb.zip/sparkxgb/xgboost.py", line 115, in __init__
File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored
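From what I have found while searching, this py4j error ("'JavaPackage' object is not callable") usually means the JVM class behind the Python wrapper cannot be found, i.e. the xgboost4j and xgboost4j-spark jars are not on the Spark classpath. If that is the cause here, I would expect the session to need something like this (jar names and paths are placeholders, not my actual config):

spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName('dna_pipeline') \
    .config("spark.jars", "/path/to/xgboost4j.jar,/path/to/xgboost4j-spark.jar") \
    .getOrCreate()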
I have not been able to get past this exception, though, and I don't know how to properly integrate sparkxgb into my code.
Help would be appreciated.
Thanks.