I am trying to train a model with XGBoost on data stored in Hive. The data is too large to convert to a pandas df, so I have to use XGBoost with a Spark DataFrame. When creating an XGBoostEstimator, an error occurs:
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored
I have no experience with XGBoost for Spark; I have tried a few tutorials online, but none worked.
I tried to convert to a pandas df, but the data is too large and I always get an OutOfMemoryException
from the Java wrapper (I also looked this up, and the suggested solution of raising the executor memory did not work for me).
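What I tried looks roughly like the following. As far as I understand, toPandas() collects the entire dataset onto the driver, which may be why raising only the executor memory did not help (table name is from my pipeline, the rest is a sketch):

train_df = spark.table('dna.offline_features_train_full')
# toPandas() pulls every row through the driver into local Python memory,
# so the driver, not the executors, is the likely memory bottleneck
train_pdf = train_df.toPandas()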
The latest tutorial I was following is:
After giving up on the XGBoost module, I started using sparkxgb.
import os
from pyspark.sql import SparkSession

def create_spark_session(username=None, app_name="pipeline"):
    if username is not None:
        os.environ['HADOOP_USER_NAME'] = username
    return SparkSession \
        .builder \
        .master("yarn") \
        .appName(app_name) \
        .config(...) \
        .config(...) \
        .getOrCreate()

spark = create_spark_session('shai', 'dna_pipeline')
# make the sparkxgb Python wrapper importable on the driver and executors
spark.sparkContext.addPyFile('resources/sparkxgb.zip')
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

def train():
    train_df = spark.table('dna.offline_features_train_full')
    test_df = spark.table('dna.offline_features_test_full')

    # imported here, after addPyFile, so the zip is already on the path
    from sparkxgb import XGBoostEstimator

    vectorAssembler = VectorAssembler() \
        .setInputCols(train_df.columns) \
        .setOutputCol("features")

    # This is where the program fails
    xgboost = XGBoostEstimator(
        featuresCol="features",
        labelCol="label",
        predictionCol="prediction"
    )
    pipeline = Pipeline().setStages([xgboost])
    pipeline.fit(train_df)
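For completeness, if the constructor worked, I would expect to use the fitted pipeline like any other Spark ML model. A sketch of what I'm aiming for (I assume vectorAssembler also has to be a pipeline stage so the "features" column actually exists, and that "label" should probably be excluded from the assembler's inputs):

feature_cols = [c for c in train_df.columns if c != 'label']
vectorAssembler = VectorAssembler() \
    .setInputCols(feature_cols) \
    .setOutputCol("features")
pipeline = Pipeline().setStages([vectorAssembler, xgboost])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)  # should add the "prediction" column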
The full exception is:
Traceback (most recent call last):
File "/home/elad/DNA/dna/dna/run.py", line 283, in <module>
main()
File "/home/elad/DNA/dna/dna/run.py", line 247, in main
offline_model = train_model(True, home_dir=config['home_dir'], hdfs_client=client)
File "/home/elad/DNA/dna/dna/run.py", line 222, in train_model
model = train(offline_mode=offline, spark=spark)
File "/home/elad/DNA/dna/dna/model/xgboost_train.py", line 285, in train
predictionCol="prediction"
File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/__init__.py", line 105, in wrapper
return func(self, **kwargs)
File "/tmp/spark-7781039b-6821-42be-96e0-ca4005107318/userFiles-70b3d1de-a78c-4fac-b252-2f99a6761b32/sparkxgb.zip/sparkxgb/xgboost.py", line 115, in __init__
File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored
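From what I have found while searching, this py4j error ("'JavaPackage' object is not callable") usually means the JVM class behind the Python wrapper cannot be found, i.e. the xgboost4j and xgboost4j-spark jars are not on the Spark classpath. If that is the cause here, I would expect the session to need something like this (jar names and paths are placeholders, not my actual config):

spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName('dna_pipeline') \
    .config("spark.jars", "/path/to/xgboost4j.jar,/path/to/xgboost4j-spark.jar") \
    .getOrCreate()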
I have not been able to get past this exception, though, and I don't know how to properly integrate sparkxgb into my code.
Help would be appreciated.
Thanks.