
I am using the xgboost PySpark API. The API is experimental, but it supports most of the features of the regular xgboost API.

As per the documentation below, the eval_set parameter is not supported; the validationIndicatorCol parameter should be used instead.

  1. https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.spark

  2. https://databricks.github.io/spark-deep-learning/#module-sparkdl.xgboost
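
The validation-indicator column itself can be created with a plain random split (a minimal sketch; the 20% fraction is illustrative, and the column name has to match whatever is passed as validationIndicatorCol):

    from pyspark.sql import functions as F

    # Flag roughly 20% of rows as the validation set; flagged rows are scored
    # with eval_metric during training instead of being trained on.
    sampled_df = sampled_df.withColumn("isVal", F.rand(seed=1) < 0.2)

The estimator and pipeline are then set up like this: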

    from pyspark.ml import Pipeline
    from sparkdl.xgboost import XgboostClassifier

    xgb = XgboostClassifier(featuresCol="features",
                            labelCol="label",
                            num_workers=40,
                            random_state=1,
                            missing=None,
                            objective='binary:logistic',
                            validationIndicatorCol='isVal',
                            eval_metric='aucpr',
                            n_estimators=best_n_estimators,
                            max_depth=best_max_depth,
                            learning_rate=best_learning_rate)

    # vectorAssembler (defined earlier) produces the "features" vector column.
    pipeline = Pipeline(stages=[vectorAssembler, xgb])
    pipelineModel = pipeline.fit(sampled_df)
    

It seems to run without any errors, which is great.

How do you print and inspect the evaluation results? Traditional xgboost has an evals_result() method, but pipelineModel.stages[-1].evals_result() doesn't seem to work in the PySpark API, even though the PySpark API documentation doesn't say the method is unsupported. Any idea how to make it work?
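
For concreteness, this is the call that fails (assuming the fitted xgboost model is the last pipeline stage):

    # The fitted xgboost model is the last stage of the pipeline.
    model = pipelineModel.stages[-1]

    # Fine in the sklearn-style xgboost API, but doesn't work here:
    print(model.evals_result())

(For what it's worth, the newer xgboost.spark models do expose the underlying Booster via get_booster(), but a Booster object has no evals_result() method either.)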

Vusal
  • I am attempting to do something similar, except with LightGBM (whose PySpark interface largely mirrors that of XGBoost for PySpark). I will let you know what I find. One question for you: are you passing the 'isVal' column to the VectorAssembler as one of the inputCols, or no? – Greg Aponte Feb 13 '23 at 20:07
  • Is there LightGBM for PySpark? I didn't know that. Yes, the isVal column should be fed into the VectorAssembler. – Vusal Mar 23 '23 at 16:23

0 Answers