I am using the xgboost PySpark API. This API is experimental, but it supports most of the features of the regular xgboost API.
As per the documentation below, the eval_set parameter is not supported; the validationIndicatorCol parameter should be used instead.
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.spark
https://databricks.github.io/spark-deep-learning/#module-sparkdl.xgboost
xgb = XgboostClassifier(
    featuresCol='features',
    labelCol='label',
    num_workers=40,
    random_state=1,
    missing=None,
    objective='binary:logistic',
    validationIndicatorCol='isVal',
    eval_metric='aucpr',
    n_estimators=best_n_estimators,
    max_depth=best_max_depth,
    learning_rate=best_learning_rate,
)
pipeline = Pipeline(stages=[vectorAssembler, xgb])
pipelineModel = pipeline.fit(sampled_df)
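For context, the isVal column marks which rows are held out for validation. A minimal sketch of how such an indicator column can be built with a random split (the 0.8 threshold and the seed are illustrative assumptions, not part of my actual setup):

from pyspark.sql import functions as F

# Assumption: flag roughly 20% of rows as validation data. The model trains
# on rows where isVal is False and evaluates eval_metric on rows where it is True.
sampled_df = sampled_df.withColumn('isVal', F.rand(seed=1) > 0.8)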
It seems to run without any errors, which is great.
How do you print and inspect the evaluation results? Traditional xgboost has an evals_result() method, but pipelineModel.stages[-1].evals_result() doesn't seem to work in the PySpark API. I expected it to work, since the PySpark API documentation doesn't say otherwise. Any idea on how to make it work?
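For comparison, this is the behaviour I am after from the non-Spark sklearn-style API (a minimal sketch; the arrays are random placeholder data just to make it runnable):

import numpy as np
from xgboost import XGBClassifier

# Placeholder data, for illustration only.
X_train, y_train = np.random.rand(100, 5), np.random.randint(0, 2, 100)
X_val, y_val = np.random.rand(20, 5), np.random.randint(0, 2, 20)

model = XGBClassifier(objective='binary:logistic', eval_metric='aucpr')
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Dict of per-iteration metrics, e.g. {'validation_0': {'aucpr': [...]}}.
print(model.evals_result())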