
I tried GBDTs with both Python's sklearn and Spark's local stand-alone MLlib implementation, with default settings, for a binary classification problem. I kept the numIterations and the loss function the same in both cases. The features are all real-valued and continuous. However, the AUC of the MLlib implementation was way off compared to sklearn's. These were the parameters for sklearn's classifier:

GradientBoostingClassifier(
    init=None, learning_rate=0.001, loss='deviance', max_depth=8,
    max_features=None, max_leaf_nodes=None, min_samples_leaf=1, 
    min_samples_split=2, min_weight_fraction_leaf=0.0, 
    n_estimators=100, random_state=None, subsample=1.0, 
    verbose=0, warm_start=False) 
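
For comparison, the default-settings MLlib run can be reproduced roughly like this with the RDD-based API (a sketch only: the tiny inline dataset is just a placeholder for the real data, and BinaryClassificationMetrics is one way to read off the AUC):

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import GradientBoostedTrees
    from pyspark.mllib.evaluation import BinaryClassificationMetrics

    sc = SparkContext("local", "gbt-defaults")

    # Placeholder data: binary labels, all features real-valued and continuous.
    data = sc.parallelize([
        LabeledPoint(0.0, [0.1, 1.2]),
        LabeledPoint(1.0, [2.3, 0.4]),
        LabeledPoint(0.0, [0.2, 1.1]),
        LabeledPoint(1.0, [2.1, 0.5]),
    ])

    # All tuning parameters left at their defaults.
    model = GradientBoostedTrees.trainClassifier(data, categoricalFeaturesInfo={})

    # Evaluate on the same placeholder data, just to show the mechanics.
    predictions = model.predict(data.map(lambda lp: lp.features))
    score_and_labels = predictions.zip(data.map(lambda lp: lp.label))
    print(BinaryClassificationMetrics(score_and_labels).areaUnderROC)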

I wanted to check if there's a way to figure out and set these params in MLlib, or if MLlib also assumes the same settings (which are pretty standard).

Any pointers to figure out the difference would be helpful.

Darth_SK

1 Answer


Both the set of customizable parameters and the default values differ between scikit-learn and Spark MLlib. In particular, the default learning rate in Spark is 0.1 and the default maximum depth is 3.
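
If the intent is just to mirror the sklearn configuration, those values can be passed explicitly to the RDD-based API. A minimal sketch, assuming trainingData is an RDD of LabeledPoint:

    from pyspark.mllib.tree import GradientBoostedTrees

    # trainingData is assumed to be an RDD[LabeledPoint] with only continuous features.
    # These keyword arguments mirror the sklearn run above: learning_rate=0.001,
    # max_depth=8, n_estimators=100, loss='deviance' (i.e. log loss).
    model = GradientBoostedTrees.trainClassifier(
        trainingData,
        categoricalFeaturesInfo={},   # no categorical features
        loss="logLoss",               # MLlib's log loss, analogous to 'deviance'
        numIterations=100,
        learningRate=0.001,
        maxDepth=8,
    )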

Much more important, though, are the changes in the algorithm required to achieve reasonable scaling. Probably the most significant is the binning of continuous variables. So it is rather unlikely you'll get the same results, even if the input parameters look more or less the same at first glance.
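
The binning itself is exposed through the maxBins parameter. A sketch of tuning it, again assuming trainingData is a placeholder RDD of LabeledPoint:

    from pyspark.mllib.tree import GradientBoostedTrees

    # maxBins caps how many candidate split thresholds each continuous feature is
    # discretized into (default 32). Raising it brings the split search closer to
    # sklearn's exact search, at the cost of time and memory; it cannot be disabled.
    model = GradientBoostedTrees.trainClassifier(
        trainingData,                 # assumed: RDD[LabeledPoint]
        categoricalFeaturesInfo={},
        maxBins=128,
    )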

See also: Scalable Distributed Decision Trees in Spark MLlib.

zero323
  • Thanks @zero323. I set the maxDepth the same in both. But binning is something I didn't know Spark was doing. My impression was that GBDT uses a regression tree and would not use binning unless specified. Is there a way to set/change these params in Spark, like turning off binning? – Darth_SK Dec 13 '15 at 11:30
  • Binning is customizable but cannot really be turned off. You could set it to a number larger than the number of samples, but that is a rather bad idea. – zero323 Dec 13 '15 at 13:39
  • I agree. However, I wonder how the MLlib team benchmarked and compared against existing libraries. Although it could be data specific, I saw the ROC AUC drop to 0.53 with MLlib from 0.67 with scikit-learn, where every feature is real-valued. – Darth_SK Dec 14 '15 at 13:13