
When a GBDT is built with sklearn.ensemble.GradientBoostingClassifier, I get a set of trees. I can figure out the structure of a single tree, but for the ensemble, how do I know the order in which the trees are accessed?

Take the following code for example:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier

    clf = GradientBoostingClassifier(n_estimators=4)
    iris = load_iris()

    clf = clf.fit(iris.data, iris.target)

Then I have 4 boosting stages, each holding one tree per class; for example, the trees for class 0:

    for i in range(4):
        print(clf.estimators_[i,0].tree_)

And I can make predictions with the ensemble:

    clf.predict(np.array([0, 1, 2, 3]).reshape(1, -1))

But in which order are clf.estimators_[0,0].tree_ through clf.estimators_[3,0].tree_ accessed, and how are their results combined?

In the manual, it is said that the classifier "now expose[s] an apply method for retrieving the leaf indices each sample ends up in under each tree".

    clf.apply(np.array([0, 1, 2, 3]).reshape(1, -1))

I obtained the following array:

    [[  1.,   7.,  10.],
     [  1.,   7.,  10.],
     [  4.,   7.,  10.],
     [  1.,   1.,  10.]]

But how do I read it?

Update:

I have read some source code here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1247 It seems that the returned values are node indices within each tree, numbered over all nodes (internal nodes included), not just the leaves. That explains why there are only 8 leaves, but the indices can be larger than 8.
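This reading can be checked with the public API alone. The sketch below assumes only that `apply` returns, for each sample, one node id per tree (shaped `(n_samples, n_estimators, n_classes)`), and that in `tree_.children_left`/`children_right` a leaf is marked by both children being -1:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier

    iris = load_iris()
    clf = GradientBoostingClassifier(n_estimators=4).fit(iris.data, iris.target)

    x = np.array([0, 1, 2, 3]).reshape(1, -1)
    leaves = clf.apply(x)
    # one node id per (stage, class) tree for the single sample
    print(leaves.shape)

    # every returned id should point at a leaf node: both children are -1
    for m in range(clf.estimators_.shape[0]):
        for k in range(clf.estimators_.shape[1]):
            t = clf.estimators_[m, k].tree_
            leaf_id = int(leaves[0, m, k])
            assert t.children_left[leaf_id] == -1
            assert t.children_right[leaf_id] == -1

So an entry like 10 is a node id in a tree whose nodes are numbered 0..n_nodes-1, which is why it can exceed the number of leaves.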

Another update:

After reading the code here and here, I have finally figured out that GBDT's decision_function returns init_value + sum_{for each tree}(learning_rate * leaf_value), and the predicted probability is a simple function of the decision_function.
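This can be verified numerically with the public API. The sketch below assumes that `staged_decision_function` yields the raw score after each boosting stage (so consecutive differences should equal learning_rate times the per-class tree predictions), and that for the multiclass log-loss, `predict_proba` is the softmax of `decision_function`:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier

    iris = load_iris()
    clf = GradientBoostingClassifier(n_estimators=4).fit(iris.data, iris.target)
    X = iris.data[:5]

    # raw scores after each boosting stage
    stages = list(clf.staged_decision_function(X))

    # each stage adds learning_rate * tree.predict(X), per class
    n_classes = clf.estimators_.shape[1]
    for m in range(1, len(stages)):
        step = np.column_stack(
            [clf.estimators_[m, k].predict(X) for k in range(n_classes)])
        assert np.allclose(stages[m] - stages[m - 1], clf.learning_rate * step)

    # multiclass probabilities are the softmax of the raw scores
    raw = clf.decision_function(X)
    e = np.exp(raw - raw.max(axis=1, keepdims=True))
    assert np.allclose(e / e.sum(axis=1, keepdims=True), clf.predict_proba(X))

The first raw score (stages[0]) is init_value plus the first stage's contribution, and every later tree just adds its scaled leaf value, which matches the formula above.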
