When a GBDT is built with sklearn.ensemble.GradientBoostingClassifier, I get a set of trees. I can figure out the structure of a single tree, but for the set of trees, how do I know in which order the trees are accessed?
Take the following code for example:
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=4)
iris = load_iris()
clf = clf.fit(iris.data, iris.target)
Then I have 4 boosting stages (iris has 3 classes, so estimators_ has shape (4, 3) with one tree per class per stage), and I can inspect, for example, the class-0 tree of each stage:
for i in range(4):
    print(clf.estimators_[i, 0].tree_)
And I can make predictions with the ensemble:
import numpy as np
clf.predict(np.array([0, 1, 2, 3]).reshape(1, -1))
But in what order are clf.estimators_[0,0].tree_ through clf.estimators_[3,0].tree_ accessed, and how are their results combined?
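One way to check the order empirically is the public staged_decision_function method, which yields the running prediction after each boosting stage. A minimal sketch (random_state=0 is my addition, just for reproducibility):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
clf = GradientBoostingClassifier(n_estimators=4, random_state=0)
clf.fit(iris.data, iris.target)

X = iris.data[:1]

# staged_decision_function yields the cumulative prediction after each
# boosting stage, i.e. the trees are applied in index order 0, 1, 2, 3.
stages = list(clf.staged_decision_function(X))
assert len(stages) == 4
# After the last stage, the running sum equals the ordinary decision_function.
assert np.allclose(stages[-1], clf.decision_function(X))
```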
In the manual, it is said that these estimators "now expose an apply method for retrieving the leaf indices each sample ends up in under each tree".
clf.apply(np.array([0, 1, 2, 3]).reshape(1, -1))
I obtained the following array:
[[ 1., 7., 10.],
[ 1., 7., 10.],
[ 4., 7., 10.],
[ 1., 1., 10.]]
But how should I read it?
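The returned array has shape (n_samples, n_estimators, n_classes): each row of the printed output is one boosting stage, and each column is one class. Entry [i, j, k] is the id of the leaf node that sample i reaches in the tree of stage j for class k, the same id that tree's own apply method returns. A sketch that verifies this (random_state=0 is my addition for reproducibility):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
clf = GradientBoostingClassifier(n_estimators=4, random_state=0)
clf.fit(iris.data, iris.target)

X = np.array([0, 1, 2, 3], dtype=float).reshape(1, -1)
leaves = clf.apply(X)   # shape (n_samples, n_estimators, n_classes)
print(leaves.shape)     # (1, 4, 3): 1 sample, 4 stages, 3 classes

# leaves[0, j, k] is the leaf id the sample reaches in the tree for
# stage j and class k -- identical to that tree's own apply result.
for j in range(4):
    for k in range(clf.n_classes_):
        assert leaves[0, j, k] == clf.estimators_[j, k].apply(X)[0]
```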
Update: ------
I have read some source code here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1247 It seems that the indices are just node indices within each tree, numbered together with the non-leaf nodes. That explains why a tree has only 8 leaves, yet the leaf indices can be larger than 8.
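This is easy to confirm from the fitted tree structure: children_left is -1 exactly at the leaves, and the leaf ids share one numbering with the internal nodes, so they range over [0, node_count) rather than [0, n_leaves). A sketch (random_state=0 is my addition for reproducibility):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
clf = GradientBoostingClassifier(n_estimators=4, random_state=0)
clf.fit(iris.data, iris.target)

tree = clf.estimators_[0, 0].tree_   # first stage, class 0
is_leaf = tree.children_left == -1   # leaf nodes have no children
print("total nodes:", tree.node_count)
print("leaf node ids:", np.where(is_leaf)[0])  # ids can exceed the leaf count
```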
Another update: ------
After reading the code here and here, I have finally figured out that GBDT's decision_function returns init_value + sum_{for each tree}(learning_rate * leaf_value), and the predicted probability is a simple transformation of decision_function (a sigmoid in the binary case, a softmax in the multiclass case).
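This decomposition can be checked numerically. The sketch below (random_state=0 is my addition for reproducibility) sums learning_rate * tree prediction over all stages and classes; the remainder relative to decision_function should then be the init value, which is the same constant for every sample:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

iris = load_iris()
clf = GradientBoostingClassifier(n_estimators=4, random_state=0)
clf.fit(iris.data, iris.target)

X = iris.data
contrib = np.zeros((X.shape[0], clf.n_classes_))
for stage in clf.estimators_:        # stages are applied in fit order
    for k in range(clf.n_classes_):  # one regression tree per class
        contrib[:, k] += clf.learning_rate * stage[k].predict(X)

# decision_function = init_value + sum of scaled tree predictions,
# so subtracting the tree contributions leaves the per-class init value.
init = clf.decision_function(X) - contrib
assert np.allclose(init, init[0])    # identical init row for every sample
```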