I observed that scikit-learn's clf.tree_.feature occasionally returns negative values, for example -2. As far as I understand, clf.tree_.feature is supposed to return the sequential order of the features. If we have an array of feature names ['feature_one', 'feature_two', 'feature_three'], then -2 would refer to feature_two. I am surprised by the use of a negative index. It would make more sense to refer to feature_two by index 1 (-2 is a reference convenient for human digestion, not for machine processing). Am I reading it correctly?

Update: Here is an example:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_ordering():
    X = np.genfromtxt('X.csv', delimiter=',')
    Y = np.genfromtxt('Y.csv', delimiter=',')
    dt = DecisionTreeClassifier(min_samples_leaf=10, random_state=99)
    dt.fit(X, Y)
    print(dt.tree_.feature)

Here are the files X and Y

Here is the output:

    [ 8  9 -2 -2  9  4 -2  9  8 -2 -2  0  0  9  9  8 -2 -2  9 -2 -2  6 -2 -2 -2
  2 -2  9  8  6  9 -2 -2 -2  8  9 -2  9  6 -2 -2 -2  6 -2 -2  9 -2  6 -2 -2
  2 -2 -2]
desertnaut
user1700890

2 Answers

Reading the Cython source code for the tree builder shows that the -2's are just dummy values for the leaf nodes' split feature attribute.

Line 63

TREE_UNDEFINED = -2

Line 359

if is_leaf:
    # Node is not expandable; set node as leaf
    node.left_child = _TREE_LEAF
    node.right_child = _TREE_LEAF
    node.feature = _TREE_UNDEFINED
    node.threshold = _TREE_UNDEFINED
absolutelyNoWarranty

As you write, clf.tree_.feature lists the nodes in sequential order, specifically the order of a depth-first search. It starts at the root node and follows the left children until it reaches a leaf (coded as -2). From a leaf it climbs back up to the nearest node whose right child has not been visited yet, then descends again through that subtree until it reaches the next leaf.

In your example, the root node splits on feature 8 and has a left child that splits on feature 9. Descending from there, we immediately reach a leaf. Going back up, the next node (the right child) is a leaf as well, so both children of that feature 9 node are leaves. Climbing further up, we reach feature 9 again on the first level of the hierarchy. This node has a left child, feature 4, whose left child is a leaf; feature 4's right child is feature 9 again, and so on.
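
The ordering described above can be checked with a short sketch. It assumes the default depth-first tree builder (that is, `max_leaf_nodes` is not set) and again uses the iris data as a stand-in for the question's files; `preorder` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def preorder(children_left, children_right, root=0):
    """Return node ids in depth-first order, visiting left subtrees first."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        left, right = int(children_left[node]), int(children_right[node])
        if left != -1:  # -1 marks a leaf, which has no children
            stack.append(right)  # pushed first so it is visited after the left subtree
            stack.append(left)
    return order

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(min_samples_leaf=10, random_state=99).fit(X, y)
t = clf.tree_

# If node ids follow depth-first order, the traversal is simply 0, 1, 2, ...
order = preorder(t.children_left, t.children_right)
print(order == list(range(t.node_count)))
```

The arrays `children_left` and `children_right` give the node ids of each node's children, so position i in `tree_.feature` is the split feature of node i, or -2 if node i is a leaf.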

desertnaut
Adam