What's the difference between threshold and feature (for each trained node) in scikit-learn DecisionTreeClassifier?

Question

I've been looking at the data structure of DecisionTreeClassifier in scikit-learn, mainly through this page: https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html , which is helpful because I need to extract the internal data of a trained decision tree. One question came up, though. Each node has a threshold value and a feature value. The threshold is clear to me: at test time, a feature vector (from the test data) is passed into the tree, and at each node one of its features is compared with that node's threshold.

What exactly is the feature (from training data) in the trained tree? The following is the code snippet.

import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf.fit(X_train, y_train)
n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right

# This is an array where one feature value 
# is associated with each node in the tree trained.
# What's the meaning of the feature for each node
# in the trained tree?
feature = clf.tree_.feature
threshold = clf.tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)

# This prints `[ 3 -2  2 -2 -2]`,
# where nodes 1, 3 and 4 (0-indexed) are leaves
# and are marked with -2.
# What do 3 and 2 mean on the two split nodes?
# How were these values determined?
print(feature)

The feature vector has 4 dimensions in this case, and the tree has 5 nodes, counting both leaf and non-leaf nodes. The feature array is [ 3 -2 2 -2 -2], where every node except the 0th and the 2nd is a leaf. The non-leaf nodes are associated with the values 3 and 2. What does this mean? Does it mean that, for a feature vector (from test data) x=(x0, x1, x2, x3), we use x3 at node 0 and compare it with that node's threshold, whereas we use x2 at node 2 and compare it with that node's threshold?

score -1 · Answer 1 · answered Mar 04 '21 at 09:57

I recommend you check this answered question; it is quite useful. Anyway, I will explain it with an example to clarify.

First, the definition that Adam gives: clf.tree_.feature returns the nodes/leaves in sequential order, as a Depth-First Search algorithm would visit them. It starts with the root node and follows the left children until it reaches a leaf (coded with -2). When it reaches a leaf, it climbs back up the tree until it reaches a node whose right child has not been visited yet. From there it descends again until it reaches another leaf.
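If you want to check the ordering yourself instead of taking it on trust, here is a minimal sketch that walks a tree in pre-order (node, then left subtree, then right subtree) and compares the result with the node ids. The helper name preorder_ids is made up for this illustration, and the child arrays are written out by hand for the 5-node tree discussed here; on a fitted model you would pass clf.tree_.children_left and clf.tree_.children_right. Note that the scikit-learn documentation for max_leaf_nodes says the tree is then grown in best-first fashion, so the ids may not follow pre-order in every case. That is worth testing on a bigger tree.

```python
def preorder_ids(children_left, children_right, root=0):
    """Return node ids in pre-order (node, left subtree, right subtree)."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        left, right = children_left[node], children_right[node]
        if left != -1:  # -1 marks "no child", i.e. the node is a leaf
            # Push right first so that the left child is visited first.
            stack.append(right)
            stack.append(left)
    return order


# Child arrays written out by hand for the 5-node tree in this answer
# (an assumption based on the plot; on a fitted model use
# clf.tree_.children_left and clf.tree_.children_right instead).
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]

order = preorder_ids(children_left, children_right)
print(order)
# True means the node ids coincide with a pre-order (depth-first) walk.
print(order == list(range(len(children_left))))
```

If the second print gives False for some tree, the node ids of that tree are not in pre-order.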

Let's see it with an example. First we plot the decision tree:

fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf, 
                   feature_names=iris.feature_names,  
                   class_names=iris.target_names,
                   filled=True)

And now let's print clf.tree_.feature:

array([ 3, -2,  2, -2, -2], dtype=int64)

I will also print the feature names:

iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Let's go deeper. We have clf.tree_.feature equal to [ 3 -2 2 -2 -2]. As the definition says, we explore the tree going down and starting on the left side. Let's explore it in order:

Which is the first feature we find? The root one --> 'petal width (cm)', which is feature 3. Feature=[3]
Next we go down and left (orange leaf). As it is a final leaf, we return -2. Feature=[3, -2]
Now let's go up. We are in the root node again, and now let's go to the right. Guess which feature it is? 'petal length (cm)', feature number 2. Feature=[3, -2, 2]
Let's go down and left (the green leaf). As it is a final leaf, we return -2. Feature=[3, -2, 2, -2]
We go up to 'petal length (cm)' again, and now we move right (strong purple). We are in a leaf node again, so we return -2. Feature=[3, -2, 2, -2, -2]

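So finally we obtain: Feature=[3, -2, 2, -2, -2]. In other words, each non-negative entry is the index of the column of X that is tested at that node, and the entry of clf.tree_.threshold at the same position is the value it is compared against. A -2 marks a leaf, where no test is made.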
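To make the roles of feature and threshold concrete, here is a rough sketch of how one sample is routed through the tree: at each split node, take the column named by feature, compare it with threshold, and go left if the value is less than or equal to it, otherwise right. The function name leaf_for is hypothetical, and the arrays are typed in by hand. The feature values are the ones printed above, while the threshold values 0.8 and 4.95 are illustrative assumptions. On a fitted model you would read all four arrays from clf.tree_.

```python
def leaf_for(x, feature, threshold, children_left, children_right):
    """Follow one sample down the tree and return the id of the leaf it lands in."""
    node = 0  # start at the root
    while feature[node] != -2:  # -2 means "leaf, nothing to test"
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    return node


# Hand-written arrays for the 5-node tree in this answer.
# feature comes from the printed output; the thresholds are
# illustrative values, not read from a fitted model.
feature = [3, -2, 2, -2, -2]
threshold = [0.8, -2.0, 4.95, -2.0, -2.0]
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]

# Samples are (sepal length, sepal width, petal length, petal width).
samples = [
    [5.1, 3.5, 1.4, 0.2],
    [5.9, 3.0, 4.2, 1.5],
    [6.3, 3.3, 6.0, 2.5],
]
for x in samples:
    leaf = leaf_for(x, feature, threshold, children_left, children_right)
    print(x, "-> leaf", leaf)
```

On a real classifier, the leaf ids found this way can be compared with clf.apply(X_test), which should return the leaf each sample ends up in.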

Thanks! Is the fact that the traversal is done in a "Depth-First Search" manner documented somewhere? I could not find the original information. But I could confirm that results on other datasets showed the nodes in DFS order. — user9414424, Mar 08 '21 at 05:59
The example we've shown is not a good example because both DFS/BFS show the same traversal result. — user9414424, Mar 08 '21 at 06:00
I couldn't find documentation confirming that "Depth-First Search" is applied. Based on the link I provided in my answer, I am almost 100% confident that DFS is what is applied. Anyway, to confirm it, try another simple example with more nodes and it will be easy to tell whether the traversal is done in DFS or BFS. — Alex Serra Marrugat, Mar 08 '21 at 07:05
I think pre-order traversal is more appropriate here for the node indexing. — user9414424, Mar 10 '21 at 08:51
