
I am training a Decision Tree classifier on some pandas data-frame X.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf = clf.fit(X, y)

Now I walk the tree clf.tree_ and want to get the records (preferably as a data-frame) that belong to each inner node or leaf. What I do at the moment is something like the code below.

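from sklearn.tree._tree import TREE_UNDEFINED

# Feature name per node; leaves have no split feature.
fn = [X.columns[i] if i != TREE_UNDEFINED else "undefined!" for i in clf.tree_.feature]

def recurse(node, tmp):
    tree = clf.tree_
    # test_node is my own per-node check; a truthy result prunes the subtree.
    if test_node(tmp):
        return

    # Inner node: split the records on the node's feature and threshold.
    if tree.feature[node] != TREE_UNDEFINED:
        mask = tmp[fn[node]] <= tree.threshold[node]
        recurse(tree.children_left[node], tmp[mask])
        recurse(tree.children_right[node], tmp[~mask])

recurse(0, X)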

This obviously works, but when doing it for 10K trees I discovered with a profiler that 95+% of the time in my code is spent splitting the data-frame. Running fit on the data is maybe 2%, and what I do with the data-frame at each node is the rest.

Is there a more efficient way to split that data?

I assume the DT has to split the data internally (I can get the number of records per node). Can I somehow have it attach the df to each node?
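As far as I know, the fitted tree keeps only per-node counts (clf.tree_.n_node_samples), not the row indices themselves, so node membership has to be recomputed after fitting. A minimal sketch of one way to do that for the leaves, assuming a toy data-frame made up for illustration: clf.apply(X) returns the leaf each row lands in, and a single groupby then yields one data-frame per leaf. It does not cover inner nodes.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
y = (X["a"] + X["b"] > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# apply() returns, for every row, the id of the leaf it ends up in.
leaf_ids = clf.apply(X)

# One data-frame per leaf, built with a single groupby pass.
leaf_frames = {leaf: part for leaf, part in X.groupby(leaf_ids)}
```

The records of an inner node are the union of the records of the leaves below it, so this only helps directly when the per-node work can be expressed in terms of leaves.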

** UPDATE **

It was suggested to use clf.decision_path(X).toarray(). In this matrix each column j represents a node, and a 1 in row i means that record i passed through node j.

I tried several "methods" to get the df per node using this matrix. All were slower than the naive method I currently use.

Walk tree: default: 2.4888 s +- 0.01 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: no recursion: 2.5427 s +- 0.07 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: decision path Numpy : 16.5346 s +- 0.08 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: decision path Scipy: 8.8154 s +- 0.56 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: decision path Pandas: 28.3901 s +- 0.69 s per loop (mean +- std. dev. of 10 runs, 50 loops each)

For Scipy, the fastest of the methods using this array, I also tried to see whether getting the indices or taking the partial df is what takes the most time.

Walk tree: decision path Scipy: 5.3404 s +- 0.20 s per loop (mean +- std. dev. of 10 runs, 30 loops each)
Walk tree: decision path Scipy (take=False): 4.5698 s +- 0.27 s per loop (mean +- std. dev. of 10 runs, 30 loops each)

I also tried changing the basic recursion above to use df.query(..), but that was also slower.

mibm

1 Answer


I believe

pd.DataFrame(clf.decision_path(X).toarray())

might be what you want. Entry [i, j] in the result will be 1 if observation i went through node j of the tree. Also there is a very good example on the decision tree structure available here that might be helpful.
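If the indicator matrix is used, it may help to keep it sparse rather than calling toarray(). A minimal sketch, assuming toy data and a hypothetical helper rows_at_node: after converting to CSC, the row positions for node j are one contiguous slice of the matrix's indices array. Whether this beats a recursive split depends on the data, as the comments below report.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
y = (X["a"] + X["b"] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# decision_path returns a sparse (n_samples, n_nodes) indicator matrix;
# CSC layout stores each column (node) contiguously.
path = clf.decision_path(X).tocsc()

def rows_at_node(j):
    """Row positions of the samples passing through node j (hypothetical helper)."""
    return path.indices[path.indptr[j]:path.indptr[j + 1]]

# Data-frame of the records at one node, here the root's left child.
left_child = clf.tree_.children_left[0]
df_left = X.iloc[rows_at_node(left_child)]
```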

sply88
  • Unfortunately, getting the records (rows) at a node like this is (considerably) slower than my current method. Depending on how I get the non-zero elements per column `j` in the array, it is anywhere from 1.5x (Scipy and CSC matrix) to 3x slower (pandas df) – mibm Nov 14 '21 at 14:00
  • Just to clarify: the output created by the call in my answer is what you want, but it's just taking too long? Or would you need to do additional processing on the node indicator matrix returned by `clf.decision_path(X)` to get to your desired end result? If so, could you maybe post a short example with a toy dataset including the desired output? – sply88 Nov 18 '21 at 19:49
  • I need additional processing. I need the observations/records at each node, so I need to "slice" the original df by the indices in column `j`. Doing this for all nodes using the matrix is (surprisingly) much slower than the naive method. I ended up changing my algorithm so that I can skip some nodes based on the info the decision tree already holds and got a 2x improvement (not enough, but ok) – mibm Nov 21 '21 at 08:19
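One variation on the recursive split that might reduce the slicing cost is to walk the tree over integer row positions in a NumPy array and only build a data-frame at the nodes that are actually needed. A minimal sketch under assumptions: toy data, all-numeric columns, and a hypothetical helper rows_per_node. It has not been benchmarked against the timings in the question. The values are cast to float32 because, as far as I know, that is the dtype scikit-learn trees use internally when comparing against thresholds.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import TREE_UNDEFINED

# Toy data, made up for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
y = (X["a"] + X["b"] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def rows_per_node(clf, X):
    """Map node id -> array of row positions reaching that node (hypothetical helper)."""
    tree = clf.tree_
    values = X.to_numpy(dtype=np.float32)  # one conversion, no per-node df slicing
    node_rows = {}
    stack = [(0, np.arange(len(X)))]
    while stack:
        node, idx = stack.pop()
        node_rows[node] = idx
        feature = tree.feature[node]
        if feature != TREE_UNDEFINED:
            # Same rule as the tree: left if value <= threshold, else right.
            go_left = values[idx, feature] <= tree.threshold[node]
            stack.append((int(tree.children_left[node]), idx[go_left]))
            stack.append((int(tree.children_right[node]), idx[~go_left]))
    return node_rows

node_rows = rows_per_node(clf, X)

# Build a data-frame only where it is needed, e.g. for the root.
df_root = X.iloc[node_rows[0]]
```

Nodes that can be skipped based on information the tree already holds never need a data-frame at all in this layout, since only their index arrays are computed.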