I am training a Decision Tree classifier on some pandas data-frame X.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf = clf.fit(X, y)
Now I walk the tree clf.tree_ and, for each inner node or leaf, want to get the records (preferably as a data-frame) that belong to it. What I do at the moment is something like the code below.
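from sklearn.tree._tree import TREE_UNDEFINED

# Column name of the split feature for every node (leaves have no feature)
fn = [X.columns[i] if i != TREE_UNDEFINED else "undefined!" for i in clf.tree_.feature]

def recurse(node, tmp):
    tree = clf.tree_
    # Per-node work on the records in tmp; stop descending if it returns True
    if self.test_node(tmp):
        return
    if tree.feature[node] != TREE_UNDEFINED:
        # Records with feature value <= threshold go left, the rest go right
        mask = tmp[fn[node]] <= tree.threshold[node]
        recurse(tree.children_left[node], tmp[mask])
        recurse(tree.children_right[node], tmp[~mask])

recurse(0, X)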
This obviously works, but when doing it for 10K trees I discovered with a profiler that 95+% of my code's run time is spent splitting the data-frame. Running fit on the data is maybe 2%, and what I do with the data-frame at each node is the rest.
Is there a more efficient way to split that data?
I assume the DT internally has to split the data (I can get the number of records per node). Can I somehow have it attach the corresponding df to each node?
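To make the parenthetical above concrete: the per-node counts are exposed as clf.tree_.n_node_samples, but as far as I can tell the fitted tree keeps only those counts, not the row indices. For leaves only, clf.apply(X) returns the leaf index of every record, so the leaf data-frames can be had with a single groupby. The snippet below is a minimal sketch on made-up data (the data set, column names and tree parameters are placeholders, not my real setup).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the real data-frame (assumption for illustration)
X_arr, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Number of training records that reached each node (inner nodes and leaves)
counts = clf.tree_.n_node_samples

# Leaf index for every record, then one data-frame per leaf
leaf_ids = clf.apply(X)
leaf_frames = {leaf: part for leaf, part in X.groupby(leaf_ids)}
```

This covers leaves only; inner nodes still need one of the other approaches.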
** UPDATE **
It was suggested to use clf.decision_path(X).toarray(). In this matrix each column j represents a node, and a 1 in row i means that record i passed through that node.
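For reference, here is a minimal sketch of reading that matrix per node without densifying it. It assumes toy data (placeholder names and parameters) and converts the sparse result to CSC, where the row indices of each column are stored contiguously; rows_in_node is a hypothetical helper name, not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the real data-frame (assumption for illustration)
X_arr, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# decision_path returns a sparse (n_records, n_nodes) indicator matrix;
# CSC layout stores, per column (node), the rows (records) that are set
path = clf.decision_path(X).tocsc()

def rows_in_node(node):
    """Row positions of the records that passed through `node` (hypothetical helper)."""
    return path.indices[path.indptr[node]:path.indptr[node + 1]]

# One data-frame per node, selected by position
node_frames = {n: X.iloc[rows_in_node(n)] for n in range(clf.tree_.node_count)}
```

Whether this beats the recursive masking depends on tree depth and data size; it is not one of the variants timed below.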
I tried several "methods" to get the df per node using this matrix. All were slower than the naive method I currently use.
Walk tree: default: 2.4888 s +- 0.01 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: no recursion: 2.5427 s +- 0.07 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: decision path Numpy : 16.5346 s +- 0.08 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: decision path Scipy: 8.8154 s +- 0.56 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
Walk tree: decision path Pandas: 28.3901 s +- 0.69 s per loop (mean +- std. dev. of 10 runs, 50 loops each)
For Scipy, the fastest of the methods using this matrix, I also checked whether getting the indices or taking the partial df is what takes the most time.
Walk tree: decision path Scipy: 5.3404 s +- 0.20 s per loop (mean +- std. dev. of 10 runs, 30 loops each)
Walk tree: decision path Scipy (take=False): 4.5698 s +- 0.27 s per loop (mean +- std. dev. of 10 runs, 30 loops each)
I also tried changing the basic recursion above to use df.query(..), but that was also slower.
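One more variant of the basic recursion that may be worth timing: carry integer row positions through the tree and compare against a plain NumPy array, so that no intermediate data-frame is built at the inner nodes. This is only a sketch on made-up data; rows_per_node is a hypothetical name, and the cast to float32 is there because scikit-learn trees work on float32 copies of the input. It is not included in the timings above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._tree import TREE_UNDEFINED

# Toy data standing in for the real data-frame (assumption for illustration)
X_arr, y = make_classification(n_samples=300, n_features=6, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

def rows_per_node(clf, X):
    """Map node id -> positions of the rows of X that reach that node."""
    tree = clf.tree_
    values = X.to_numpy(dtype=np.float32)  # same precision the tree uses
    rows = {}
    stack = [(0, np.arange(len(values)))]  # start at the root with all rows
    while stack:
        node, idx = stack.pop()
        rows[node] = idx
        feat = tree.feature[node]
        if feat != TREE_UNDEFINED:  # inner node: split the positions
            mask = values[idx, feat] <= tree.threshold[node]
            stack.append((tree.children_left[node], idx[mask]))
            stack.append((tree.children_right[node], idx[~mask]))
    return rows

rows = rows_per_node(clf, X)
# Materialise a data-frame only where it is actually needed, e.g. for node 0
root_df = X.iloc[rows[0]]
```

The data-frame for a node is then built on demand with X.iloc[rows[node]], which keeps pandas indexing out of the inner loop.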