Summary:
I have an array `x` of length `n` and can run all kinds of O(n log(n)) operations on `x` and cache the results. E.g. I can pre-compute `indices = argsort(x)` and `order = argsort(argsort(x))`. Now, given `sample`, which is an array of indices `0` to `n` of length at most `n`, I would like to compute `argsort(x[sample])` in O(length(sample)) time (or as fast as possible). Is this possible?
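For concreteness, a minimal NumPy sketch of the setup (the array contents are made up for illustration):

```python
import numpy as np

# Precomputation allowed once, in O(n log n):
x = np.array([0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5])
indices = np.argsort(x)        # positions of x in ascending order
order = np.argsort(indices)    # order[i] = rank of x[i] among all entries

# Per-query work we would like to avoid: a fresh sort for every sample.
sample = np.array([1, 3, 4, 7])
target = np.argsort(x[sample])

# Since ranks preserve relative order, sorting the precomputed ranks
# yields the same permutation -- but argsort is still O(k log k):
assert np.array_equal(target, np.argsort(order[sample]))
```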
Background:
To train a decision tree on a dataset `(X, y)`, at each split we are given an array of indices corresponding to the observations at the node (`sample`), and need to compute `y[argsort(X[sample, i])]` for each feature `i` in my dataset. A random forest is an ensemble of decision trees trained on `X[sample, :]`, where `sample` is a length-`n` array of indices. I am wondering if it is possible to sort each feature only once (i.e. pre-compute `argsort(X[:, i])` for each `i`) and reuse this in every tree.
One can assume that `sample` is sorted.
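As a toy sketch of the per-split work (all names here are illustrative, not scikit-learn's API; note that `argsort(X[sample, i])` yields positions within `sample`, so the labels are taken from `y[sample]`):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 3))              # n = 8 observations, m = 3 features
y = rng.integers(0, 2, size=8)      # labels

sample = np.array([0, 2, 3, 5, 7])  # observations at the current node
for i in range(X.shape[1]):
    # Labels in ascending order of feature i -- re-sorted at every split:
    ordered_y = y[sample][np.argsort(X[sample, i])]
```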
Example
Consider `x = [0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5]`. Then `indices = argsort(x) = [0, 8, 2, 5, 7, 6, 4, 9, 3, 1]`. Let `sample = [9, 9, 5, 6, 4]`. We would like to obtain `argsort(x[sample]) = [2, 3, 4, 0, 1]` without any sorting, i.e. in O(length(sample)) time.
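The example can be checked with NumPy (a straightforward sketch; the final line still performs exactly the sort we want to avoid):

```python
import numpy as np

x = np.array([0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5])
indices = np.argsort(x)
assert np.array_equal(indices, [0, 8, 2, 5, 7, 6, 4, 9, 3, 1])

sample = np.array([9, 9, 5, 6, 4])
# The desired result, computed here by sorting (O(k log k), not O(k)).
# kind='stable' keeps the two tied 7.5 entries in sample order:
result = np.argsort(x[sample], kind='stable')
assert np.array_equal(result, [2, 3, 4, 0, 1])
```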
Ideas
Given `sample`, we can compute `counts = tabulate(sample)`. For the above example this would be equal to `[0, 0, 0, 0, 1, 1, 1, 0, 0, 2]`. If `inverse_tabulate` is the inverse of `tabulate` (ignoring order), then `inverse_tabulate(tabulate(sample)[indices]) = argsort(x[sample])`. However, to the best of my understanding, `inverse_tabulate` is at best O(n) in time, which is suboptimal if `length(sample) << n`.
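A sketch of this idea in NumPy, using `np.bincount` for `tabulate` and `np.repeat` as a stand-in for `inverse_tabulate` (both names above are hypothetical; the `np.repeat` step walks all `n` counts, which is exactly the O(n) bottleneck described):

```python
import numpy as np

x = np.array([0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5])
n = len(x)
indices = np.argsort(x)                    # precomputed once
sample = np.array([9, 9, 5, 6, 4])

counts = np.bincount(sample, minlength=n)  # tabulate(sample), O(k)
# Emit each index counts[i] times, walking in ascending-x order.
# This touches all n entries of counts, hence O(n), not O(k):
sorted_sample = np.repeat(indices, counts[indices])
# sorted_sample now lists the elements of sample in ascending order of x.
```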
References
This question discusses the runtime of decision trees. This lecture script mentions on page 6, paragraph 4:
(Many implementations such as scikit-learn use efficient caching tricks to keep track of the general order of indices at each node such that the features do not need to be re-sorted at each node; hence, the time complexity of these implementations merely is O(m · n log(n)).)
This caching, however, seems to apply only within one tree. Also, looking at the scikit-learn tree source code, the `samples` appear to be re-sorted at each step / for each split.