
Summary:

I have an array x of length n and can run all kinds of O(n log(n)) operations on x and cache the results. E.g. I can pre-compute indices = argsort(x) and order = argsort(argsort(x)). Now, given sample, an array of indices in 0..n-1 of length at most n, I would like to compute argsort(x[sample]) in O(length(sample)) time (or as fast as possible). Is this possible?

Background:

To train a decision tree on a dataset (X, y), at each split we are given an array of indices corresponding to the observations at the node (sample), and need to compute y[argsort(X[sample, i])] for each feature i in the dataset. A random forest is an ensemble of decision trees trained on X[sample, :], where sample is a length-n array of indices. I am wondering whether it is possible to sort each feature only once (i.e. pre-compute argsort(X[:, i]) for each i) and reuse this in every tree.

One can assume that sample is sorted.

Example

Consider x = [0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5]. Then indices = argsort(x) = [0, 8, 2, 5, 7, 6, 4, 9, 3, 1]. Let sample = [9, 9, 5, 6, 4]. We would like to obtain argsort(x[sample]) = [2, 3, 4, 0, 1] without any sorting, i.e. in O(length(sample)) time.
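For concreteness, the numbers in this example can be checked with NumPy (using a stable sort so that the two tied 7.5 values at positions 0 and 1 of sample keep their order):

```python
import numpy as np

# Worked example from above.
x = np.array([0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5])
print(np.argsort(x).tolist())  # [0, 8, 2, 5, 7, 6, 4, 9, 3, 1]

sample = np.array([9, 9, 5, 6, 4])
# Stable sort keeps the repeated index 9 in its original sample order.
print(np.argsort(x[sample], kind="stable").tolist())  # [2, 3, 4, 0, 1]
```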

Ideas

Given sample, we can compute counts = tabulate(sample). For the above example this is [0, 0, 0, 0, 1, 1, 1, 0, 0, 2]. If inverse_tabulate is the inverse of tabulate (ignoring order), then inverse_tabulate(tabulate(sample)[indices]) = argsort(x[sample]). However, to the best of my understanding, inverse_tabulate takes at least O(n) time, which is suboptimal when length(sample) << n.
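One way to make this idea concrete: if sample is sorted (as assumed above), the occurrences of any index j occupy a contiguous run in sample starting at sum(counts[:j]), so a single O(n) pass over the precomputed indices recovers argsort(x[sample]). The helper below is my own sketch of such an inverse_tabulate-style pass, not a standard API; note its O(n) cost is exactly the bottleneck the question complains about.

```python
import numpy as np

def argsort_from_counts(indices, counts):
    # indices: precomputed argsort(x) over the full array (length n)
    # counts:  tabulate(sample); sample is assumed sorted, so occurrences
    #          of index j sit in a contiguous run starting at sum(counts[:j])
    start = np.concatenate(([0], np.cumsum(counts)[:-1]))
    out = []
    for j in indices:                 # one O(n) pass in sorted order of x
        out.extend(range(start[j], start[j] + counts[j]))
    return out

x = np.array([0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5])
indices = np.argsort(x)
sample = np.array([4, 5, 6, 9, 9])    # sorted variant of the example sample
counts = np.bincount(sample, minlength=len(x))
print(argsort_from_counts(indices, counts))  # [1, 2, 0, 3, 4]
```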

References

This question discusses the runtime of decision trees. This lecture script mentions on page 6, paragraph 4:

(Many implementations such as scikit-learn use efficient caching tricks to keep track of the general order of indices at each node such that the features do not need to be re-sorted at each node; hence, the time complexity of these implementations merely is O(m · n log(n)).)

This caching, however, seems to happen only within one tree. Also, looking at the scikit-learn tree source code, the samples appear to be re-sorted at each step / for each split.

desertnaut

1 Answer


I doubt that this is possible for the worst-case runtime. But for the average runtime, assuming a random sample, it is.

The idea is to do a bucketed radix sort, sending each sampled element to the bucket:

position of sample in overall list * number of samples / n

Each bucket then receives a number of samples that is approximately Poisson distributed with λ = 1. So walk through the buckets in order, sort each one with your favorite sorting algorithm, and append it to the output list.

It is worth noting that for arrays below 20-30 elements, insertion sort tends to be the fastest. The odds against a bucket having more elements than that are truly astronomical. So I'd recommend using insertion sort.
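A minimal sketch of this bucketing idea, assuming rank = argsort(argsort(x)) (the order array from the question) has been precomputed once in O(n log n); the bucket count of length(sample) and the insertion sort within buckets follow the answer's reasoning, but the exact layout is my own choice:

```python
import numpy as np

def bucket_argsort(rank, sample):
    # rank[i] = position of x[i] in the full sorted order,
    # i.e. rank = argsort(argsort(x)), precomputed once.
    n, m = len(rank), len(sample)
    buckets = [[] for _ in range(m)]
    for pos, s in enumerate(sample):
        # rank[s] * m // n approximates x[s]'s position in the sorted sample;
        # it is monotone in rank[s], so buckets are globally ordered.
        buckets[rank[s] * m // n].append(pos)
    out = []
    for bucket in buckets:
        # Insertion sort within each bucket; bucket sizes are roughly
        # Poisson(1) for a random sample, so this is O(1) on average.
        for i in range(1, len(bucket)):
            p, j = bucket[i], i - 1
            while j >= 0 and rank[sample[bucket[j]]] > rank[sample[p]]:
                bucket[j + 1] = bucket[j]
                j -= 1
            bucket[j + 1] = p
        out.extend(bucket)
    return out

x = np.array([0.5, 9.5, 2.5, 8.5, 6.5, 3.5, 5.5, 4.5, 1.5, 7.5])
rank = np.argsort(np.argsort(x))
print(bucket_argsort(rank, [9, 9, 5, 6, 4]))  # [2, 3, 4, 0, 1]
```

Note that each call only touches length(sample) buckets and elements, so the expected cost is O(length(sample)) once rank is cached, which is exactly the trade the question asks for.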

btilly
  • Thanks for your answer. If I understand correctly, a radix sort can always be applied if I know the distribution of my values to sort. I don't expect it to be very quick in practice. I am looking for a solution that uses the assumptions stated above, i.e. the fact that we are able to "pre-sort" before sampling. – M. Londschien Dec 24 '21 at 08:36
  • @M.Londschien A pure radix sort can be applied if you know the distribution exactly. The kind of radix bucketing sort that I gave will work if you know it approximately. If it is a random sample, then thanks to your "pre-sort", we know the order approximately. – btilly Dec 24 '21 at 16:20