Sort tuples instead.

    tuples = [(d['index'], d['value'])
              for d in array]
    tuples.sort()
You didn't post any timeit data. Show us representative data and an actual timing, then describe what kind of revised timing would be acceptable. It's not clear that you can beat timsort, though the lambda overhead will certainly be significant.
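If the dicts themselves must stay sorted in place, `operator.itemgetter` is the standard way to shave part of that lambda overhead: it extracts the same key as the lambda, but as a C-level callable. A minimal sketch (`array` is the list of dicts from your posted code):

    from operator import itemgetter

    # same key as lambda d: d['index'], but extracted in C,
    # so the per-item call overhead is lower
    array.sort(key=itemgetter('index'))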
If you need faster still, strip out the irrelevant value attribute:

    indices = [d['index']
               for d in array]
    indices.sort()
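If those indices are plain machine-sized ints and numpy is an option, sorting them as a packed array is typically faster still, since the sort never touches boxed Python objects. A sketch under that assumption:

    import numpy as np

    # np.fromiter skips the intermediate Python list; the sort
    # then runs over contiguous int64s rather than object pointers
    indices = np.fromiter((d['index'] for d in array), dtype=np.int64)
    indices.sort()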
Several elapsed times matter:

1. time to create the list
2. time to sort the list
3. time to use the sorted list
As stated, your question is underspecified, since it does not constrain (1.) or (3.), and we all know there are lies, damned lies, and micro-benchmarks. The initial (semi-sorted) order, the distribution of values, and the access pattern against the sorted list all matter for the final elapsed time.
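The initial-order point is easy to see for yourself: timsort looks for existing runs, so nearly-sorted input sorts much faster than shuffled input. A quick illustration, not a benchmark:

    import random
    from timeit import timeit

    n = 2_000_000
    ordered = list(range(n))        # one long pre-existing run
    shuffled = ordered.copy()
    random.shuffle(shuffled)

    # timsort is ~O(n) on the first, O(n log n) on the second
    print(timeit(lambda: sorted(ordered), number=3))
    print(timeit(lambda: sorted(shuffled), number=3))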
Some problems need only a subset of the full python3 semantics, and are amenable to numba optimization. You haven't told us enough for us to say whether that applies to your business problem.
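For example, if the indices can be pulled out into a numpy array first, a numba-compiled sort is one candidate. A sketch, assuming numba is installed; the function name is just for illustration:

    import numpy as np
    from numba import njit

    @njit
    def sort_indices(idx):
        idx.sort()      # ndarray.sort() compiles to native code in nopython mode
        return idx

    # the first call pays the JIT compilation cost; time the second call
    sort_indices(np.array([3, 1, 2], dtype=np.int64))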
EDIT
Timsort on a modern platform can easily sort 4 million items per second in the tuple form, somewhat less than that if lambda overhead is necessary. You didn't post timing data. You described a requirement to sort 700 K items per second on unknown hardware, and asserted that the posted code wasn't capable of that.

The posted code offered indices in sequential (sorted) order, which seemed odd, but I reproduced that aspect for tuple sorting in the code below.

Here is what I'm running, on a 2.9 GHz Intel Core i7 Mac laptop:
    #! /usr/bin/env python
    from time import time
    import random


    def elapsed(fn):
        """Decorator: report the wall-clock time of each call."""
        def print_elapsed(*args, **kw):
            t0 = time()
            ret = fn(*args, **kw)
            print(fn.__name__, '%.3f sec' % (time() - t0))
            return ret
        return print_elapsed


    @elapsed
    def get_values(k=2_000_000, base_val=42):
        # randint wants int bounds (3e6 is a float)
        return [dict(index=random.randint(0, 3_000_000), value=i + base_val + i % 10)
                for i in range(k)]


    @elapsed
    def get_tuples(dicts):
        return [(d['index'], d['value'])
                for d in dicts]


    @elapsed
    def get_indices(dicts):
        return [d['index']
                for d in dicts]


    @elapsed
    def sort_dicts(dicts):
        dicts.sort(key=lambda x: x['index'])


    @elapsed
    def sort_values(x, reverse=False):
        x.sort(reverse=reverse)


    if __name__ == '__main__':
        dicts = get_values()
        sort_dicts(dicts)

        tuples = get_tuples(dicts)
        sort_values(tuples)

        indices = get_indices(dicts)
        sort_values(indices)
Output for 2 M items:

    get_values 3.307 sec
    sort_dicts 2.121 sec
    get_tuples 1.355 sec
    sort_values 0.414 sec
    get_indices 0.715 sec
    sort_values 0.329 sec
Reducing the problem size to your stated 20 K items:

    get_values 0.034 sec
    sort_dicts 0.006 sec
    get_tuples 0.005 sec
    sort_values 0.001 sec
    get_indices 0.002 sec
    sort_values 0.001 sec
or even raising it tenfold to 200 K items, which starts to encounter cache misses:

    get_values 0.325 sec
    sort_dicts 0.105 sec
    get_tuples 0.111 sec
    sort_values 0.027 sec
    get_indices 0.064 sec
    sort_values 0.021 sec
it is hard to see how you could be encountering the slowness you describe. There must be some unseen aspect to the problem: you are running on a CPU with a slow clock rate, the target host's cache is small at some level, its DRAM is slow, or there is some aspect of the data you're sorting that you have not yet revealed to us.

The "populated with lists" part of your question is not apparent in the code you posted. You have not yet addressed whether techniques like cython or numba are relevant to your business problem. Maybe you do have a "slow sorting" technical issue, but what you have shared with us so far does not offer evidence of it.