
I am trying to write a count sort in Python to beat the built-in Timsort in certain situations. Right now it beats the built-in `sorted` function, but only for very large arrays (1 million integers in length and longer; I haven't tried over 10 million) and only for a range no larger than 10,000. Additionally, the victory is narrow, with count sort winning by a significant margin only on random lists specifically tailored to it.

I have read about astounding performance gains to be had from vectorizing Python code, but I don't particularly understand how to do it or how it could be used here. I would like to know how I can vectorize this code to speed it up, and any other performance suggestions are welcome.

Current fastest version for just python and stdlibs:

from itertools import chain, repeat

def untimed_countsort(unsorted_list):
    # Tally occurrences of each value.
    counts = {}
    for num in unsorted_list:
        try:
            counts[num] += 1
        except KeyError:
            counts[num] = 1

    # Emit each value in ascending order, repeated by its count.
    # .get(num, 0) guards against values absent from the input range,
    # which would otherwise raise KeyError.
    sorted_list = list(
        chain.from_iterable(
            repeat(num, counts.get(num, 0))
            for num in xrange(min(counts), max(counts) + 1)))
    return sorted_list
  • All that counts is raw speed here, so sacrificing even more space for speed gains is completely fair game.

  • I realize the code is fairly short and clear already, so I don't know how much room there is for improvement in speed.

  • If anyone has a change to the code to make it shorter, as long as it doesn't make it slower, that would be awesome as well.

  • Execution time is down almost 80%! Now three times as fast as Timsort on my current tests!
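As a point of comparison (not part of the original post), the counting loop can be collapsed with `collections.Counter`, and iterating the sorted keys skips values absent from the input entirely, which helps when the range is sparse. A Python 3 sketch (hence `range` rather than `xrange`; the name `countsort_sparse` is mine):

```python
from collections import Counter
from itertools import chain, repeat

def countsort_sparse(unsorted_list):
    # Counter replaces the manual try/except counting loop.
    counts = Counter(unsorted_list)
    # Iterating sorted keys visits only values that actually occur,
    # so a huge but sparse range costs nothing extra.
    return list(chain.from_iterable(
        repeat(num, count) for num, count in sorted(counts.items())))

print(countsort_sparse([7, 7, 100000, 3]))  # → [3, 7, 7, 100000]
```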

The absolute fastest way to do this by a LONG shot is using this one-liner with numpy:

import numpy

def np_sort(unsorted_np_array):
    # bincount tallies each value; repeat expands the tallies into sorted order.
    return numpy.repeat(numpy.arange(1 + unsorted_np_array.max()),
                        numpy.bincount(unsorted_np_array))

This runs about 10-15 times faster than the pure python version, and about 40 times faster than Timsort. It takes a numpy array in and outputs a numpy array.
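One caveat the post doesn't mention: `numpy.bincount` only accepts nonnegative integers, so the one-liner fails on arrays containing negatives. A sketch of a shifted variant that handles them (the name `np_sort_signed` is mine, not from the post):

```python
import numpy

def np_sort_signed(arr):
    # bincount rejects negative values, so shift everything up by the
    # minimum, count, then generate the output range starting at that minimum.
    arr = numpy.asarray(arr)
    lo = arr.min()
    return numpy.repeat(numpy.arange(lo, arr.max() + 1),
                        numpy.bincount(arr - lo))

print(np_sort_signed([-2, 3, -2, 0]))  # → [-2 -2  0  3]
```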

reem
  • [Code Review](http://codereview.stackexchange.com/) may be a better site for this question. – squiguy Aug 29 '13 at 03:35
  • Use [numpy](http://www.numpy.org/) and never loop over a numpy ndarray unless you've searched the documentation, consulted with other programmers, and determined that what you're trying to do cannot be done with vectorized builtins. – user2357112 Aug 29 '13 at 03:49
  • From my brief exploration of numpy, their documentation seems to be extremely lacking. Do you have any tips about how, specifically, I could implement numpy-driven performance boosts on this sort? – reem Aug 29 '13 at 04:12
  • Getting rid of the branch statements by using try catch blocks might provide a small speed boost. – smac89 Aug 29 '13 at 04:15
  • Changed to try/except block, actually lead to a pretty big performance boost. Now I get around 0.43 seconds instead of 0.48 seconds on a million-integer input. Updating main post. – reem Aug 29 '13 at 04:24
  • Instead of try/except, just use a `defaultdict(int)` and unconditionally do `counts[num] += 1`. – user2357112 Aug 29 '13 at 04:35
  • (Also, note that `except Exception, e` is a bad way to catch that exception. `except KeyError` would be better, as it specifies the type of error you're looking for and lets other exceptions propagate, and it doesn't save a reference to an exception object we don't care about.) – user2357112 Aug 29 '13 at 04:43
  • Changed exception block and used chain and repeat from itertools to speed up building the final sorted list. Now runs on average .2 seconds for 1 million random integers. – reem Aug 29 '13 at 15:12

2 Answers


With numpy, this function reduces to the following:

import numpy

def countsort(unsorted):
    unsorted = numpy.asarray(unsorted)
    return numpy.repeat(numpy.arange(1 + unsorted.max()), numpy.bincount(unsorted))

This ran about 40 times faster when I tried it on 100,000 random ints from the interval [0, 10000). `bincount` does the counting, and `repeat` converts from counts to a sorted array.

user2357112
  • This is the key: Anything you can get the interpreter to do in native code (rather than in python instructions) is going to be vastly faster. – Ben Jackson Aug 29 '13 at 04:34
  • I am totally blown away. This is awesome, runs twice as fast as my current fastest work, more than four times as fast as the built in sort - 1.5 seconds on 10 million integers. Well done sir. Is there any way to get it to return an ordinary list? – reem Aug 29 '13 at 04:43
  • You could call the output's [`tolist`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tolist.html) method, but depending on how much math you're doing, it might be better to keep the array. (My timing data used a numpy array as input for the numpy version and a list as input for the no-numpy version.) – user2357112 Aug 29 '13 at 04:48
  • Just ran this on 100,000,000 integers, 42 seconds as opposed to timsort's 110. On 1 million integers it ran in 0.014 s as opposed to timsort's 0.58. 41 times faster, that is some really really serious performance. – reem Aug 29 '13 at 05:08
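Following up on the `tolist` suggestion in the comments above, a hypothetical wrapper (the name `countsort_list` is mine) that accepts and returns ordinary Python lists while numpy does the work, assuming nonnegative integer input:

```python
import numpy

def countsort_list(unsorted_list):
    # Convert in, sort via bincount/repeat, convert back out.
    a = numpy.asarray(unsorted_list)
    return numpy.repeat(numpy.arange(a.max() + 1), numpy.bincount(a)).tolist()

print(countsort_list([4, 0, 4, 2]))  # → [0, 2, 4, 4]
```

Whether the round-trip conversion is worth it depends on what happens next: if the result feeds into more numpy math, keeping the array avoids both copies.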

Without thinking about your algorithm, this will help get rid of most of your pure Python loops (which are quite slow) by turning them into comprehensions or generators (usually faster than regular `for` blocks). Also, if you have to make a list consisting of all the same elements, the `[x]*n` syntax is probably the fastest way to go. The `sum` is used to flatten the list of lists.

from collections import defaultdict

def countsort(unsorted_list):
    lmin, lmax = min(unsorted_list), max(unsorted_list) + 1
    counts = defaultdict(int)
    for j in unsorted_list:
        counts[j] += 1
    # sum needs [] as its start value to concatenate lists
    # rather than try to add lists to the default start of 0.
    return sum([[num]*counts[num] for num in xrange(lmin, lmax) if num in counts], [])

Note that this is not vectorized, nor does it use numpy.
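One note on the flattening step: `sum(..., [])` copies the accumulated result on every addition, which is quadratic in the output size; `itertools.chain.from_iterable` flattens in a single linear pass. A hypothetical Python 3 variant (hence `range` rather than `xrange`; the name `countsort_chain` is mine):

```python
from collections import defaultdict
from itertools import chain

def countsort_chain(unsorted_list):
    counts = defaultdict(int)
    for j in unsorted_list:
        counts[j] += 1
    # chain.from_iterable flattens the per-value lists without the
    # repeated copying that sum(..., []) incurs.
    return list(chain.from_iterable(
        [num] * counts[num]
        for num in range(min(counts), max(counts) + 1)
        if num in counts))

print(countsort_chain([5, 3, 5, 1]))  # → [1, 3, 5, 5]
```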

SethMMorton
  • Hmm. `sorted_list += [num]*counts[num]` is significantly faster (more than twice as fast) than the way I was doing it, but strangely using the Counter class slows down execution immensely. Mixing and matching with a try/catch block and your method for extending the sorted_list, we can bring the time down to under .35 seconds on average! (Timsort gets about .55!) – reem Aug 29 '13 at 04:31
  • @sortfiend Does using a defaultdict work faster than Counter or the `try/except` block? – SethMMorton Aug 29 '13 at 04:39
  • I have not tried defaultdict, but my feeling is that would require allocating the entire range twice, which seems significantly slower. I will test it out and come back. – reem Aug 29 '13 at 05:02
  • OK. Either way, my solution will never beat the numpy solution. I just wanted to offer a pure python version in case you didn't have access to numpy. – SethMMorton Aug 29 '13 at 05:15
  • defaultdict is just a tiny bit slower than try/except. The list comprehension for the return value is just a tiny bit faster, so I will be changing to that. However, I'm having a tiny bit of a problem. When I drop the list comprehension format (with lmin and lmax changed so that they are defined in the comprehension) the list is sorted, but each number is in its own list. Like: [[1, 1], [2, 2, 2]] etc. How do I fix that? I'm still curious to work on a python solution because it is interesting. But yes, that numpy solution is crazy. – reem Aug 29 '13 at 05:23
  • On double-triple checking, defaultdict is actually faster. That makes your solution the fastest setup yet! EDIT: Scratch that again. Defaultdict is faster for smaller arrays (under a million) but the old version is faster for larger arrays. – reem Aug 29 '13 at 05:32
  • I should have realized you would get the list of lists... I think you will have to go back to the `+=` syntax. OR, you could put a `sum` around the list comprehension (which will flatten a list of lists), but this will just be some extra overhead. – SethMMorton Aug 29 '13 at 05:32
  • I tried to fix it by doing another comprehension on the new list, but that slows it down too much. It's a shame, because it is faster than another try/except block by about 10%. – reem Aug 29 '13 at 05:35
  • Oh man. If you have 1 million elements, forget about this solution and just go with numpy. – SethMMorton Aug 29 '13 at 05:36
  • A million elements runs in .3 seconds, but I'm not doing this because I actually need to sort a million elements, I'm just doing it because I'm doing it. I'm pretty satisfied with a 40% gain over Timsort without numpy though, so I'll put it to rest. – reem Aug 29 '13 at 05:39