
Numpy doesn't yet have a radix sort, so I wondered whether it was possible to write one using pre-existing numpy functions. So far I have the following, which does work, but is about 10 times slower than numpy's quicksort.

(line profiler output: image not reproduced here)

Test and benchmark:

import numpy as np

a = np.random.randint(0, int(1e8), int(1e6))
assert(np.all(radix_sort(a) == np.sort(a))) 
%timeit np.sort(a)
%timeit radix_sort(a)

The mask_b loop can be at least partially vectorized by broadcasting out across the masks produced by `&` and using cumsum with its axis argument, but that ends up being a pessimization, presumably due to the increased memory footprint.

If anyone can see a way to improve on what I have I'd be interested to hear, even if it's still slower than np.sort...this is more a case of intellectual curiosity and interest in numpy tricks.

Note that you can implement a fast counting sort easily enough, though that's only relevant for small integer data.
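As a point of reference, a minimal counting sort along those lines can be sketched with np.bincount and np.repeat (my own illustration, not code from the question; counting_sort and max_val are made-up names):

```python
import numpy as np

def counting_sort(a, max_val):
    # histogram the values, then emit each value as many times as it occurred
    counts = np.bincount(a, minlength=max_val + 1)
    return np.repeat(np.arange(max_val + 1, dtype=a.dtype), counts)

a = np.random.randint(0, 100, 10000)
assert np.array_equal(counting_sort(a, 99), np.sort(a))
```

This is only practical when the value range is small relative to the array length, which is exactly the limitation noted above.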

Edit 1: Taking np.arange(n) out of the loop helps a little, but that's not very exciting.

Edit 2: The cumsum was actually redundant (oops!), but this simpler version only helps marginally with performance:

def radix_sort(a):
    bit_len = int(np.max(a)).bit_length()
    n = len(a)
    cached_arange = np.arange(n)
    idx = np.empty(n, dtype=int) # fully overwritten each iteration
    for mask_b in range(bit_len):
        is_one = (a & 2**mask_b).astype(bool)
        n_ones = np.sum(is_one)      
        n_zeros = n-n_ones
        idx[~is_one] = cached_arange[:n_zeros]
        idx[is_one] = cached_arange[:n_ones] + n_zeros
        # next three lines just do: a[idx] = a, but correctly
        new_a = np.empty(n, dtype=a.dtype)
        new_a[idx] = a
        a = new_a
    return a

Edit 3: Rather than looping over single bits, you can loop over two or more at a time if you construct idx in multiple steps. Using 2 bits helps a little; I've not tried more:

idx[is_zero] = np.arange(n_zeros)
idx[is_one] = np.arange(n_ones)
idx[is_two] = np.arange(n_twos)
idx[is_three] = np.arange(n_threes)
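
Filling in the rest of that idea, one complete 2-bits-at-a-time version might look like the following (my own sketch rather than the exact code from the question; it extracts the digit with a shift-and-mask instead of four separate boolean masks, but the idx construction follows the same four-step pattern):

```python
import numpy as np

def radix_sort_2bit(a):
    # LSD radix sort, processing 2 bits (one base-4 digit) per pass
    bit_len = int(np.max(a)).bit_length()
    n = len(a)
    cached_arange = np.arange(n)
    idx = np.empty(n, dtype=int)  # fully overwritten each pass
    for shift in range(0, bit_len, 2):
        digit = (a >> shift) & 3
        offset = 0
        for d in range(4):  # build idx in four steps, one per digit value
            is_d = digit == d
            n_d = np.count_nonzero(is_d)
            idx[is_d] = cached_arange[:n_d] + offset
            offset += n_d
        new_a = np.empty(n, dtype=a.dtype)
        new_a[idx] = a  # stable scatter into digit-sorted order
        a = new_a
    return a

a = np.random.randint(0, int(1e8), int(1e4))
assert np.array_equal(radix_sort_2bit(a), np.sort(a))
```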

Edits 4 and 5: going to 4 bits seems best for the input I'm testing. Also, you can get rid of the idx step entirely. Now only about 5 times, rather than 10 times, slower than np.sort (source available as gist).


Edit 6: This is a tidied up version of the above, but it's also a tiny bit slower. 80% of the time is spent on repeat and extract - if only there were a way to broadcast the extract :( ...

def radix_sort(a, batch_m_bits=3):
    nax = np.newaxis
    bit_len = int(np.max(a)).bit_length()
    batch_m = 2**batch_m_bits
    mask = 2**batch_m_bits - 1
    val_set = np.arange(batch_m, dtype=a.dtype)[:, nax]
    for _ in range((bit_len-1)//batch_m_bits + 1): # ceil-division
        a = np.extract((a & mask)[nax, :] == val_set,
                        np.repeat(a[nax, :], batch_m, axis=0))
        val_set <<= batch_m_bits
        mask <<= batch_m_bits
    return a

Edits 7 & 8: Actually, you can broadcast the extract using as_strided from numpy.lib.stride_tricks, but it doesn't seem to help much performance-wise.


Initially this made sense to me on the grounds that extract will be iterating over the whole array batch_m times, so the total number of cache lines requested by the CPU will be the same as before (it's just that by the end of the process it has requested each cache line batch_m times). However, the reality is that extract is not clever enough to iterate over arbitrary stepped arrays, and has to expand out the array before beginning, i.e. the repeat ends up being done anyway. In fact, having looked at the source for extract, I now see that the best we can do with this approach is:

a = a[np.flatnonzero((a & mask)[nax, :] == val_set) % len(a)]

which is marginally slower than extract. However, if len(a) is a power of two we can replace the expensive mod operation with & (len(a) - 1), which ends up being a bit faster than the extract version (now about 4.9x np.sort for a=randint(0, 1e8, 2**20)). I suppose we could make this work for non-power-of-two lengths by zero-padding, and then cropping the extra zeros at the end of the sort... however this would be a pessimization unless the length was already close to being a power of two.
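
Putting those two observations together, the power-of-two-length variant might be sketched as follows (my reconstruction, with radix_sort_pow2 as a made-up name; it assumes len(a) is a power of two, as discussed):

```python
import numpy as np

nax = np.newaxis

def radix_sort_pow2(a, batch_m_bits=4):
    # flatnonzero on the (batch_m, n) comparison yields row-major indices,
    # i.e. (digit * n + position); for power-of-two n, `& (n - 1)` recovers
    # the position without an expensive mod
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    bit_len = int(np.max(a)).bit_length()
    mask = (1 << batch_m_bits) - 1
    val_set = np.arange(1 << batch_m_bits, dtype=a.dtype)[:, nax]
    for _ in range((bit_len - 1)//batch_m_bits + 1):  # ceil-division
        a = a[np.flatnonzero((a & mask)[nax, :] == val_set) & (n - 1)]
        val_set <<= batch_m_bits
        mask <<= batch_m_bits
    return a

a = np.random.randint(0, int(1e8), 2**16)
assert np.array_equal(radix_sort_pow2(a), np.sort(a))
```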

dan-man
  • Given that you're asking about optimizing working code, your question might be better suited to [CodeReview.SE](http://codereview.stackexchange.com). – ali_m Dec 01 '15 at 15:38
  • The answers this question has already received are good examples of why this question would be a better fit on CR; you're getting *suggestions*, not answers. – DSM Dec 01 '15 at 17:03
  • I'm hesitant to recommend `CodeReview` because there are a lot more `numpy` eyes on `Stackoverflow`. But for coding style and pure Python issues, CR is good. – hpaulj Dec 01 '15 at 18:28
  • @hpaulj you know, the best way to grow a community isn't to *avoid* it. – Mathieu Guindon Dec 01 '15 at 18:32
  • I'm slowly building my rep there. I check a couple of times a week to see if there are any forlorn numpy questions. :) – hpaulj Dec 01 '15 at 19:02
  • Not to put a damper on things, but we should look at this in context. `numpy sort` is implemented in C using a sort that is even more efficient than C's own `qsort` (it uses code generation techniques with macros to avoid the overhead of dynamic dispatch for the comparator that `qsort` requires, and `qsort` is already a speed demon). It's going to require a herculean feat to beat it without likewise reaching for native code. While radix sort has linear complexity, the binning tends to hit memory-associated hotspots. You might have some chance if you sort in sub-ranges and merge them together. –  Dec 01 '15 at 19:21
  • ... and ten times slower is actually really, really good. –  Dec 01 '15 at 19:23
  • @Ike - now 5 times ;) – dan-man Dec 01 '15 at 19:24
  • @dan-man That's getting there. Maybe you can manage it! It seemed kind of hopeless to me conceptually to use numpy to beat numpy's own native implementation -- but if you can manage it, I'd be most interested in the results. –  Dec 01 '15 at 19:40
  • @Ike - I guess you could do something with batches: make `counts` a 2d array, and then nest the `ii` loop within another loop over the batches..I might give it a try later! – dan-man Dec 01 '15 at 19:46
  • It looks like you're losing a lot of time going over `a` so many times for `a[a_shifted_masked==ii]` when you'd ideally just make one pass. (Oh, man, I remember that counting sort answer! Good times.) – user2357112 Dec 01 '15 at 19:50
  • Would `a[a_shifted_masked.argsort(kind='mergesort')]` be cheating? It's much further from optimal than what we'd do if we were working in something like Cython, but with pure NumPy, it seems like the easiest way to shift the elements of `a` where they need to go. – user2357112 Dec 01 '15 at 19:58
  • Other than `argsort`, NumPy doesn't really have any way to do a groupby-like operation. I've seen a [NumPy Enhancement Proposal](http://docs.scipy.org/doc/numpy-1.10.1/neps/groupby_additions.html) that would make this kind of thing pretty straightforward, but other than that, it looks like it's argsort and argsort wrappers or nothing. If you have Pandas, you could use its groupby functionality. I think that's written in Cython. – user2357112 Dec 01 '15 at 20:16
  • @user2357112 - I'm coauthor of [numpy-groupies](https://github.com/ml31415/numpy-groupies), which does groupby operations using various different backends, including pandas and numpy. Using `argsort` here is definitely cheating because it is a *sort* (and presumably will be slower due to full sort each iteration). – dan-man Dec 02 '15 at 12:48
  • `scipy.sparse.coo.coo_tocsr` does the exact loop you need for radix sort in C... Maybe under the right conditions it could be faster than `np.sort`. – Dec 03 '15 at 14:43
  • @morningsun - could you elaborate? I'd be interested to see what exactly you had in mind. – dan-man Dec 03 '15 at 15:14
  • @dan-man - I elaborated. Got a bit carried away `:D` –  Dec 03 '15 at 18:14
  • I'd be interested in a comparison of worst-case performances between the two. – AndyG Dec 04 '15 at 15:03
  • You could simply use one of the existing C or C++ implementations, such as, for example, `integer_sort` from [Boost.Sort](http://www.boost.org/doc/libs/1_60_0/libs/sort/doc/html/) or `u4_sort` from [usort](https://bitbucket.org/ais/usort). It is surprisingly easy to call native C or C++ code from Python, see [How to sort an array of integers faster than quicksort?](http://stackoverflow.com/q/35317442/341970) I will update that GitHub issue you refer to, and let the NumPy developers know that they could simply grab one of the existing implementations; licensing should not be an issue. – Ali Feb 10 '16 at 14:22

2 Answers


I had a go with Numba to see how fast a radix sort could be. The key to good performance with Numba (often) is to write out all the loops, which is very instructive. I ended up with the following:

from numba import jit

@jit
def radix_loop(nbatches, batch_m_bits, bitsums, a, out):
    mask = (1 << batch_m_bits) - 1
    for shift in range(0, nbatches*batch_m_bits, batch_m_bits):
        # set bit sums to zero
        for i in range(bitsums.shape[0]):
            bitsums[i] = 0

        # determine bit sums
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            bitsums[j] += 1

        # take the cumsum of the bit sums
        cumsum = 0
        for i in range(bitsums.shape[0]):
            temp = bitsums[i]
            bitsums[i] = cumsum
            cumsum += temp

        # sorting loop
        for i in range(a.shape[0]):
            j = (a[i] & mask) >> shift
            out[bitsums[j]] = a[i]
            bitsums[j] += 1

        # prepare next iteration
        mask <<= batch_m_bits
        # can't use `temp` here because of numba internal types
        temp2 = a
        a = out
        out = temp2

    return a

Of the 4 inner loops, it's easy to see that it's the 4th one that makes this hard to vectorize with Numpy.

One way to cheat around that problem is to pull in a particular C++ function from Scipy: scipy.sparse.coo.coo_tocsr. It does pretty much the same inner loops as the Python function above, so it can be abused to write a faster "vectorized" radix sort in Python. Maybe something like:

from scipy.sparse.coo import coo_tocsr

def radix_step(radix, keys, bitsums, a, w):
    coo_tocsr(radix, 1, a.size, keys, a, a, bitsums, w, w)
    return w, a

def scipysparse_radix_perbyte(a):
    # coo_tocsr internally works with system int and upcasts
    # anything else. We need to copy anyway to not mess with
    # original array. Also take into account endianness...
    a = a.astype('<i', copy=True)
    bitlen = int(a.max()).bit_length()
    radix = 256
    work = np.empty_like(a)
    _ = np.empty(radix+1, int)
    for i in range((bitlen-1)//8 + 1):
        keys = a.view('u1')[i::a.itemsize].astype(int)
        a, work = radix_step(radix, keys, _, a, work)
    return a

EDIT: Optimized the function a little bit; see edit history.

One inefficiency of LSB radix sorting like the above is that the array is completely shuffled in RAM a number of times, which means the CPU cache isn't used very well. To try to mitigate this effect, one could opt to first do a pass of MSB radix sort, to put items in roughly the right block of RAM, before sorting every resulting group with an LSB radix sort. Here's one implementation:

def scipysparse_radix_hybrid(a, bbits=8, gbits=8):
    """
    Parameters
    ----------
    a : Array of non-negative integers to be sorted.
    bbits : Number of bits in radix for LSB sorting.
    gbits : Number of bits in radix for MSB grouping.
    """
    a = a.copy()
    bitlen = int(a.max()).bit_length()
    work = np.empty_like(a)

    # Group values by single iteration of MSB radix sort:
    # Casting to np.int_ to get rid of python BigInt
    ngroups = np.int_(2**gbits)
    group_offset = np.empty(ngroups + 1, int)
    shift = max(bitlen-gbits, 0)
    a, work = radix_step(ngroups, a>>shift, group_offset, a, work)
    bitlen = shift
    if not bitlen:
        return a

    # LSB radix sort each group:
    agroups = np.split(a, group_offset[1:-1])
    # Mask off high bits to not undo the grouping..
    gmask = (1 << shift) - 1
    nbatch = (bitlen-1) // bbits + 1
    radix = np.int_(2**bbits)
    _ = np.empty(radix + 1, int)
    for agi in agroups:
        if not agi.size:
            continue
        mask = (radix - 1) & gmask
        wgi = work[:agi.size]
        for shift in range(0, nbatch*bbits, bbits):
            keys = (agi & mask) >> shift
            agi, wgi = radix_step(radix, keys, _, agi, wgi)
            mask = (mask << bbits) & gmask
        if nbatch % 2:
            # Copy result back in to `a`
            wgi[...] = agi
    return a

Timings (with best performing settings for each on my system):

def numba_radix(a, batch_m_bits=8):
    a = a.copy()
    bit_len = int(a.max()).bit_length()
    nbatches = (bit_len-1)//batch_m_bits +1
    work = np.zeros_like(a)
    bitsums = np.zeros(2**batch_m_bits + 1, int)
    srtd = radix_loop(nbatches, batch_m_bits, bitsums, a, work)
    return srtd

a = np.random.randint(0, int(1e8), int(1e6))
%timeit numba_radix(a, 9)
# 10 loops, best of 3: 76.1 ms per loop
%timeit np.sort(a)
#10 loops, best of 3: 115 ms per loop
%timeit scipysparse_radix_perbyte(a)
#10 loops, best of 3: 95.2 ms per loop
%timeit scipysparse_radix_hybrid(a, 11, 6)
#10 loops, best of 3: 75.4 ms per loop

Numba performs very well, as expected. And also with some clever application of existing C-extensions it's possible to beat numpy.sort. IMO, at the level of optimization you've already gotten to, it's worth it to also consider add-ons to Numpy, but I wouldn't really consider the implementations in my answer "vectorized": the bulk of the work is done in an external dedicated function.

One other thing that strikes me is the sensitivity to the choice of radix. For most of the settings I tried my implementations were still slower than numpy.sort, so in practice some sort of heuristic would be required to offer good performance across the board.

  • I did briefly wonder whether there was something relevant in `scipy.sparse` but I'm not really familiar with it myself...anyway, good spot! Also, I think you must be right that MSB is the way to go in terms of cache locality (I'd gotten so hooked on LSB I didn't investigate the alternative). Maybe I'll give that a go. I'm reluctant to mark this as the "correct" answer yet because the numba solution doesn't count and although the scipy solution beats mine there's a hint that it's possible to do better..and there's still the goal of `np.sort`. – dan-man Dec 03 '15 at 23:15
  • @dan-man - Thanks. Adding an MSB sort into the mix did help somewhat, but in terms of the amount of code it gives diminishing returns lol. I don't think this is the "correct" answer either, but at least I had a good time working on it :) –  Dec 04 '15 at 14:39
  • I'd be interested in comparing worst-case (or "bad case") performances between the two. – AndyG Dec 04 '15 at 15:04

Can you change this to be a counting / radix sort that works 8 bits at a time? For 32 bit unsigned integers, create a matrix[4][257] of counts of the occurrences of byte fields, making one read pass over the array to be sorted: matrix[][0] = 0, matrix[][1] = # of occurrences of 0, ... . Then convert the counts into indices, where matrix[][0] = 0, matrix[][1] = # of bytes == 0, matrix[][2] = # of bytes == 0 + # of bytes == 1, ... . The last count is not used, since that would index the end of the array.

Then do 4 passes of radix sort, moving data back and forth between the original array and the output array. Working 16 bits at a time would need a matrix[2][65537], but only take 2 passes. Example C code:

size_t mIndex[4][257] = {0};            /* index matrix */
size_t i, j, m;
uint32_t u;
uint32_t *pData;                        /* ptr to original array */
uint32_t *pTemp;                        /* ptr to working array */
uint32_t *pSrc;                         /* working ptr */
uint32_t *pDst;                         /* working ptr */
uint32_t *pTmp;                         /* ptr swap temp */
/* n is size of array */
    for(i = 0; i < n; i++){             /* generate histograms */
        u = pData[i];
        for(j = 0; j < 4; j++){
            mIndex[j][1 + (size_t)(u & 0xff)]++; /* note [1 + ... */
            u >>= 8;
        }       
    }
    for(j = 0; j < 4; j++){             /* convert to indices */
        for(i = 1; i < 257; i++){       /* (last count never used) */
            mIndex[j][i] += mIndex[j][i-1];
        }       
    }
    pDst = pTemp;                       /* radix sort */
    pSrc = pData;
    for(j = 0; j < 4; j++){
        for(i = 0; i < n; i++){         /* sort pass */
            u = pSrc[i];
            m = (size_t)(u >> (j<<3)) & 0xff;
        /*  pDst[mIndex[j][m]++] = u;      split into 2 lines */
            pDst[mIndex[j][m]] = u;
            mIndex[j][m]++;
        }
        pTmp = pSrc;                    /* swap ptrs */
        pSrc = pDst;
        pDst = pTmp;
    }
rcgldr
  • I still don't see how to do it vectorized...try writing that last loop in numpy/matlab rather than in C. – dan-man Dec 01 '15 at 17:39
  • @dan-man - I'm not sure how this could be vectorized, but wondering if the conventional iterative method for radix sort is going to be faster than what you're using now. – rcgldr Dec 01 '15 at 18:02