I'm writing a simple in-place introsort in C++, in which I'm trying to manually unroll a loop within the partition function for the sake of optimization. The program, which I'll include below, compiles but isn't able to sort a random list correctly.

The program is compiled for RISC-V, and even under -Ofast, gcc (riscv-64-unknown-elf-gcc) doesn't seem to unroll the loop on its own: it checks the end condition on every iteration. I'd like to space this check out to improve performance; it's my understanding that some compilers do this by default.

I've tried breaking this loop up into chunks of 5 as a proof of concept before going further (perhaps with multiple stages, e.g. groups of 32, then groups of 16, and so on), then handling the last few elements of the array as I did before. Before unrolling, the program worked fine, but now the sort fails and I'm not sure how to proceed.

Here's the partition function in question:

int* partition(int *startptr, int *endptr) {
    int x = *endptr; // threshold
    int *j, tmp, tmp2, *i = startptr - 1;
    for (j = startptr; j+5 < endptr; j+=5) {

        int pj = *j;
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp;
        }

        pj = j[1];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp;
        }

        pj = j[2];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp;
        }

        pj = j[3];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp;
        }

        pj = j[4];
        if (pj <= x) {
            i += 1;
            tmp = *i;
            *i = pj;
            *j = tmp;
        }
    }

    j -= 5;
    for (int *y = j; y < endptr; y++) {
        int py = y[0];
        if (py <= x) {
            i += 1;
            tmp = *i;
            *i = py;
            *y = tmp;
        }
    }

    int *incrementedi = i + 1;
    tmp = *incrementedi;   //p[i+1]
    tmp2 = *endptr; //p[end]
    *endptr = tmp;
    *incrementedi = tmp2;
    return i + 1;
}

At the end of the program, I print out the array and loop through it, asserting that it's in ascending order as anticipated. The output appears sorted in small chunks, but not as a whole, and I'm not sure how to proceed. Thank you!


Edit for clarification: I'm verifying that the loop is not in fact unrolling by looking at the output of ...-gcc -S. The partition function is inlining nicely but it still performs the check over every iteration.

It's worth noting that I'm using pointers whenever possible for a similar reason - the compiler isn't optimizing for the instruction savings we get when we don't have to convert array indices to actual pointers.

jaytlang
  • I'm not a big fan of template metaprogramming, but then again I'm not a big fan of manual optimizations either. Isn't this one of those cases where you may want to use templates to let the compiler generate this _for_ you? – CompuChip May 15 '19 at 15:38
  • I've heard of this here and there and it seems like a great idea - I'm really unfamiliar with the concept/implementation though. I'll research further but how might one go about creating a template for this? – jaytlang May 15 '19 at 15:42
  • I'd like to give an example but I'm afraid things may get mixed up. Perhaps we should use this question to solve the problem with the loop in the first place; then if you have working code you can post a separate question here (or on CodeReview) and I'll be happy to provide a templated version. – CompuChip May 15 '19 at 15:45
  • Okay! Once I'm able to get this proof of concept working and see how it impacts instruction counts, I'll ask about templating and see how much further we can go with it. Thanks! – jaytlang May 15 '19 at 15:51
  • I couldn't leave it alone, so I did it anyway, sorry :-) https://godbolt.org/z/M4b050 – CompuChip May 16 '19 at 12:03
  • That's awesome - and it seems really simple to write too. Thank you! Will try later today and see it in action with the RISC compiler. – jaytlang May 18 '19 at 13:43
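For the template idea raised in the comments above, here is a minimal sketch of what compile-time unrolling could look like. It is not the code behind the godbolt link; the names `step`, `Unroll`, and `partition_unrolled` are hypothetical, and it assumes a Lomuto partition with the pivot stored at `*hi`. Whether the compiler actually flattens the recursion into straight-line code depends on inlining, so the generated assembly is still worth checking.

```cpp
#include <cassert>
#include <cstddef>

// One Lomuto step: if *p belongs in the "<= pivot" region, swap it into
// the slot at `store` and advance `store`.
inline void step(int* p, int pivot, int*& store) {
    int v = *p;
    if (v <= pivot) {
        *p = *store;
        *store = v;
        ++store;
    }
}

// Unroll<N>::run expands into N consecutive step() calls on p[0..N-1].
template <std::size_t N>
struct Unroll {
    static void run(int* p, int pivot, int*& store) {
        step(p, pivot, store);
        Unroll<N - 1>::run(p + 1, pivot, store);
    }
};

// Base case: nothing left to process.
template <>
struct Unroll<0> {
    static void run(int*, int, int*&) {}
};

// Lomuto partition of [lo, hi] with pivot *hi, processing N elements per
// end-condition check, then finishing the remainder one at a time.
template <std::size_t N>
int* partition_unrolled(int* lo, int* hi) {
    static_assert(N > 0, "unroll factor must be positive");
    assert(lo <= hi);
    const int pivot = *hi;
    int* store = lo;  // next slot for an element <= pivot
    int* j = lo;
    for (; hi - j >= static_cast<std::ptrdiff_t>(N); j += N)
        Unroll<N>::run(j, pivot, store);
    for (; j < hi; ++j)
        step(j, pivot, store);
    int tmp = *store;  // move the pivot into its final position
    *store = *hi;
    *hi = tmp;
    return store;
}
```

Calling `partition_unrolled<8>(a, a + n - 1)` would then partition an `n`-element array in blocks of eight, and changing the unroll factor is a one-character edit rather than a copy-paste job.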

1 Answer

This example code works, and it is about 11% faster in 64-bit mode (more registers). The compiler optimized the compare and conditional copy of pj[...] via tmp to use a register, and it cycled through registers to allow some overlap.

int * Partition(int *plo, int *phi)
{
    int *pi = plo;
    int *pj = plo;
    int pvt = *phi;
    int tmp;
    int *ph8 = phi - 8;
    for (pj = plo; pj < ph8; pj += 8)
    {
        if (pj[0] < pvt)
        {
            tmp = pj[0];
            pj[0] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[1] < pvt)
        {
            tmp = pj[1];
            pj[1] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[2] < pvt)
        {
            tmp = pj[2];
            pj[2] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[3] < pvt)
        {
            tmp = pj[3];
            pj[3] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[4] < pvt)
        {
            tmp = pj[4];
            pj[4] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[5] < pvt)
        {
            tmp = pj[5];
            pj[5] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[6] < pvt)
        {
            tmp = pj[6];
            pj[6] = *pi;
            *pi = tmp;
            ++pi;
        }
        if (pj[7] < pvt)
        {
            tmp = pj[7];
            pj[7] = *pi;
            *pi = tmp;
            ++pi;
        }
    }
    for (; pj < phi; ++pj)
    {
        if (*pj < pvt)
        {
            tmp = *pj;
            *pj = *pi;
            *pi = tmp;
            ++pi;
        }
    }
    tmp  = *phi;
    *phi = *pi;
    *pi  = tmp;
    return pi;
}

void QuickSort(int *plo, int *phi)
{
    int *p;
    if (plo < phi)
    {
        p = Partition(plo, phi);
        QuickSort(plo, p-1);
        QuickSort(p+1, phi);
    }
}
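For comparison with the question's 5-way version, two things in it appear to break the sort: the blocks for `j[1]` through `j[4]` write the swapped value back with `*j = tmp` (that is, into `j[0]`) rather than into `j[k]`, and the `j -= 5;` after the main loop rewinds over elements that were already processed (or steps before the array if the main loop never ran). A minimal sketch with only those two points changed follows; the function name is hypothetical, and `i` starts at `startptr` instead of `startptr - 1` so that no pointer before the array is formed.

```cpp
#include <cassert>

// Lomuto partition of [startptr, endptr] with pivot *endptr, unrolled by 5.
int* partition_unrolled5(int* startptr, int* endptr) {
    assert(startptr <= endptr);
    const int x = *endptr;  // threshold (pivot value)
    int* i = startptr;      // next slot for an element <= x
    int* j = startptr;
    int v;

    // Main loop: runs only while at least five unprocessed elements
    // remain before endptr, so j[0..4] never touches the pivot slot.
    for (; endptr - j >= 5; j += 5) {
        // Each block writes back to j[k], the element it just read.
        v = j[0]; if (v <= x) { j[0] = *i; *i = v; ++i; }
        v = j[1]; if (v <= x) { j[1] = *i; *i = v; ++i; }
        v = j[2]; if (v <= x) { j[2] = *i; *i = v; ++i; }
        v = j[3]; if (v <= x) { j[3] = *i; *i = v; ++i; }
        v = j[4]; if (v <= x) { j[4] = *i; *i = v; ++i; }
    }

    // Tail: j already points at the first unprocessed element,
    // so it is not rewound here.
    for (; j < endptr; ++j) {
        v = *j;
        if (v <= x) { *j = *i; *i = v; ++i; }
    }

    // Move the pivot into its final position.
    int tmp = *i;
    *i = *endptr;
    *endptr = tmp;
    return i;
}
```

The loop condition is written as `endptr - j >= 5` rather than `j + 5 < endptr` so that the comparison never computes a pointer beyond the end of the array.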
rcgldr
  • Thank you -- this runs well! The loop unrolled nicely, and for my benchmarking array I saw a decrease from 228,000 cycles to 188,000 cycles to sort. This is incredibly helpful! – jaytlang May 15 '19 at 16:42
  • @jaytlang - This is Lomuto partition, which is OK for random data, but if there are a lot of duplicates or if the data is significantly pre-ordered (or reversed), [Hoare partition](https://en.wikipedia.org/wiki/Quicksort#Hoare_partition_scheme) will be faster, and there isn't any point in unfolding its tight loops, as there is a conditional branch involved either way. – rcgldr May 15 '19 at 21:31
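As a rough illustration of the Hoare scheme mentioned in the last comment, here is a minimal sketch. The function names are hypothetical and the middle element is assumed as the pivot; it is not tuned or benchmarked. Note that the returned pointer is a split point rather than the pivot's final position, so the recursion covers `[lo, p]` and `[p + 1, hi]`.

```cpp
#include <cassert>
#include <utility>

// Hoare partition of [lo, hi]: two pointers scan inward from both ends
// and swap out-of-place pairs until they meet or cross.
int* hoare_partition(int* lo, int* hi) {
    assert(lo <= hi);
    const int pivot = *(lo + (hi - lo) / 2);  // middle element (assumed choice)
    int* i = lo;
    int* j = hi;
    for (;;) {
        while (*i < pivot) ++i;  // skip elements already on the left side
        while (*j > pivot) --j;  // skip elements already on the right side
        if (i >= j) return j;    // everything in [lo, j] is <= everything after j
        std::swap(*i, *j);
        ++i;
        --j;
    }
}

void quicksort_hoare(int* lo, int* hi) {
    if (lo < hi) {
        int* p = hoare_partition(lo, hi);
        quicksort_hoare(lo, p);      // p is included on the left side
        quicksort_hoare(p + 1, hi);
    }
}
```

Each inner `while` loop is a single compare and branch per element, which is the tight loop the comment refers to.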