Fast, small area and low latency partial sorting algorithm

Question

I'm looking for a fast way to do a partial sort of 81 numbers - Ideally I'm looking to extract the lowest 16 values (its not necessary for the 16 to be in the absolutely correct order).

The target for this is dedicated hardware in an FPGA - so this slightly complicated matters as I want the area of the resultant implementation as small as possible. I looked at and implemented the odd-even merge sort algorithm, but I'm ideally looking for anything that might be more efficient for my needs (trade algorithm implementation size for a partial sort giving lowest 16, not necessarily in order as opposed to a full sort)

Any suggestions would be very welcome

Many thanks

Do you need the lowest 16 values, or are you ok if there are some bigger values in there? — Georg Schölly, Nov 09 '12 at 10:34
I'm ok with some larger values - I might just take the lowest 24 values to compensate for this — trican, Nov 09 '12 at 10:40
Since you are sorting for a few number of elements, you should take counting sort into account, which costs you in nearly linear time. — ylc, Nov 09 '12 at 10:56
Could you elaborate on this point ylc - I'm not sure what you mean? thanks — trican, Nov 09 '12 at 10:58
I mean the sorting algorithm [Counting sort](http://en.wikipedia.org/wiki/Counting_sort). — ylc, Nov 09 '12 at 11:00
How big are the numbers? If they are small, a radix sort with a radix of 16 might be good for you. I used such a sort for a list of less than 256 numbers of small size (short ints). It required only 16 buckets, which fit into a SSE register, which almost completely got rid of the initialization overhead. — Gunther Piez, Nov 09 '12 at 11:21
yes the numbers are small (less than 16bit). But its dedicated hardware so SSE optimisations are not of any use to me. Looking up radix sort now though! — trican, Nov 09 '12 at 11:31
You title asks for fast but your question asks for minimum area. Do you need to know the index of the min values, or just the min values? Is all the data available at once? — , Nov 09 '12 at 19:16
@Adam12 ideally both, but area is more important to me, that said I dont want too many pipeline stages <16 at most. I dont need to know the index of the minimum value. I'm happy to get the lowest 16 values — trican, Nov 10 '12 at 09:28
What do mean by 16 pipeline stages? Pipeline stage are added to resolve timing issues. Do you mean the algorithm must complete in 16 cycles of latency? That completely changes the problem. — , Nov 11 '12 at 20:41
yes @Adam12 thats what I meant, I want relatively low latency - ideally maybe around 16 cycles , definitely not hundreds of clock cycles anyway. — trican, Nov 11 '12 at 21:04
Sorting 81 values in 16 or less cycles is a challenging problem, even if only partially sorted. I believe it could be done, for instance divide the values in 4 parts each of 4 bit, count each of those 4 bit values of all 81 input values in parallel using a "less or equal" and sort into the resulting indices, repeat everything four times, done ;-) — Gunther Piez, Nov 12 '12 at 14:56
Your question seems to really be "I need a high-N min() algorithm with a fixed latency". — , Nov 12 '12 at 17:59

score 8 · Answer 1 · edited Nov 12 '12 at 12:36

8

This can be done in O(nlog_k) time, where in your case n = 81 and k = 16. This is much efficiently when you dealing with big number of elements than O(nlog_n).

Algorithm is following:

Init max heap data structure. This heap will contain your top 16. Size of this heap will not exceed 16 and it has log(k) for insertion and deletion
Iterate through list of n - numbers. For each n:
- if heap has less than 16 elements, just add it
- if heap has 16 elements, compare n with max from heap (if it is greater than max just skip, if less than max, remove max and add n.)

This way at every iteration you keep tracking smallest k numbers from currently processed part of list.

edited Nov 12 '12 at 12:36

Martin Thompson

16,395
1
38
56

answered Nov 09 '12 at 10:49

mishadoff

10,719
2
33
55

You should use a heap and not a sorted list because the sorted list does not have O(log k) for insertion. – Thomash Nov 09 '12 at 10:57
thanks - again this sounds interesting - though I think currently the quicksort would be better for a hardware implementation. I'll need to investigate further – trican Nov 09 '12 at 10:57
@trican: This algorithm should be pretty simple for a hardware implementation, particularly if you can do a couple of comparisons in parallel. First you do make_heap on the first 16 elements of the array (which is actually eight heap operations). Then you go through the remaining 81-16 elements, ignoring any which are greater than element 0, and otherwise replacing element 0 and reheapifying (which is the same operation as the last one in the make_heap.) This way your heap size is fixed and the individual operations are simplified. – rici Nov 09 '12 at 15:30

score 2 · Answer 2 · 2012-11-12T18:52:59.733

This sounds like a signal processing kernel of some sort. It's hard to help answer this without knowing the exact data flow in your design. Any algorithm involving a sort has a address decoding cost since you'll need to be able to write and read a memory of 81-elements. If this data is in a memory this cost has already been paid but if it is in distinct registers then writing to them carries an area cost.

Assuming the data is in a memory, you could use bubble sort and take bottom 16 values. The below code assumes a two-port memory but it could work with a single port by alternating reads and writes on every clock cycle using a temporary register to store the write value and write index. This may not be more area efficient with only 81 elements in the memory. Alternatively the source memory could be implemented as two single-port memories with one having odd indices and the other even indices.

// Not working code 
reg [15:0] inData [80:0]; // Must be two-port
reg [3:0]  iterCount = 0;
reg [6:0]  index = 0;
reg sorting;

always @(posedge clk)
  begin
  // Some condition to control execution
  if(sorting)
    begin

    if(index == 80)
      begin 

      // Stop after 16 iterations
      if(iterCount == 15)
        begin
        sorting <= 0;
        iterCount <= 0;
        end
      else
        begin
        iterCount <= iterCount+1;
        index <= 0;
        end
      end 
    else
      begin
      // If this is smaller than next item
      if(inData[index] < inData[index+1])
        begin
        // Swap with next item
        inData[index+1] <= inData[index];
        inData[index]   <= inData[index+1];
        end
      index <= index + 1;
      end
    end
  end

EDIT: If you're latency constrained, allowed only one clock domain and must pipeline then the problem is limited to selecting a sorting network and mapping it to a pipeline. You can't use resource sharing so area is fixed given your sorting network.

Select a sorting network(Bitonic, Pairwise,etc) for N=81 (Not easy)
Build a directed data flow graph representing the sorting network
Remove all nodes except those required for outputs 66-81
Determine the latency L of one comparison node
Use ALAP scheduling to assign M nodes per clock where M*L < 1/f
If scheduling succeeds code it up in an HDL

Many thanks for this suggestion Adam. Whilst I appreciate this approach probably represents the minimal area and probably has a very high maximum frequency, my concern however is the latency is hundreds (or possibly thousands) of clock cycles - which I cant afford. As you suspected this is for a signal processing application, so I get a new group of 81 values to sort every clock cycle. Thus efficient pipelining (in the sense of multiple groups are being sorted at once) with a short latency is vital. — trican, Nov 11 '12 at 21:52
An even odd merge sort or bitonic sort would have much shorter latency than a bubble sort and also has a very regular data flow for implementation purposes. Of course the trade off is much higher area - its delicate balancing act I suppose — trican, Nov 11 '12 at 21:54

score 1 · Answer 3 · answered Nov 09 '12 at 10:40

If you're looking for general algorithms, then you could use a version of Quicksort. Quicksort sorts values around a pivot element. You choose a pivot and sort your array. You then get x values < pivot and (n-x-1) larger ones. Depending on x, you choose either one array to continue processing. If x>16, then you know that the numbers your looking for are all in the x-array and you can continue sorting that. If not, then you know x lowest and can now look for the 16-x others in the other array recursively.

The resulting arrays of quicksort are not fully sorted, you only know that they are smaller or larger than your pivot. Some info at wikipedia and a paper.

This sounds like a very interesting approach - I will definitely read the linked paper and have a think about its hardware implementation - I would need to use a fixed number of iterations to keep it amendable for hardware. But I suspect that wouldn't compromise the result too much — trican, Nov 09 '12 at 10:48

score 0 · Answer 4 · answered Nov 09 '12 at 10:44

First set an array with 16 values. And use something like selection sort:

You think the first value is the minimal in the array.
Compare the value of the first element, second, until you find a number lower than it.
Take that as your new minimal, and compare it with the next numbers until you find a number lower than it.
If you finished the array, the number you have is your minimal, save it in an array and exclude it from the source array, and start over again until you fill the 16 positions of the solution array.

score 0 · Answer 5 · answered Dec 18 '22 at 13:42

If you want to avoid state machines and loop-indexing (with associated large muxes or RAMs) then a hardware friendly approach might be to use a sorting network for 17 items (yuck!). Feed this with 16 items from registers (initially empty) and 1 item from the input bus. For each valid cycle on the input bus, the output of the sorting network is the smallest 16 seen so far with the new element either within the results or at the top (if it is larger). The smallest 16 then latch into the registers for the next cycle and the top output from the sorting network is ignored.

This will require 16*(n-bits per element) registers, and a sorting network that is 5 subtract/compare blocks deep and 8 wide, for 40 subtract/complete blocks total (each with a worst case propagation of n-bits). Can be pipelined between stages trivially if you can back pressure the input bus. The solution is almost constant in time and space: 1 cycle per input, and space is only a function of the number of outputs not inputs.

Fast, small area and low latency partial sorting algorithm

5 Answers5