This sounds like a signal processing kernel of some sort. It's hard to answer this without knowing the exact data flow in your design. Any algorithm involving a sort has an address decoding cost, since you need to be able to write and read a memory of 81 elements. If the data is already in a memory, this cost has already been paid, but if it is in distinct registers then writing to them carries an area cost.
Assuming the data is in a memory, you could use bubble sort and take the bottom 16 values. The code below assumes a two-port memory, but it could work with a single port by alternating reads and writes on every clock cycle, using a temporary register to store the write value and write index. This may not be more area efficient with only 81 elements in the memory.
Alternatively, the source memory could be implemented as two single-port memories, one holding the odd indices and the other the even indices.
    // Not working code
    reg [15:0] inData [80:0]; // Must be two-port: each step touches two adjacent entries
    reg [3:0] iterCount = 0;
    reg [6:0] index = 0;
    reg sorting; // High while a sort is in progress
    always @(posedge clk)
    begin
        // Some condition to control execution
        if(sorting)
        begin
            if(index == 80)
            begin
                // End of a pass: always restart from the first element
                index <= 0;
                // Stop after 16 passes; inData[65] to inData[80] then hold the 16 smallest values
                if(iterCount == 15)
                begin
                    sorting <= 0;
                    iterCount <= 0;
                end
                else
                begin
                    iterCount <= iterCount+1;
                end
            end
            else
            begin
                // If this is smaller than next item
                if(inData[index] < inData[index+1])
                begin
                    // Swap with next item, so smaller values move toward index 80
                    inData[index+1] <= inData[index];
                    inData[index] <= inData[index+1];
                end
                index <= index + 1;
            end
        end
    end
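The single-port variant mentioned above could look roughly like the following. This is only a sketch under my own assumptions: the module name `bubble_select_1p`, the `start`/`busy` handshake, the state names, and the `carry`/`nxt` registers are all made up for illustration, and `idx` is fixed at 7 bits, so `DEPTH` must stay at or below 128. Each step reads one entry, then writes one entry on the following clock, so the memory is accessed once per cycle. Whether a synthesis tool actually infers a single-port RAM from this depends on the tool.

```verilog
// Hypothetical single-port bubble selection; all names here are assumptions.
module bubble_select_1p #(
    parameter WIDTH  = 16,
    parameter DEPTH  = 81,   // must be <= 128 because idx is 7 bits
    parameter PASSES = 16
) (
    input  wire clk,
    input  wire start,       // pulse high while idle to begin sorting
    output reg  busy
);
    localparam S_FIRST = 2'd0, S_READ = 2'd1, S_WRITE = 2'd2, S_LAST = 2'd3;

    reg [WIDTH-1:0] mem [0:DEPTH-1];  // accessed at most once per clock
    reg [WIDTH-1:0] carry;            // smallest value seen so far in this pass
    reg [WIDTH-1:0] nxt;              // entry read in the previous cycle
    reg [6:0] idx;
    reg [4:0] passCount;
    reg [1:0] state;

    initial begin
        busy = 1'b0; idx = 0; passCount = 0; state = S_FIRST;
    end

    always @(posedge clk) begin
        if (!busy) begin
            if (start) begin
                busy <= 1'b1; passCount <= 0; state <= S_FIRST;
            end
        end else begin
            case (state)
            S_FIRST: begin            // read cycle: load the first entry
                carry <= mem[0]; idx <= 0; state <= S_READ;
            end
            S_READ: begin             // read cycle: fetch the neighbour
                nxt <= mem[idx + 1]; state <= S_WRITE;
            end
            S_WRITE: begin            // write cycle: larger value stays behind
                if (carry < nxt) begin
                    mem[idx] <= nxt;
                end else begin
                    mem[idx] <= carry; carry <= nxt;
                end
                if (idx == DEPTH-2) state <= S_LAST;
                else begin idx <= idx + 1; state <= S_READ; end
            end
            S_LAST: begin             // write cycle: smallest value goes last
                mem[DEPTH-1] <= carry;
                if (passCount == PASSES-1) busy <= 1'b0;
                else begin passCount <= passCount + 1; state <= S_FIRST; end
            end
            endcase
        end
    end
endmodule
```

With the defaults, each pass costs 162 clocks (one initial read, 80 read/write pairs, one final write) instead of 81, which is the price of dropping the second port. After 16 passes the 16 smallest values sit in `mem[65]` to `mem[80]`, the same place the two-port version leaves them.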
EDIT: If you're latency constrained, allowed only one clock domain, and must pipeline, then the problem reduces to selecting a sorting network and mapping it to a pipeline. You can't use resource sharing, so the area is fixed once you have chosen the sorting network.
- Select a sorting network (Bitonic, Pairwise, etc.) for N=81 (not easy)
- Build a directed data flow graph representing the sorting network
- Remove all nodes except those required for outputs 66-81
- Determine the latency L of one comparison node
- Use ALAP (as late as possible) scheduling to assign up to M chained nodes per clock cycle, where M*L < 1/f and f is the clock frequency
- If scheduling succeeds, code it up in an HDL
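To make the steps above concrete, here is a toy version for N=4 rather than N=81, using the well-known five-comparator network and one comparison level per clock (M=1). The module names `cmp_exchange` and `sort4_pipe` and all signal names are my own placeholders; treat it as an illustration of the mapping, not as a solution for 81 inputs.

```verilog
// Hypothetical compare-exchange node with registered outputs (one pipeline stage).
module cmp_exchange #(parameter WIDTH = 16) (
    input  wire             clk,
    input  wire [WIDTH-1:0] a,
    input  wire [WIDTH-1:0] b,
    output reg  [WIDTH-1:0] lo,
    output reg  [WIDTH-1:0] hi
);
    always @(posedge clk) begin
        if (a < b) begin lo <= a; hi <= b; end
        else       begin lo <= b; hi <= a; end
    end
endmodule

// Hypothetical 4-input sorting network, three pipeline stages deep.
module sort4_pipe #(parameter WIDTH = 16) (
    input  wire             clk,
    input  wire [WIDTH-1:0] in0, in1, in2, in3,
    output wire [WIDTH-1:0] out0, out1, out2, out3  // ascending, 3 clocks later
);
    wire [WIDTH-1:0] s1_0, s1_1, s1_2, s1_3;
    wire [WIDTH-1:0] s2_0, s2_1, s2_2, s2_3;
    reg  [WIDTH-1:0] d0, d3;

    // Stage 1: compare (0,1) and (2,3)
    cmp_exchange #(WIDTH) c10 (clk, in0, in1, s1_0, s1_1);
    cmp_exchange #(WIDTH) c11 (clk, in2, in3, s1_2, s1_3);
    // Stage 2: compare (0,2) and (1,3)
    cmp_exchange #(WIDTH) c20 (clk, s1_0, s1_2, s2_0, s2_2);
    cmp_exchange #(WIDTH) c21 (clk, s1_1, s1_3, s2_1, s2_3);
    // Stage 3: compare (1,2); wires 0 and 3 only need a delay register
    cmp_exchange #(WIDTH) c30 (clk, s2_1, s2_2, out1, out2);
    always @(posedge clk) begin
        d0 <= s2_0;
        d3 <= s2_3;
    end
    assign out0 = d0;
    assign out3 = d3;
endmodule
```

The pruning step shows up even at this size: if only the largest value (`out3`) were needed, `c20`, `c30` and the `d0` register could be removed, leaving three comparators. For the scheduling step, with made-up numbers L = 2 ns and f = 100 MHz, 1/f is 10 ns, so M*L < 1/f allows at most M = 4 comparison levels between registers.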