How can I optimize this bit-packing function in C?

Question

This code sample takes an arbitrarily large number of bytes as inputs, one at the time, and maps them to a table with 32 values [0,31]. To simplify the example I used mod 32. The >> 3 operation is equivalent to division by 8. In case it is not clear, the "q" loop counts to 5 because numbers from 0 to 31 only require 5-bits of resolution.

Can this be improved in terms of speed?

      uint8_t sto[10000]; 
      uint8_t srt[32] = {<32 values in total>};
      uint8_t t,q
      
      uint64_t loop,cr = 0;
      
      for(loop = 0; loop < 10000; loop++) { 
          t = srt[loop % 32];     
          for(q = 0; q < 5; q++) {
              if(t & (0x1 << (4 - q))) 
                  sto[cr >> 3] |= (0x1 << (7 - (cr % 8)));
              cr++;
            }
      }

Won't `cr` quickly become large enough that `(cr>>3) >= 32` and therefore `sto[cr >> 3]` is reading/writing past the end of the `sto` array and invoking undefined behavior? — Jeremy Friesner, Jun 13 '23 at 02:25
You didn't initialize `sto` before reading it. This is UB. /// `t` is not used. Is some part missing? /// What is `largeNumber`? More missing code! /// How did you arrive to `1000000`? /// You ask us to optimize code, but there's so much questionable about the code. Can't optimize code that's not even correct! — ikegami, Jun 13 '23 at 02:43
That's not even addressing the real problem that real optmization is about taking a better overall approach, which requires knowledge of what is being achieved. This was not provided. /// Even if we limit ourselves to micro-optimizations, you didn't provided any information about the operating environment! What architecture do you want to optimize for? — ikegami, Jun 13 '23 at 02:45
I can think of many arbitrarily large numbers of bytes that will not fit in this buffer. — paddy, Jun 13 '23 at 02:52
@affluentbarnburner, rather than a terse code snippet being described as a `function`, post a compliable function. — chux - Reinstate Monica, Jun 13 '23 at 04:21
`if(srt[ord] & (0x1 << (4 - q))) sto[cr >> 3] |= (0x1 << (7 - (cr % 8)));` is just `sto[cr >> 3] |= (srt[ord] >> 4-q & 1) << 7-cr%8;`. But you should not be moving bits one by one. You can use shifts and bitwise operations to move multiple bits from `sto` into `srt` in groups. Ideally, `srt` would use a wider type than `uint8_t`, so more work could be done per array element. — Eric Postpischil, Jun 13 '23 at 11:09
Edit the question to provide a [mre], including a test program with some sample data so that people answering can check their work. — Eric Postpischil, Jun 13 '23 at 11:14
Why does your text say the code takes an arbitrarily large number of bytes as inputs but the code uses the array with an arbitrarily large number of bytes, `sto`, as output, not input? The code is reading from `srt` (exactly 32 bytes) and writing to `sto` (an arbitrarily large number of bytes). — Eric Postpischil, Jun 13 '23 at 11:21
The code simply repeatedly ORs a sequence of 160 bits (5•32) from `srt` into the bytes of `sto`. The primary optimization for that is clear: First reorganize the 160 bits from `srt` into contiguous bits, then OR those bits into `sto` using whole-word units for some word size—at least a byte (for maximally portable C), preferably more if `sto` can be aligned or if unaligned operations are available (depends on implementation details, available SIMD features, et cetera). — Eric Postpischil, Jun 13 '23 at 11:50

chux - Reinstate Monica · Answer 1 · 2023-06-13T03:53:33.340

3

Better optimization possible with larger code and a timing test harness.
Candidate simplifications (All minor).

t never used. Drop it

Drop code t = srt[loop % 32];.

// t = srt[loop % 32];

@affluentbarnburner, did you mean if(t & (0x1 << (4 - q))) instead of if(srt[ord] & (0x1 << (4 - q)))?

Form a mask

//for(q = 0; q < 5; q++) {
//  if(srt[ord] & (0x1 << (4 - q)))
for (unsigned mask = 1 << 4; mask; mask >>= 1) {
  if (srt[ord] & mask)

Use size_t for array indexing

  // uint64_t loop;
  size_t loop;
  for(loop = 0; loop < largeNumber; loop++) {

Use byte and bit indexes rather than a combined one

Separate (narrower) indexes may be faster.

  // uint64_t loop,cr = 0;
  size_t loop, cr_byte = 0;
  unsigned cr_bit_mask = 0x80;
  
  for(loop = 0; loop < largeNumber; loop++) { 
      // t = srt[loop % 32];     
      for (unsigned mask = 1 << 4; mask; mask >>= 1) {
          if(srt[ord] & mask) { 
              sto[cr_byte] |= cr_bit_mask;
          }
          cr_bit_mask >>= 1;
          if (cr_bit_mask == 0) {
              cr_bit_mask = 0x80;
              cr_byte++;
          }
        }
  }

edited Jun 13 '23 at 03:53

answered Jun 13 '23 at 03:45

chux - Reinstate Monica

143,097
13
135
256

the biggest optimization would be to vectorize it with SIMD and parallelize it – phuclv Jun 13 '23 at 03:57
That approach is not part of standard C and since post was not tag for that, that idea is not mentioned. Yet your interesting idea deserves your own answer. – chux - Reinstate Monica Jun 13 '23 at 03:59
I suspect there's some kind of `sto[cr_byte] |= ( (st[ord] & 0x1F) << ?) | temp; temp = (st[ord] & 0x1F) >> ?;` approach that can eliminate the inner loop; which might be fun. – Brendan Jun 13 '23 at 04:24
@Brendan As I see it, one bit index has a period of 5 and another a period of 8. For large array, might make sense, when `n >= 40`, to have 40 lines of code dealing with each group of 40 conversions. Yet without a user supplied test harness, much work is theoretical. Certainly more deserving if OP supplied more info, example and a true function. – chux - Reinstate Monica Jun 13 '23 at 04:28
@Brendan The compiler can probably unroll the inner loop if it notices a pattern with constants. Optimizations should focus on removing branches. For example you could perhaps per-calculate the various shifts, store them in temporary variables and then always unconditionally shift, although sometimes with zero. – Lundin Jun 13 '23 at 08:33

Eric Postpischil · Accepted Answer · 2023-06-13T13:49:29.457

The code can be reduced to simply:

for (size_t i = 0; i < 10000; ++i)
    sto[i] |= bits[i % NBytes];

after doing some work to prepare bits. That loop can be unrolled, can be converted to units wider than uint8_t, can be converted to SIMD code, and can be parallelized. After a few of those, such as converting to wider units, it might stream at memory bandwidth, so further optimization might not be useful.

The work to prepare bits is to consolidate the bits from srt into a full bitmask:

#define NSrt    32          //  Number of elements in srt.
#define NBits   (NSrt*5)    //  Number of bits used from srt.
#define NBytes  (NBits/8)   //  Number of bytes used by NBits.

/*  Consolidate the bits of srt into bytes, processing eight elements of
    srt (i) and five elements of bits (j) per iteration.
*/
uint8_t bits[NBytes];
for (size_t i = 0, j = 0; i < NSrt; i += 8, j += 5)
{
    /*  Get five bits from each srt elements.  (b0 does not need to be
        masked because its high bits will be shifted out of the uint8_t
        span.)
    */
    uint8_t
        b0 = srt[i + 0],
        b1 = srt[i + 1] & 0x1f,
        b2 = srt[i + 2] & 0x1f,
        b3 = srt[i + 3] & 0x1f,
        b4 = srt[i + 4] & 0x1f,
        b5 = srt[i + 5] & 0x1f,
        b6 = srt[i + 6] & 0x1f,
        b7 = srt[i + 7] & 0x1f;

    //  Merge the five-bit segments into eight-bit units.
    bits[j + 0] = b0 << 3 | b1 >> 2;
    bits[j + 1] = b1 << 6 | b2 << 1 | b3 >> 4;
    bits[j + 2] = b3 << 4 | b4 >> 1;
    bits[j + 3] = b4 << 7 | b5 << 2 | b6 >> 3;
    bits[j + 4] = b6 << 5 | b7;
}

Very nice. (Just a little more work if `NSrt%8 != 0`). – chux - Reinstate Monica Jun 13 '23 at 14:00 — chux - Reinstate Monica, Jun 13 '23 at 14:00

How can I optimize this bit-packing function in C?

2 Answers2