Compressing a 'char' array using bit packing in C

Question

I have a large array (around 1 MB) of type unsigned char (i.e. uint8_t). I know that the bytes in it can have only one of 5 values (i.e. 0, 1, 2, 3, 4). Moreover we do not need to preserve '3's from the input, they can be safely lost when we encode/decode.

So I guessed bit packing would be the simplest way to compress it, so every byte can be converted to 2 bits (00, 01..., 11).

As mentioned all elements of value 3 can be removed (i.e. saved as 0). Which gives me option to save '4' as '3'. While reconstructing (decompressing) I restore 3's to 4's.

I wrote a small function for the compression but I feel this has too many operations and just not efficient enough. Any code snippets or suggestion on how to make it more efficient or faster (hopefully keeping the readability) will be very much helpful.

/// Compress by packing ...
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
    for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
    {
      uint8_t temp[4];

      for (int small_loop = 0; small_loop < 4; small_loop++)
      {
        temp[small_loop] = *in;           // Load into local variable

        if (temp[small_loop] == 3)        // 3's are discarded
          temp[small_loop] = 0;
        else if (temp[small_loop] == 4)   // and 4's are converted to 3
          temp[small_loop] = 3;

      } // end small loop

      // Pack the bits into write pointer
      *out = (uint8_t)((temp[0] & 0x03) << 6) |
                      ((temp[1] & 0x03) << 4) |
                      ((temp[2] & 0x03) << 2) |
                      ((temp[3] & 0x03));

    } // end loop
 }

Edited to make the problem more clear as it looked like I'm trying to save 5 values into 2 bits. Thanks to @Brian Cain for suggested wording.
Cross-posted on Code Review.

I don't understand your textual logic explaining how 5 values can be represented by only 2 bits. Are you using "escape" code? Otherwise I don't understand how when unpacking you will distinguish whether a `0` means `0` or `3`. — Weather Vane, Jul 18 '17 at 20:07
@LeeDanielCrocker the post clearly says *I know that the bytes in it can have only one of 5 values (i.e. 0, 1, 2, 3, 4).* — Weather Vane, Jul 18 '17 at 20:11
It sounds like a valid encoding is to delete input values with `3` because they don't contribute. Effectively there are four distinct values that must be preserved during encoding. — Brian Cain, Jul 18 '17 at 20:11
@WeatherVane, we could reword this as "the inputs should be constrained to 0,1,2,3,4 but we do not need to preserve '3's from the input, they can be safely lost when we encode/decode." — Brian Cain, Jul 18 '17 at 20:12
We are not a coding service. What is your specific problem and why do you think it is too slow? Also it is not clear how you intend to pack 5 valuers into 2 bits. — too honest for this site, Jul 18 '17 at 20:14
For working code there is https://codereview.stackexchange.com/ — chux - Reinstate Monica, Jul 18 '17 at 20:14
What matters is not that `3` is unimportant, but that you **know** it is a `3` on decoding. Otherwise how will you know the `0` you encode both `0` and `3` as, is the important `0` or the unimportant `3`? — Weather Vane, Jul 18 '17 at 20:26
@chux I was not aware of codereview.stackexchange . I might try there. Just to clarify I am not looking for just an working solution or ready-to-compile code. Any suggestions or hints on how to handle the operations more efficiently but maintaining the readability will be of much help. Thank you. — avikpram, Jul 18 '17 at 20:29
@avikpram [Code Review is a question and answer site for seeking peer review of your code.](https://codereview.stackexchange.com/tour) fits the "Any suggestions or hints on how to handle the operations more efficiently but maintaining the readability will be of much help." — chux - Reinstate Monica, Jul 18 '17 at 20:32
I'm voting to close this question as off-topic because it belongs on https://codereview.stackexchange.com/ — chux - Reinstate Monica, Jul 18 '17 at 20:48
@chux I disagree, whenever no effort is shown we moan, now OP showed a real effort, had he just asked: I did this which doesn't work, could you help me finish it and make it faster it would have belonged here. — Tony Tannous, Jul 18 '17 at 21:01
@TonyTannous As is, the post does not meet codereview's standards, yet only needs some additional wrapping code and a few other things to meet that sites standard's - accessible [here](https://codereview.stackexchange.com/questions/ask) — chux - Reinstate Monica, Jul 18 '17 at 21:19
Cast the final close vote as [this question *has* now been posted on Code Review](https://codereview.stackexchange.com/questions/169596/compressing-a-char-array-using-bit-packing), so it makes sense to keep the answers together in one place. If you have an answer to post, please do so over there. — Cody Gray - on strike, Jul 19 '17 at 08:42
Thanks to everybody with special mention to @BiranCane . I have selected [this answer](https://codereview.stackexchange.com/a/169597/143628) in [codereview.se] . — avikpram, Jul 19 '17 at 12:38

score 3 · Answer 1 · edited Jul 19 '17 at 08:18

Your function has a bug: when loading the small array, you should write:

    temp[small_loop] = in[small_loop];

You can get rid of the tests with a lookup table, either on the source data, or more efficiently on some intermediary result:

In the code below, I use a small table lookup5 to convert the values 0,1,2,3,4 to 0,1,2,0,3, and a larger one to map groups of 4 3-bit values from the source array to the corresponding byte value in the packed format:

#include <stdint.h>

/// Compress by packing ...
void compressByPacking0(uint8_t *out, uint8_t *in, uint32_t length) {
    static uint8_t lookup[4096];
    static const uint8_t lookup5[8] = { 0, 1, 2, 0, 3, 0, 0, 0 };

    if (lookup[0] == 0) {    
        /* initialize lookup table */
        for (int i = 0; i < 4096; i++) {
            lookup[i] = (lookup5[(i >> 0) & 7] << 0) +
                        (lookup5[(i >> 3) & 7] << 2) +
                        (lookup5[(i >> 6) & 7] << 4) +
                        (lookup5[(i >> 9) & 7] << 6);
        }
    }
    for (; length >= 4; length -= 4, in += 4, out++) {
         *out = lookup[(in[0] << 9) + (in[1] << 6) + (in[2] << 3) + (in[3] << 0)];
    }
    uint8_t last = 0;
    switch (length) {
      case 3:
        last |= lookup5[in[2]] << 4;
        /* fall through */
      case 2:
        last |= lookup5[in[1]] << 2;
        /* fall through */
      case 1:
        last |= lookup5[in[0]] << 0;
        *out = last;
        break;
    }
}

Notes:

The code assumes the array does not contain values outside the specified range. Extra protection against spurious input can be achieved at a minimal cost.
The dummy << 0 are here only for symmetry and compile to no extra code.
The lookup table could be initialized statically, via a build time script or a set of macros.
You might want to unroll this loop 4 or more times, or let the compiler decide.

You could also use this simpler solution with a smaller lookup table accessed more often. Careful benchmarking will tell you which is more efficient on your target system:

/// Compress by packing ...
void compressByPacking1(uint8_t *out, uint8_t *in, uint32_t length) {
    static const uint8_t lookup[4][5] = {
        { 0 << 6, 1 << 6, 2 << 6, 0 << 6, 3 << 6 },
        { 0 << 4, 1 << 4, 2 << 4, 0 << 4, 3 << 4 },
        { 0 << 2, 1 << 2, 2 << 2, 0 << 2, 3 << 2 },
        { 0 << 0, 1 << 0, 2 << 0, 0 << 0, 3 << 0 },
    };

    for (; length >= 4; length -= 4, in += 4, out++) {
         *out = lookup[0][in[0]] + lookup[1][in[1]] +
                lookup[2][in[2]] + lookup[3][in[3]];
    }
    uint8_t last = 0;
    switch (length) {
      case 3:
        last |= lookup[2][in[2]];
        /* fall through */
      case 2:
        last |= lookup[1][in[1]];
        /* fall through */
      case 1:
        last |= lookup[0][in[0]];
        *out = last;
        break;
    }
}

Here is yet another approach, without any tables:

/// Compress by packing ...
void compressByPacking2(uint8_t *out, uint8_t *in, uint32_t length) {
#define BITS ((1 << 2) + (2 << 4) + (3 << 8))
    for (; length >= 4; length -= 4, in += 4, out++) {
         *out = ((BITS << 6 >> (in[0] + in[0])) & 0xC0) +
                ((BITS << 4 >> (in[1] + in[1])) & 0x30) +
                ((BITS << 2 >> (in[2] + in[2])) & 0x0C) +
                ((BITS << 0 >> (in[3] + in[3])) & 0x03);
    }
    uint8_t last = 0;
    switch (length) {
      case 3:
        last |= (BITS << 2 >> (in[2] + in[2])) & 0x0C;
        /* fall through */
      case 2:
        last |= (BITS << 4 >> (in[1] + in[1])) & 0x30;
        /* fall through */
      case 1:
        last |= (BITS << 6 >> (in[0] + in[0])) & 0xC0;
        *out = last;
        break;
    }
}

Here is a comparative benchmark on my system, Macbook pro running OS/X, with clang -O2:

compressByPacking(1MB) -> 0.867ms
compressByPacking0(1MB) -> 0.445ms
compressByPacking1(1MB) -> 0.538ms
compressByPacking2(1MB) -> 0.824ms

The compressByPacking0 variant is fastest, almost twice as fast as your code. It is a little disappointing, but the code is portable. You might squeeze more performance using handcoded SSE optimizations.

The one compiler I tried (`clang` ToT) was able to unroll the loop, FYI. — Brian Cain, Jul 18 '17 at 21:13
@chqrlie True. Yet the loop if then using 3 counters. Seems like at least 1 more than needed. — chux - Reinstate Monica, Jul 18 '17 at 21:26
@chux: I could use a boundary pointer on `in` and no longer update `length`, same number of registers, possibly a fraction of cycle less, if at all. Also less likely to get unrolled by the compiler. — chqrlie, Jul 18 '17 at 21:29

score 1 · Answer 2 · answered Jul 18 '17 at 21:10

I have a large array (around 1 MB)

Either this is a typo, your target is seriously aging, or this compression operation is invoked repeatedly in the critical path of your application.

Any code snippets or suggestion on how to make it more efficient or faster (hopefully keeping the readability) will be very much helpful.

In general, you will find the best information by empirically measuring the performance and inspecting the generated code. Using profilers to determine what code is executing, where there are cache misses and pipeline stalls -- these can help you tune your algorithm.

For example, you chose a stride of 4 elements. Is that just because you are mapping four input elements to a single byte? Can you use native SIMD instructions/intrinsics to operate on more elements at a time?

Also, how are you compiling for your target and how well is your compiler able to optimize your code?

Let's ask clang whether it finds any problems trying to optimize your code:

$ clang -fvectorize  -O3  -Rpass-missed=licm -c tryme.c 
tryme.c:11:28: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
        temp[small_loop] = *in;           // Load into local variable
                           ^
tryme.c:21:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
      *out = (uint8_t)((temp[0] & 0x03) << 6) |
                        ^
tryme.c:22:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
                      ((temp[1] & 0x03) << 4) |
                        ^
tryme.c:23:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
                      ((temp[2] & 0x03) << 2) |
                        ^
tryme.c:24:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
                      ((temp[3] & 0x03));
                        ^

I'm not sure but maybe alias analysis is what makes it think it can't move this load. Try playing with __restrict__ to see if that has any effect.

$ clang -fvectorize  -O3  -Rpass-analysis=loop-vectorize  -c tryme.c 
tryme.c:13:13: remark: loop not vectorized: loop contains a switch statement [-Rpass-analysis=loop-vectorize]
        if (temp[small_loop] == 3)        // 3's are discarded

I can't think of anything obvious that you can do about this one unless you change your algorithm. If the compression ratio is satisfactory without deleting the 3s, you could perhaps eliminate this.

So what's the generated code look like? Take a look below. How could you write it better by hand? If you can write it better yourself, either do that or feed it back into your algorithm to help guide the compiler.

Does the compiled code take advantage of your target's instruction set and registers?

Most importantly -- try executing it and see where you're spending the most cycles. Stalls from branch misprediction, unaligned loads? Maybe you can do something about those. Use what you know about the frequency of your input data to give the compiler hints about the branches in your encoder.

$ objdump -d --source tryme.o
...
0000000000000000 <compressByPacking>:
#include <stdint.h>

void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
    for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
   0:   c1 ea 02                shr    $0x2,%edx
   3:   0f 84 86 00 00 00       je     8f <compressByPacking+0x8f>
   9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
    {
      uint8_t temp[4];

      for (int small_loop = 0; small_loop < 4; small_loop++)
      {
        temp[small_loop] = *in;           // Load into local variable
  10:   8a 06                   mov    (%rsi),%al

        if (temp[small_loop] == 3)        // 3's are discarded
  12:   3c 04                   cmp    $0x4,%al
  14:   74 3a                   je     50 <compressByPacking+0x50>
  16:   3c 03                   cmp    $0x3,%al
  18:   41 88 c0                mov    %al,%r8b
  1b:   75 03                   jne    20 <compressByPacking+0x20>
  1d:   45 31 c0                xor    %r8d,%r8d
  20:   3c 04                   cmp    $0x4,%al
  22:   74 33                   je     57 <compressByPacking+0x57>
  24:   3c 03                   cmp    $0x3,%al
  26:   88 c1                   mov    %al,%cl
  28:   75 02                   jne    2c <compressByPacking+0x2c>
  2a:   31 c9                   xor    %ecx,%ecx
  2c:   3c 04                   cmp    $0x4,%al
  2e:   74 2d                   je     5d <compressByPacking+0x5d>
  30:   3c 03                   cmp    $0x3,%al
  32:   41 88 c1                mov    %al,%r9b
  35:   75 03                   jne    3a <compressByPacking+0x3a>
  37:   45 31 c9                xor    %r9d,%r9d
  3a:   3c 04                   cmp    $0x4,%al
  3c:   74 26                   je     64 <compressByPacking+0x64>
  3e:   3c 03                   cmp    $0x3,%al
  40:   75 24                   jne    66 <compressByPacking+0x66>
  42:   31 c0                   xor    %eax,%eax
  44:   eb 20                   jmp    66 <compressByPacking+0x66>
  46:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4d:   00 00 00 
  50:   41 b0 03                mov    $0x3,%r8b
  53:   3c 04                   cmp    $0x4,%al
  55:   75 cd                   jne    24 <compressByPacking+0x24>
  57:   b1 03                   mov    $0x3,%cl
  59:   3c 04                   cmp    $0x4,%al
  5b:   75 d3                   jne    30 <compressByPacking+0x30>
  5d:   41 b1 03                mov    $0x3,%r9b
  60:   3c 04                   cmp    $0x4,%al
  62:   75 da                   jne    3e <compressByPacking+0x3e>
  64:   b0 03                   mov    $0x3,%al
          temp[small_loop] = 3;

      } // end small loop

      // Pack the bits into write pointer
      *out = (uint8_t)((temp[0] & 0x03) << 6) |
  66:   41 c0 e0 06             shl    $0x6,%r8b
                      ((temp[1] & 0x03) << 4) |
  6a:   c0 e1 04                shl    $0x4,%cl
  6d:   80 e1 30                and    $0x30,%cl
          temp[small_loop] = 3;

      } // end small loop

      // Pack the bits into write pointer
      *out = (uint8_t)((temp[0] & 0x03) << 6) |
  70:   44 08 c1                or     %r8b,%cl
                      ((temp[1] & 0x03) << 4) |
                      ((temp[2] & 0x03) << 2) |
  73:   41 c0 e1 02             shl    $0x2,%r9b
  77:   41 80 e1 0c             and    $0xc,%r9b
                      ((temp[3] & 0x03));
  7b:   24 03                   and    $0x3,%al

      } // end small loop

      // Pack the bits into write pointer
      *out = (uint8_t)((temp[0] & 0x03) << 6) |
                      ((temp[1] & 0x03) << 4) |
  7d:   44 08 c8                or     %r9b,%al
                      ((temp[2] & 0x03) << 2) |
  80:   08 c8                   or     %cl,%al
          temp[small_loop] = 3;

      } // end small loop

      // Pack the bits into write pointer
      *out = (uint8_t)((temp[0] & 0x03) << 6) |
  82:   88 07                   mov    %al,(%rdi)
#include <stdint.h>

void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
    for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
  84:   48 83 c6 04             add    $0x4,%rsi
  88:   48 ff c7                inc    %rdi
  8b:   ff ca                   dec    %edx
  8d:   75 81                   jne    10 <compressByPacking+0x10>
                      ((temp[1] & 0x03) << 4) |
                      ((temp[2] & 0x03) << 2) |
                      ((temp[3] & 0x03));

    } // end loop
 }
  8f:   c3                      retq

Cane "Either this is a typo, your target is seriously aging, " :) It is not a typo, my code is for an embdded device. I am saving this array in flash at certain milestones or update intervals and the whole compression thingy is to save flash memory (we don't have much of that). The code is cross-compiled for a ARM9 target using gcc-arm-eabi and I am using optimization level 2 ("O2"). I wasn't aware of the 'clang' feature; I would try that. I admit I didn't dig enough so thanks for that. I was "hoping" that there must be an 'obvious' trick that I am missing. — avikpram, Jul 18 '17 at 21:26
"I have a large array (around 1 MB) Either this is a typo, your target is seriously aging, ..." is not needed. Billions of small processors are make each year, 2017, that have very restrictive memory compare to PCs, etc. C is very popular there and concerns about speed and smaller memory requirements are legitimate - even if does not apply in OP's case. — chux - Reinstate Monica, Jul 18 '17 at 21:32
@chux that may be true but ARM9s are not small processors and a 1MB array is not what I'd call "large" for an ARM9. — Brian Cain, Jul 18 '17 at 22:47

score 1 · Answer 3 · answered Jul 18 '17 at 23:07

In all the excitement about performance, functionality is overlooked. Code is broke.

    // temp[small_loop] = *in;           // Load into local variable
    temp[small_loop] = in[small_loop];

Alternative:

How about a simple tight loop?

Use const and restrict to allow various optimizations.

void compressByPacking1(uint8_t* restrict out, const uint8_t* restrict in,
    uint32_t length) {
  static const uint8_t t[5] = { 0, 1, 2, 0, 3 };
  uint32_t length4 = length / 4;
  unsigned v = 0;
  uint32_t i;
  for (i = 0; i < length4; i++) {
    for (unsigned j=0; j < 4; j++) {
      v <<= 2;
      v |= t[*in++];
    }
    out[i] = (uint8_t) v;
  }
  if (length & 3) {
    v = 0;
    for (unsigned j; j < 4; j++) {
      v <<= 2;
      if (j < (length & 3)) {
        v |= t[*in++];
      }
    }
    out[i] = (uint8_t) v;
  }
}

Tested and found this code to be about 270% as fast (41 vs 15) (YMMV).
Tested and found to form the same output as OP's (corrected) code

have you decompressed to check if it works ok? – 0___________ Jul 18 '17 at 23:15 — 0___________, Jul 18 '17 at 23:15

0___________ · Answer 4 · 2017-07-18T23:06:24.267

-1

Update: Tested

Unsafe version is the fastest - fastest than other ones in another answers. Tested with VS2017

const uint8_t table[4][5] = 
{ { 0 << 0,1 << 0,2 << 0,0 << 0,3 << 0 },
  { 0 << 2,1 << 2,2 << 2,0 << 2,3 << 2 },
  { 0 << 4,1 << 4,2 << 4,0 << 4,3 << 4 },
  { 0 << 6,1 << 6,2 << 6,0 << 6,3 << 6 },
};



void code(uint8_t *in, uint8_t *out, uint32_t len)
{
    memset(out, 0, len / 4 + 1);
    for (uint32_t i = 0; i < len; i++)
        out[i / 4] |= table[i & 3][in[i] % 5];
}

void code_unsafe(uint8_t *in, uint8_t *out, uint32_t len)
{
    for (uint32_t i = 0; i < len; i += 4, in += 4, out++)
    {
        *out = table[0][in[0]] | table[1][in[1]] | table[2][in[2]] | table[3][in[3]];
    }
}

To check how it is written it is enough to compile it - even online

https://godbolt.org/g/Z75NQV

There are small very simple my coding functions - just for comparition of the compiler generated code, not tested.

edited Jul 18 '17 at 23:06

answered Jul 18 '17 at 21:15

0___________

60,014
4
34
74

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/16753552) – shoover Jul 18 '17 at 23:01
Downvoted ?????? – 0___________ Jul 18 '17 at 23:07
Maybe you could explain why the unsafe version is unsafe. – Retired Ninja Jul 18 '17 at 23:20
Does not check if the input has valid values, if not it is an UB. Number of input data items has to be dividable by 4. Otherwise it will read outside the input table which is an another UB. So it is unsafe – 0___________ Jul 18 '17 at 23:29

score -2 · Answer 5 · answered Jul 18 '17 at 20:49

-2

Does this look clearer?

void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
    assert( 0 == length % 4 );
    for (int loop = 0; loop < length; loop += 4)
    {
      uint8_t temp = 0;
      for (int small_loop = 0; small_loop < 4; small_loop++)
      {
        uint8_t inv = *in;  // get next input value
        switch(inv)
        {
          case 0:  // encode as 00
          case 3:  // change to 0
             break;
          case 1:
              temp |= (1 << smal_loop*2); // 1 encode as '01'
              break;
          case 2:
              temp |= (2 << smal_loop*2);  // 2 encode as '10'
              break;
          case 4:
              temp |= (3 << smal_loop*2);  // 4 encode as '11'
              break;
          default:
              assert(0);
        }
      } // end inner loop

      *out = temp;

    } // end outer loop
 }

answered Jul 18 '17 at 20:49

ddbug

1,392
1
11
25

1

Please explain too – Pushan Gupta Jul 18 '17 at 20:49
OP requested "make it more efficient or faster ", not "clearer". Have you profiled the original and this approach? – chux - Reinstate Monica Jul 18 '17 at 20:50
1

Code does not compile. – chux - Reinstate Monica Jul 18 '17 at 20:53
@chux: My variant uses one local variable 'temp' that most likely will be cached in a register and all bit or operations will be performed on a register. The original variant has an array on a stack. Of course a good compiler can optimize this too, but that is less obvious. (OP: if 'assert' does not compile, comment it out for now, until you learn what it means) – ddbug Jul 18 '17 at 20:55
@ddbug what do you mean by "operations will be performed on a register" ? – Tony Tannous Jul 18 '17 at 20:57
@TonyTannous C code is translated to machine instructions. Most normal machines' CPUs have registers - internal very fast storage, compared to memory. Machine instructions that deal with registers usually are much faster and shorter than references to memory (even with cache). The 'bit OR' operation (|=) most likely will compile into a small number of machine instructions. – ddbug Jul 18 '17 at 21:02

Compressing a 'char' array using bit packing in C

5 Answers5