Code and decode variable length integer in binary format efficiently

Question

I need to store integers in the range 0 to 50,000,000 in a binary file and decode them later. To save space, I store the number of bytes required to decode the integer in the first 2 bits of the first byte, e.g. 01XXXXXX means that 2 bytes are required to store the number. I am having trouble with the implementation: the number I get after decoding is not correct. Here is my sample code:

int main()
{
    FILE *input = NULL,  
            *output = NULL;

    output = fopen("sample.bin","wb+");

    unsigned int num =  32594; //(this number would come from input file)
    char buff_ts[4];
    sprintf(buff_ts,"%d",num);
    setBitAt(buff_ts, sizeof(buff_ts), 23, 1); // set first two bits
    fwrite(buff_ts,1,sizeof(buff_ts),output);
    fclose(output);

    input = fopen("sample.bin", "rb");

    int diff;
    char buff[1];
    fread(buff,1,1,input);
    char buff_copy = buff[0];
    int temp = atoi(buff);

    int more_bytes_to_read = (temp>>6); // read first 2 bits
    buff_copy = buff_copy & ((1<<6)-1); // reset first 2 bits

    if(more_bytes_to_read==0) // if no more bytes to read
    {
        diff = buff_copy;
    }
    else
    {
        char extra_buff[more_bytes_to_read];
        fread(extra_buff,1,sizeof(extra_buff),input); // read extra bytes
        char num_buf[more_bytes_to_read+1];
        num_buf[0] = buff_copy;  // copy prev read buffer
        for(int i=1;i<=more_bytes_to_read;i++)
        {
            num_buf[i] = extra_buff[i-1];
        }
        diff = atoi(num_buf);
    }
        cout<<diff<<endl;

        return 0;
}

This range fits nicely into 4 bytes. You can save about 5 bits per number. Is it worth it, especially if using two extra bits (so 3 are saved)? (BTW, this code is not C) — Eugene Sh., Mar 30 '16 at 17:59
I have quite a lot of data, around 50 K values, and most of the values would fit in under 1 byte, so it should save quite a lot of space. — learner, Mar 30 '16 at 18:02
Then I would suggest the [Shannon coding](https://en.wikipedia.org/wiki/Shannon_coding) scheme. — Eugene Sh., Mar 30 '16 at 18:04
You allocate a buffer `buff_ts` with a size of 4 chars. Then you use sprintf to print the number "32594136" into that buffer, which takes 8 chars plus 1 more char for zero-termination. That will give you a buffer overflow with more or less random results. Storing numbers as human-readable strings is usually not what is meant by a "binary file". Decoding numbers from a binary file is usually not done with a call to `atoi`. Why not simply store your 32-bit integer as 32 bits in a binary file? — Henrik Carlqvist, Mar 30 '16 at 18:04
I have used buffer of 4 chars = 4*8 bits = 32 bits so that it can store integer range and 32594136 is declared as int. I have used atoi to convert the char buffer that I have read from the input file to int format. — learner, Mar 30 '16 at 18:10
But you use sprintf to write more than 4 chars into your buffer. `sprintf(buff_ts,"%d",num);` will write the string "32594136" into your buffer when num is 32594136. The string "32594136" will need a buffer with the size of at least 9 chars including the string terminating zero. — Henrik Carlqvist, Mar 30 '16 at 18:12
OK, I understand that it overflows in this case, but I was getting incorrect results for small numbers too. — learner, Mar 30 '16 at 18:15
The `sprintf()` is incorrect for small numbers, too. The 'f' in `sprintf()` is mnemonic for "formatted", which, roughly speaking, is the *opposite* of binary. Perhaps you want `memcpy()` instead. — John Bollinger, Mar 30 '16 at 18:17
Then again, `memcpy()` makes your program sensitive to details of integer representation. I guess you really want an arithmetic solution for putting the bytes of your number into `buff_ts`. — John Bollinger, Mar 30 '16 at 18:19
memcpy could be used, but for writing there is no big need for a buffer if you really want a binary file. You could simply do `fwrite(&num,1,sizeof(num),output);` There is no big point in storing the size of a number in two bits, you will not be able to write anything smaller than an entire byte to a file. — Henrik Carlqvist, Mar 30 '16 at 18:22
@Henrik At time of decoding I need to know how many bytes I need to read from the file. Since there are small numbers too storing everything as 4 byte is not an option. — learner, Mar 30 '16 at 18:24
Is this a PC, or is it an MPU project with limited resources? In the first case, 50K values is not "large", and hardly worth the effort of creating a file which is only of use with your personal decoder. Keep it simple. — Weather Vane, Mar 30 '16 at 18:27
The problem here is that you're optimizing before you've mastered the basics. Start by writing/reading the numbers without using compression. Once you've got that working, then you can think about compression, but frankly, 50K values stored as 50K bytes versus 200K bytes doesn't amount to anything on a modern machine. — user3386109, Mar 30 '16 at 18:28
Here is the main problem which I was working on - http://stackoverflow.com/questions/36245562/compressing-unix-timestamps-with-microseconds-accuracy/36273954#36273954 This implementation is corresponding to this problem only. I had done the basic reading/writing now need to improve upon compression ratio. — learner, Mar 30 '16 at 18:35
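The comments above come down to one point: `sprintf` writes decimal text, while a binary file needs the bytes of the value itself. A minimal sketch of the difference follows; the helper names `text_size` and `put_be16` are hypothetical and do not appear in the thread, and the byte order (most significant byte first) is an assumption chosen for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: number of chars sprintf("%u") would need for v,
   including the terminating NUL. snprintf with a NULL buffer and size 0
   only measures the output (C99 behaviour). */
static int text_size(uint32_t v)
{
    return snprintf(NULL, 0, "%u", (unsigned)v) + 1;
}

/* Hypothetical helper: store the low 16 bits of v in buf, most significant
   byte first, using only shifts and masks (no dependence on the host's
   byte order). */
static void put_be16(uint32_t v, unsigned char *buf)
{
    buf[0] = (unsigned char)((v >> 8) & 0xff);
    buf[1] = (unsigned char)(v & 0xff);
}
```

For 32594, `text_size` returns 6 (five digits plus the NUL), which already overflows the 4-char `buff_ts` in the question, whereas `put_be16` produces the two bytes 0x7f 0x52.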

score 2 · Answer 1 · answered Mar 30 '16 at 19:30

#include <stdio.h>
#include <stdint.h>

int little_endian(void)
{
   uint16_t t=0x01;
   char *p = (char *)&t;

   return (*p > *(p+1));   
}

uint32_t swap_bytes(uint32_t i)
{
   uint32_t o;
   char *p = (char *)&i;
   char *q = (char *)&o;

   q[0]=p[3];
   q[1]=p[2];
   q[2]=p[1];
   q[3]=p[0];

   return o;
}

uint32_t fix_endian(uint32_t i)
{
   if(little_endian())
      return swap_bytes(i);
   else
      return i;
}

int encode_num(uint32_t num, char *buf)
{
   int extra_bytes_needed;
   uint32_t *p = (uint32_t *) buf;
   if(num <= 0x3f)
      extra_bytes_needed=0;
   else if(num <= 0x3fff)
      extra_bytes_needed=1;
   else if(num <= 0x3fffff)
      extra_bytes_needed=2;
   else if(num <= 0x3fffffff)
      extra_bytes_needed=3;
   else
      return 0; /* value needs more than 30 bits and cannot be encoded */

   *p = fix_endian(num);
   if(little_endian())
      *p = *p >> (8*(3 - extra_bytes_needed));
   else
      *p = *p << (8*(3 - extra_bytes_needed));

   *buf |= extra_bytes_needed << 6;

   return extra_bytes_needed + 1;
}

int main()
{
    FILE *input = NULL,  
       *output = NULL;
    int i;
    uint32_t nums[10] = {32594136, 1, 2, 3, 4, 5, 6, 7, 8 , 193};
    char buff_ts[4];
    unsigned char c;
    int len;
    uint32_t num;
    int more_bytes_to_read;

    output = fopen("sample.bin","wb+");

    for(i=0; i<10; i++)
    {
       len = encode_num(nums[i], buff_ts);
       fwrite(buff_ts,1,len,output);
    }
    fclose(output);

    input = fopen("sample.bin", "rb");

    while(fread(&c,1,1,input)==1)
    {
       more_bytes_to_read=c>>6;
       num = c & 0x3f;
       while(more_bytes_to_read--)
       {
          fread(&c,1,1,input);
          num <<= 8;
          num |= c;
       }
       printf("Read number %u\n", (unsigned)num); /* num is unsigned */
    }
    fclose(input);
    return 0;
}

answered Mar 30 '16 at 19:30

Henrik Carlqvist

Given some thought, the code would be cleaner if using only shifts and masks instead of those endian checks and conditional byte swaps. Now that improvement is left as an exercise for the interested reader :-) – Henrik Carlqvist Mar 30 '16 at 19:43
I have a doubt about the encode_num function: after the call to fix_endian, why do we need to check for endianness again? Suppose I have the number 32594; after fix_endian we would have 00000000 00000000 0111 1111 0101 0010, and then we need to write 2 bits at the MSB. On a little-endian machine *p would point at the first byte, so we shift right by 8*(3-2) = 8 bits to the 3rd byte and write extra_bytes_needed there, but I am not sure what would happen on a big-endian machine. How would shifting left serve our purpose? Let me know if I am understanding it correctly. – learner Mar 31 '16 at 03:21
You might be right, I have only tested the code on a little-endian machine. The idea, however, is that you don't always write all 4 bytes of your 32-bit integer. You only write the non-zero bytes. But as you write the first bytes, and the first bytes are zero for small numbers on a big-endian system, the 32-bit number is left shifted on big-endian systems. On little-endian systems, however, the number is right shifted. But as I wrote in my comment, all this byte swapping and checking for endianness could be done more cleanly by only masking and shifting. – Henrik Carlqvist Mar 31 '16 at 06:45
@HenrikCarlqvist Could you use `x * 256` and `x / 65536` as an endian-independent version of `x << 8` and `x >> 16`? – m69's been on strike for years Mar 31 '16 at 15:54
@m69 `x*256` and `x/65536` are, as you say, endian-independent. However, `x<<8` and `x>>16` do exactly the same and are also endian-independent as long as you only look at your data as integers. Shifting, multiplication and division all become endian-dependent when you start looking at your data as a sequence of bytes. – Henrik Carlqvist Mar 31 '16 at 16:53
(Oh, OK; I thought the endian problem also occurred at the bit level.) Anyway, if you split the 32-bit input values into 1-4 bytes with shifts and masks, and write both the storing and retrieving functions using `fwrite(pointer_to_uint_8, 1, number_of_bytes, output)` and `fread(pointer_to_uint_8, 1, number_of_bytes, input)`, then surely the endianness doesn't come into play? The sequence of bytes is never treated as 16 or 32-bit integers. – m69's been on strike for years Mar 31 '16 at 18:16
Yes, those shifts and masks to split the 32-bit values into bytes was the suggestion in my first comment to my own answer to avoid endian problems. – Henrik Carlqvist Mar 31 '16 at 20:21
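The shifts-and-masks variant discussed in these comments could look roughly like the sketch below. It is not the answerer's code: the function names `encode_varint` and `decode_varint` are hypothetical, and it assumes the same layout as the answer (two length bits at the top of the first byte, remaining bytes most significant first). Because the value is only ever handled as an integer and the buffer only as individual bytes, no endianness check is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode num into buf, most significant byte first. The top two bits of
   buf[0] hold the number of extra bytes (0..3). buf must have room for
   4 bytes. Returns the number of bytes written (1..4), or 0 if num needs
   more than 30 bits. */
size_t encode_varint(uint32_t num, unsigned char *buf)
{
    unsigned extra;
    size_t i;

    if (num <= 0x3f)            extra = 0;
    else if (num <= 0x3fff)     extra = 1;
    else if (num <= 0x3fffff)   extra = 2;
    else if (num <= 0x3fffffff) extra = 3;
    else return 0;

    /* Peel off one byte at a time, highest byte first. */
    for (i = 0; i <= extra; i++)
        buf[i] = (unsigned char)((num >> (8 * (extra - i))) & 0xff);

    /* The first byte uses at most 6 bits, so the length prefix fits on top. */
    buf[0] |= (unsigned char)(extra << 6);
    return extra + 1;
}

/* Decode one value from buf (len bytes available) into *out.
   Returns the number of bytes consumed, or 0 if the input is truncated. */
size_t decode_varint(const unsigned char *buf, size_t len, uint32_t *out)
{
    size_t extra, i;
    uint32_t num;

    if (len == 0)
        return 0;
    extra = buf[0] >> 6;        /* length prefix */
    if (len < extra + 1)
        return 0;
    num = buf[0] & 0x3f;        /* strip the prefix */
    for (i = 1; i <= extra; i++)
        num = (num << 8) | buf[i];
    *out = num;
    return extra + 1;
}
```

To write a value, pass the returned length to `fwrite(buf, 1, len, output)`; to read one, `fread` the first byte, take its top two bits as the count of further bytes to read, then decode the assembled buffer.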
