2

I need to write a series of unsigned integers to a file, each one being no greater than a limit n determined at runtime. To save space, I want to pack them into as few bytes as possible. However, I have no idea how to compute the minimum number of bytes needed to hold them, so I only have the following ugly solution:

int get_needed_bytes(uint32_t n) {
    if (n < 256) return 1;
    else if (n < 65536) return 2;
    else if (n < 16777216) return 3;
    return 4;
}

Is there a better way to achieve the same purpose?

michaelmeyer
  • Good enough. Q. for you: how are you going to read them back in again? – Jongware Mar 25 '14 at 21:27
  • I don't suppose you're familiar with how an ASN.1 INTEGER type is encoded. Something tells me you may find it... informative. – WhozCraig Mar 25 '14 at 21:34
  • If you know the max. number of bytes you need, as it seems from your code, your approach is fast and easy to understand. – ChronoTrigger Mar 25 '14 at 21:50
  • @Jongware: Perhaps I didn't make this clear, but I need all the integers to have the same size (whatever it is), so writing and reading them back is ok. – michaelmeyer Mar 25 '14 at 21:51
  • @doukremt Do you explicitly include how many numbers there are? Otherwise, how will you differentiate `[1, 2, 3, 4] -> 01 02 03 04` from `[0x1020304] -> 01 02 03 04`? –  Mar 25 '14 at 21:53
  • @delnan: I plan to write a single byte at the beginning of the file to indicate the chosen integer size. – michaelmeyer Mar 25 '14 at 21:55
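
The scheme described in these comments (one header byte giving the integer size, followed by fixed-width values) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `needed_bytes`, `pack_values` and `unpack_values` are made up for this example, the values are written little-endian into a memory buffer rather than a file, and the byte count is computed with a shift loop instead of the if/else chain from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest byte count (1..4) able to hold every value in 0..n. */
int needed_bytes(uint32_t n)
{
    int bytes = 1;
    while (n >>= 8)   /* drop one byte at a time until nothing is left */
        bytes++;
    return bytes;
}

/* Hypothetical helper: writes a 1-byte width header followed by `count`
   fixed-width little-endian values. `out` must have room for
   1 + count * 4 bytes in the worst case. Returns the bytes written. */
size_t pack_values(const uint32_t *values, size_t count,
                   uint32_t limit, uint8_t *out)
{
    int width = needed_bytes(limit);
    size_t pos = 0;

    out[pos++] = (uint8_t)width;
    for (size_t i = 0; i < count; i++) {
        assert(values[i] <= limit);   /* caller promised this bound */
        for (int b = 0; b < width; b++)
            out[pos++] = (uint8_t)(values[i] >> (8 * b));
    }
    return pos;
}

/* Hypothetical helper: reads the header byte, then decodes as many
   values as fit in `len` bytes (at most `max_count`).
   Returns the number of values decoded. */
size_t unpack_values(const uint8_t *in, size_t len,
                     uint32_t *values, size_t max_count)
{
    if (len == 0)
        return 0;

    size_t width = in[0];
    assert(width >= 1 && width <= 4);

    size_t count = (len - 1) / width;
    if (count > max_count)
        count = max_count;

    for (size_t i = 0; i < count; i++) {
        uint32_t v = 0;
        for (size_t b = 0; b < width; b++)
            v |= (uint32_t)in[1 + i * width + b] << (8 * b);
        values[i] = v;
    }
    return count;
}
```

Because every value has the same width, the reader can also recover the number of values from the buffer (or file) length, so no explicit count is needed in this sketch.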

2 Answers

2

You might try something along these lines (untested).

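#include <stdint.h>

int GetNeededBytes(uint32_t n)
{
    // Maximum number of bytes supported
    int bytes = 4;
    // Mask for the highest-order byte: 0xff000000 for 4 bytes.
    // The shift is (bytes - 1) * 8 on an unsigned value; shifting
    // by bytes * 8 (32 bits) would overflow.
    uint32_t mask = (uint32_t)0xff << ((bytes - 1) * 8);

    while (bytes > 0)
    {
        // The highest byte with any bit set decides the size
        if (n & mask)
            return bytes;
        mask >>= 8;
        bytes--;
    }
    // n == 0: no byte has a bit set
    return 0;
}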

But I'm not sure why this is a good idea. In order to read the values back, you need a way to flag how many bytes represent the next value. I suspect that count value will take away most of the bytes you saved.

There are better compression techniques available.

Jonathan Wood
  • According to real data analysis (I have a usable dataset but can't publish it), this saves on average 1-2 bytes per uint32 even if you have to use a lead byte for size. I was in a case where the number of bytes required could be encoded in spare bits in the preceding control byte. – Joshua Mar 25 '14 at 21:35
  • I'm sure this would depend on your data. There are better compression techniques available. – Jonathan Wood Mar 25 '14 at 21:42
  • Just for laughs, as it doesn't concern the OP's idea: when storing *unknown* byte lengths in the file, would it be possible to mimic UTF-8 encoding and use the most significant bit to store 'more to follow'? That decreases the range for 1 byte to 0..127 (and similarly, halves the range of more bytes) but may be 'convenient enough'. – Jongware Mar 25 '14 at 23:22
  • I had considered that too. It would require a bit more work because the most significant bit would then not be available for byte values. So that bit would need to be shifted to the next byte, and so on. I suspect that would be more efficient in bytes used, although less efficient performance-wise. – Jonathan Wood Mar 26 '14 at 04:47
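
The "more to follow" idea from the last two comments can be sketched like this. It is a minimal illustration, assuming a LEB128-style layout (7 payload bits per byte, low-order group first), which is one common way to do it and not necessarily what either commenter had in mind; `varint_encode` and `varint_decode` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes `value` with 7 payload bits per byte, low-order bits first.
   A set high bit means "more bytes follow". `out` needs room for up
   to 5 bytes. Returns the number of bytes written. */
size_t varint_encode(uint32_t value, uint8_t *out)
{
    size_t len = 0;
    while (value >= 0x80) {
        out[len++] = (uint8_t)((value & 0x7F) | 0x80);
        value >>= 7;
    }
    out[len++] = (uint8_t)value;   /* last byte: high bit clear */
    return len;
}

/* Decodes one value from `in`. Returns the number of bytes consumed,
   or 0 if the input is truncated or runs past 5 bytes. Bits beyond
   the 32nd in a fifth byte are silently dropped in this sketch. */
size_t varint_decode(const uint8_t *in, size_t len, uint32_t *value)
{
    uint32_t result = 0;
    for (size_t i = 0; i < len && i < 5; i++) {
        result |= (uint32_t)(in[i] & 0x7F) << (7 * i);
        if (!(in[i] & 0x80)) {
            *value = result;
            return i + 1;
        }
    }
    return 0;
}
```

With this layout one byte covers 0..127 and two bytes cover 0..16383, while a full 32-bit value needs five bytes, so whether it beats fixed-width storage depends on how the values are distributed.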
2

Another approach is to use one of several compression libraries (zlib, bzip2, etc.), which will likely encode your data in fewer bytes, unless the data does not compress well (for example, purely random data cast to integers, where the compressed output can end up larger than the input).

Alex Reynolds
  • If the OP's data doesn't make use of the integer size (which is an assumption of the original approach) then the data would contain an effectively compressible bit pattern. – zakinster Mar 25 '14 at 21:44