0

What's a nice, readable way of getting the byte representation (i.e. a byte[]) of an int, but only using 3 bytes (instead of 4)? I'm using Hadoop/Hbase and their Bytes utility class has a toBytes function but that will always use 4 bytes.

Ideally, I'd also like a nice, readable way of encoding to as few bytes as possible, i.e. if the number fits in one byte then only use one.

Please note that I'm storing this in a byte[], so I know the length of the array and thus variable length encoding is not necessary. This is about finding an elegant way to do the cast.

moinudin
  • Is byte[0] the LSB or the MSB? The [javadoc](http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(int)) is not clear – Ray Toal Jul 07 '12 at 00:07
  • You need to ensure that when you use a different type, it is actually smaller or faster (or whatever your priority is). For example, you could replace an `int` with a `byte[]`, but it would be much bigger, slower and more difficult to use. – Peter Lawrey Jul 07 '12 at 07:36
  • LSB/MSB, little endian/big endian? Do you have ranges or other spurious information in the integer etc. etc. – Maarten Bodewes Jul 07 '12 at 10:29
  • @PeterLawrey I'm aware of the consequences, but given the context (storage in hbase) that is of negligible impact as it doesn't affect storage requirements. – moinudin Jul 09 '12 at 20:59
  • @owlstead Big endian. All values are positive. Most will fit in a few bits. – moinudin Jul 09 '12 at 21:01
  • Have you considered stop bit encodings? Numbers with 1-7 bits use one byte, up to 14 bits use two bytes, up to 21 bits use three bytes, etc. – Peter Lawrey Jul 09 '12 at 22:06

6 Answers

3

A general solution for this is impossible.

If it were possible, you could apply the function iteratively to obtain unlimited compression of data.

Your domain might have some constraints on the integers that allow them to be compressed to 24 bits. If there are such constraints, please explain them in the question.

A common variable size encoding is to use 7 bits of each byte for data, and the high bit as a flag to indicate when the current byte is the last.
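As a rough illustration of that scheme, here is a minimal sketch. It assumes non-negative values, writes the most significant 7-bit group first, and sets the high bit only on the final byte. The class and method names (`StopBitCodec`, `encode`, `decode`) are made up for this example and do not come from Hadoop or HBase.

```java
import java.io.ByteArrayOutputStream;

final class StopBitCodec {
    // Encodes a non-negative int using 7 data bits per byte, most significant group first.
    // The high bit is set only on the final byte, marking the end of the value.
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        int significantBits = 32 - Integer.numberOfLeadingZeros(value);
        int groups = Math.max(1, (significantBits + 6) / 7);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int g = groups - 1; g >= 0; g--) {
            int bits = (value >>> (g * 7)) & 0x7F;
            out.write(g == 0 ? bits | 0x80 : bits);
        }
        return out.toByteArray();
    }

    // Reads 7-bit groups until a byte with the high bit set is seen.
    static int decode(byte[] data) {
        int value = 0;
        for (byte b : data) {
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) != 0) {
                break;
            }
        }
        return value;
    }
}
```

The flag bit only pays off when several values are concatenated with no external length. When each value sits in its own `byte[]`, the array length already says where the value ends.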


You can predict the number of bytes needed to encode an int with a utility method on Integer:

int n = 4 - Integer.numberOfLeadingZeros(x) / 8;
byte[] enc = new byte[n];
while (n-- > 0) 
  enc[n] = (byte) ((x >>> (n * 8)) & 0xFF);

Note that this will encode 0 as an empty array, and other values in little-endian format. These aspects are easily modified with a few more operations.
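One possible way to make those modifications is sketched below: it writes big-endian output and always emits at least one byte, so 0 becomes a single zero byte. The names `MinimalBytes`, `encode` and `decode` are hypothetical, and the decoder assumes the array length is known, as it is when the value is stored in its own `byte[]`.

```java
final class MinimalBytes {
    // Big-endian encoding of an int in as few bytes as possible (1 to 4).
    // Negative values have no leading zero bits, so they always use 4 bytes.
    static byte[] encode(int x) {
        int n = Math.max(1, 4 - Integer.numberOfLeadingZeros(x) / Byte.SIZE);
        byte[] enc = new byte[n];
        for (int i = 0; i < n; i++) {
            // enc[0] receives the most significant of the n bytes.
            enc[i] = (byte) (x >>> ((n - 1 - i) * Byte.SIZE));
        }
        return enc;
    }

    // Rebuilds the int from 1 to 4 big-endian bytes.
    static int decode(byte[] enc) {
        int x = 0;
        for (byte b : enc) {
            x = (x << Byte.SIZE) | (b & 0xFF);
        }
        return x;
    }
}
```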

erickson
  • This is only true if you are storing this data alongside other bytes. As a single `byte[]`, we know the length and we know all bytes in the array are set. – moinudin Jul 06 '12 at 23:58
  • Yes, I was talking about bytes with no external meta-data, like UTF-8-encoded characters. If you are storing a `length` value separately, I would expect more space to be used overall (4 bytes for the `length`, plus the data itself). – erickson Jul 07 '12 at 00:01
  • Java arrays store the size, no choice there. And Hbase has no knowledge of the data representation so in its own internal representation it encodes the start/end. – moinudin Jul 07 '12 at 00:05
  • @marcog Sorry about the bug in my original code. I didn't remember the details of the API. – erickson Jul 07 '12 at 03:58
  • Too many magics in my opinion. I always use `Byte.SIZE` instead of 8 (where applicable of course) and sometimes define UNSIGNED_BYTE_MAX_VALUE, although the latter may be a bit over the top. Personally, I don't use code that the reader needs to decipher, if it can be avoided. – Maarten Bodewes Jul 07 '12 at 10:27
  • @owlsted Yes, the code would be better with constants, and it should be encapsulated in a separate function for reduced complexity. My main point was to demonstrate the usefulness of the `numberOfLeadingZeros()` method to allocate the array; that is an important piece no other answer touched on. – erickson Jul 07 '12 at 15:56
1

If you need to represent all 2^32 possible 4-byte integers, you need to choose between:

  • fixed-size representation, using 4 bytes always; or
  • variable-size representation, using at least 5 bytes for some numbers.

Take a look at how UTF-8 encodes Unicode characters; you might get some insights. (A short prefix describes how many bytes must be read for that Unicode character, then you read that many bytes and interpret them.)

Bruno Reis
  • I don't believe this is true in this case. Read my comment on @erickson's answer. – moinudin Jul 06 '12 at 23:59
  • I'm not sure I understand what you mean. If you are trying to use less than 4 bytes to represent a 32-bit int, I can only assume you want to save space. It appears, from what you said, "a single byte[]", that you have some other kind of structure to isolate one byte[] from the other. Think about this: this structure will take some space (even if you are not aware of it), therefore you failed to represent 2^32 ints in less than 32 bits. – Bruno Reis Jul 07 '12 at 00:04
  • I'm writing the data to hbase, and it has no knowledge of the structure of data being stored. So it has to represent the start/end or length of bytes. This is true whether I use 3 or 4 bytes. So writing 3 bytes should save space. – moinudin Jul 07 '12 at 00:07
  • The OP is not trying to represent data in less than 4 bytes, but rather to just use arrays of less than 4 elements. Of course the OP knows there is a lot of overhead in the representation, because arrays in Java also have space for their length. If he's writing a sequence of integers with variable byte-length representation, yeah, then metadata is required in order to know where one byte sequence starts and the other one stops.... – Ray Toal Jul 07 '12 at 00:08
1

Try using ByteBuffer. You can even set little endian mode if required:

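import java.nio.ByteBuffer;

int exampleInt = 0x11FFFFFF;
ByteBuffer buf = ByteBuffer.allocate(Integer.SIZE / Byte.SIZE);
final byte[] threeByteBuffer = new byte[3];
buf.putInt(exampleInt);   // ByteBuffer is big-endian by default
buf.position(1);          // skip the high-order byte (0x11 here, which is dropped)
buf.get(threeByteBuffer); // copy the three low-order bytes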

Or the shortest signed, Big Endian:

BigInteger bi = BigInteger.valueOf(exampleInt);
final byte[] shortestSigned = bi.toByteArray();
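To show what the `BigInteger` route produces and how to read it back, here is a small sketch. The wrapper class `ShortestSigned` and its method names are invented for this example; only `BigInteger.valueOf`, `toByteArray`, the `BigInteger(byte[])` constructor and `intValue` are standard library calls.

```java
import java.math.BigInteger;

final class ShortestSigned {
    // Shortest big-endian two's-complement representation, including a sign bit.
    static byte[] encode(int value) {
        return BigInteger.valueOf(value).toByteArray();
    }

    // Interprets the bytes as a signed big-endian number again.
    static int decode(byte[] bytes) {
        return new BigInteger(bytes).intValue();
    }
}
```

Because the representation is signed, 127 fits in one byte but 128 needs two (a leading 0x00 keeps it positive), while -1 fits in a single byte. If all values are known to be non-negative, an unsigned scheme avoids that extra byte.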

Maarten Bodewes
0

Convert your int to a 4-byte array and iterate over it, dropping each high-order byte that is zero.

Something like:

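// Assumes toBytes returns little-endian order, so bytes[3] is the high-order byte.
// For a big-endian array, scan from index 0 instead and copy the tail.
byte[] bytes = toBytes(myInt);
int neededBytes = 4;
// Drop zero high-order bytes, but always keep at least one byte.
for (; neededBytes > 1; neededBytes--) {
    if (bytes[neededBytes - 1] != 0) {
        break;
    }
}

byte[] result = new byte[neededBytes];
// then just use array copy to copy first neededBytes to result.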
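For the big-endian layout the question asks for, the same trimming idea could look like the sketch below. It builds the 4-byte array with `ByteBuffer` instead of the HBase helper, so the byte order is known, and the class and method names (`TrimLeadingZeros`, `toTrimmedBytes`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class TrimLeadingZeros {
    static byte[] toTrimmedBytes(int value) {
        // ByteBuffer is big-endian by default, so index 0 is the high-order byte.
        byte[] bytes = ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
        int start = 0;
        // Skip zero high-order bytes, but always keep the last byte.
        while (start < bytes.length - 1 && bytes[start] == 0) {
            start++;
        }
        return Arrays.copyOfRange(bytes, start, bytes.length);
    }
}
```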

Amir Pashazadeh
0

You can start with something like this:

byte[] Convert(int i)
{  // warning: untested
  if (i == 0)
    return new byte[0];
  if (i > 0 && i < 256)
    return new byte[]{(byte)i};
  if (i > 0 && i < 256 * 256)
    return new byte[]{(byte)i, (byte)(i >> 8)};
  if (i > 0 && i < 256 * 256 * 256)
    return new byte[]{(byte)i, (byte)(i >> 8), (byte)(i >> 16)};
  return new byte[]{(byte)i, (byte)(i >> 8), (byte)(i >> 16), (byte)(i >> 24)};
}

You'll need to decide if you want to be little-endian or big-endian. Note that negative numbers are encoded in 4 bytes.
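Reading such an array back is the mirror image. This is a minimal sketch that assumes the little-endian layout used above (index 0 is the least significant byte) and treats an empty array as 0; `LittleEndianDecoder` is a made-up name.

```java
final class LittleEndianDecoder {
    // Rebuilds an int from 0 to 4 little-endian bytes; an empty array decodes to 0.
    static int decode(byte[] bytes) {
        if (bytes.length > 4) {
            throw new IllegalArgumentException("an int needs at most 4 bytes");
        }
        int value = 0;
        for (int k = 0; k < bytes.length; k++) {
            // Mask to avoid sign extension, then move the byte into position k.
            value |= (bytes[k] & 0xFF) << (k * 8);
        }
        return value;
    }
}
```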

bmm6o
0

If I understand correctly that you really, desperately want to save space, even at the expense of arcane bit shuffling: any array type is an unnecessary luxury, because its length takes at least one whole byte (an addressing space of 256) while you know that at most 4 bytes will ever be needed. So I would reserve 4 bits for the length and a sign flag, and cram the rest into that number of bytes. You might even save one more byte if your MSB is less than 128. I find the sign flag useful because it lets negative numbers be represented in fewer than 4 bytes too. Better to have the bit there every time (even for positive numbers) than to pay 4 bytes for representing -1.

Anyway, all of this is guesswork until you gather some statistics on your data set: how many integers are actually compressible, and whether the compression overhead is worth the effort.
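A quick way to gather such statistics is to count how many bytes each value would need. The sketch below assumes a plain minimal-byte encoding (1 to 4 bytes, no flag bits) and uses hypothetical names (`ByteLengthHistogram`, `histogram`).

```java
final class ByteLengthHistogram {
    // Returns counts where counts[k] is the number of values needing exactly k bytes (k = 1..4).
    // Index 0 is unused. Negative values count as 4 bytes.
    static int[] histogram(int[] values) {
        int[] counts = new int[5];
        for (int v : values) {
            int n = Math.max(1, 4 - Integer.numberOfLeadingZeros(v) / 8);
            counts[n]++;
        }
        return counts;
    }
}
```

If nearly everything lands in the 1-byte or 2-byte buckets, trimming is likely worth it; if most values need 4 bytes anyway, it is not.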

Pavel Zdenek