
I would like to do the following...

a) Compress a generated UUID to a String of length 8.

b) Decompress the compressed UUID back to the original UUID.

The reason is that I have to send the UUID to a partnering system that only accepts 8 characters for the UUID, and no, I cannot request a change to the partnering system.

So what is left to do is to compress the UUID I have to an 8-character string, and then decompress it back to the original UUID when a message comes back from the partnering system.

Any ideas?

Thanks.

rb8680
  • Java UUIDs are 128 bits, so 16 x 8 bits (16 bytes). Do you know of a compression algorithm that achieves 50% compression 100% of the time? – Jon Lin Aug 06 '12 at 02:59
  • By 8 chars, do you mean 8 bytes (8x8 bits for a total of 64 bits)? Is the limit on the total size of the data sent, is it confined by a specific data format, or is it something else? – jefflunt Aug 06 '12 at 03:14

3 Answers


What you ask is impossible for information-theoretic reasons.

UUIDs as specified by RFC 4122 are 128 bits, as are UUID objects in Java.

Java Strings store 16 bits per char, so 128 bits would nominally fit in an 8-char string. However, not every sequence of 16-bit values is a valid UTF-16 string (unpaired surrogates are not), so 8 characters of valid text can carry fewer than 128 bits of information.

So if you compress a UUID to a valid 8-character string, you have lost information, and in general there is no way to decompress it and recover the original UUID.
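To put rough numbers on that argument, here is a minimal sketch that computes how many bits fit in 8 symbols for a few alphabets. The class name `IdCapacity` and the chosen alphabet sizes are illustrative assumptions; the partnering system's actual character set is not stated in the question.

```java
final class IdCapacity {

    /** Bits of information carried by `length` symbols drawn from an alphabet of `alphabetSize` symbols. */
    static double bits(int alphabetSize, int length) {
        return length * (Math.log(alphabetSize) / Math.log(2));
    }

    public static void main(String[] args) {
        // Alphabet sizes are assumptions chosen for illustration.
        System.out.printf("8 alphanumeric chars (62 symbols):    %.1f bits%n", bits(62, 8));
        System.out.printf("8 printable ASCII chars (95 symbols): %.1f bits%n", bits(95, 8));
        System.out.printf("8 arbitrary 16-bit units:             %.1f bits%n", bits(65536, 8));
        System.out.println("A UUID needs:                         128 bits");
    }
}
```

With a 62-symbol alphanumeric alphabet, 8 characters carry under 48 bits, far short of the 128 a UUID needs. Only 8 completely unconstrained 16-bit units reach 128 bits, and those are not guaranteed to form valid text.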

What you might have intended is to generate a shorter string to use as a unique identifier. If so, see Generating 8-character only UUIDs.
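If the UUID must still be recovered when the partner's message comes back, one option is to keep the mapping on your own side instead of trying to pack the UUID into the short string. The following is only a sketch under assumptions: the class `ShortIdRegistry`, its method names, and the 62-character alphabet are hypothetical, the map is in-memory (a real system would likely persist it), and it assumes the partner echoes the 8-character ID back unchanged.

```java
import java.security.SecureRandom;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class ShortIdRegistry {
    // Assumed alphabet; adjust to whatever the partnering system accepts.
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    private final SecureRandom random = new SecureRandom();
    private final Map<String, UUID> byShortId = new ConcurrentHashMap<>();

    /** Issues a fresh 8-character ID for the UUID and remembers the mapping. */
    String register(UUID uuid) {
        while (true) {
            String candidate = randomToken(8);
            // putIfAbsent returns null only if the candidate was not already taken.
            if (byShortId.putIfAbsent(candidate, uuid) == null) {
                return candidate;
            }
        }
    }

    /** Looks up the original UUID, or returns null if the short ID is unknown. */
    UUID resolve(String shortId) {
        return byShortId.get(shortId);
    }

    private String randomToken(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

Here `register` would be called before sending to the partner and `resolve` when the reply arrives. Nothing is compressed; the short ID is just a key into your own table.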

Mechanical snail

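One way to shorten a UUID is to encode its 16 bytes in Base64. Note that this gives 24 characters (22 without the "==" padding), not 8, and that the standard alphabet used below includes '+', '/' and '=', so the output is not URL-safe as-is; Base64.getUrlEncoder() uses a URL-safe alphabet instead.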

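import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

public class UUIDUtils {

  // Packs the UUID's 128 bits into 16 bytes and Base64-encodes them (24 characters with padding).
  public static String compress(UUID uuid) {
    ByteBuffer bb = ByteBuffer.allocate(Long.BYTES * 2);
    bb.putLong(uuid.getMostSignificantBits());
    bb.putLong(uuid.getLeastSignificantBits());
    byte[] array = bb.array();
    return Base64.getEncoder().encodeToString(array);
  }

  // Reverses compress(): decodes the 16 bytes and rebuilds the UUID from its two longs.
  public static UUID decompress(String compressUUID) {
    ByteBuffer byteBuffer = ByteBuffer.wrap(Base64.getDecoder().decode(compressUUID));
    return new UUID(byteBuffer.getLong(), byteBuffer.getLong());
  }
}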

Result: 6227185c-b25b-4497-b821-ba4f8d1fb9a1 -> YicYXLJbRJe4IbpPjR+5oQ==
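For a form that is safe to put in a URL, a variant using the JDK's URL-safe Base64 alphabet without padding might look like the sketch below. The class name `UrlSafeUuidCodec` is made up for illustration; `Base64.getUrlEncoder()`, `withoutPadding()` and `Base64.getUrlDecoder()` are standard `java.util.Base64` methods. The result is 22 characters, still well over the 8 the question asks for.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.UUID;

final class UrlSafeUuidCodec {

    /** Encodes the 16 UUID bytes with the URL-safe alphabet ('-' and '_') and no '=' padding. */
    static String encode(UUID uuid) {
        ByteBuffer bb = ByteBuffer.allocate(Long.BYTES * 2);
        bb.putLong(uuid.getMostSignificantBits());
        bb.putLong(uuid.getLeastSignificantBits());
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bb.array());
    }

    /** Decodes a string produced by encode() back into the original UUID. */
    static UUID decode(String encoded) {
        byte[] bytes = Base64.getUrlDecoder().decode(encoded);
        if (bytes.length != Long.BYTES * 2) {
            throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer bb = ByteBuffer.wrap(bytes);
        return new UUID(bb.getLong(), bb.getLong());
    }
}
```

For the UUID shown in the result above, the only differences from the standard encoding should be that '+' becomes '-' and the trailing "==" is dropped.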

xjodoin
  • Are these compressed versions more vulnerable to collisions? For those who are lazy, that's 36 characters down to 24. – Slbox Jul 20 '22 at 03:50
  • No, it's not more vulnerable to collisions, because it's the same 128 bits, just encoded in Base64 – xjodoin Aug 25 '22 at 15:20

You can convert the UUID into a String that is really just a sequence of eight 16-bit char values, as follows.

static String encodeUuid(final UUID id) {
  final long hi = id.getMostSignificantBits();
  final long lo = id.getLeastSignificantBits();
  return new String(new char[] {
    (char) ((hi >>> 48) & 0xffff), (char) ((hi >>> 32) & 0xffff),
    (char) ((hi >>> 16) & 0xffff), (char) ((hi       ) & 0xffff),
    (char) ((lo >>> 48) & 0xffff), (char) ((lo >>> 32) & 0xffff),
    (char) ((lo >>> 16) & 0xffff), (char) ((lo       ) & 0xffff)
  });
}

static UUID decodeUuid(final String enc) {
  final char[] cs = enc.toCharArray();
  return new UUID(
    (long) cs[0] << 48 | (long) cs[1] << 32 | (long) cs[2] << 16 | (long) cs[3],
    (long) cs[4] << 48 | (long) cs[5] << 32 | (long) cs[6] << 16 | (long) cs[7]
  );
}

This code seems to work (try it yourself here), and the resulting string survives encoding and decoding with both UTF-8 and UTF-16 the majority of the time:

static boolean validate(final UUID id, final Charset cs) {
  final ByteBuffer buf = cs.encode(encodeUuid(id));
  final UUID _id = decodeUuid(cs.decode(buf).toString());
  return id.equals(_id);
}

public static void main(final String[] argv) {
  final UUID id = UUID.randomUUID();
  assert validate(id, StandardCharsets.UTF_8)  : "failed using utf-8";
  assert validate(id, StandardCharsets.UTF_16) : "failed using utf-16";
}

C:\dev\scrap>javac UuidTest.java

C:\dev\scrap>java -ea UuidTest

However, there is indeed the problem that some 16-bit values (0xD800 to 0xDFFF) are reserved in UTF-16 as surrogate code units. If one of the eight chars falls in that range and does not form a valid surrogate pair, the encoding will not survive transmission and you will be unable to reconstruct the original UUID. Refer to Mechanical snail's response above for more information on that.
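A minimal sketch of that failure, reusing the encodeUuid/decodeUuid logic from above. The class name `SurrogateDemo` and the helper `survivesUtf8` are invented for this illustration, and the two UUID values are hand-picked (they are not valid version 4 UUIDs) to make the behaviour easy to see.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class SurrogateDemo {

    // Same packing as encodeUuid above: eight 16-bit chunks, most significant first.
    static String encodeUuid(UUID id) {
        long hi = id.getMostSignificantBits();
        long lo = id.getLeastSignificantBits();
        return new String(new char[] {
            (char) (hi >>> 48), (char) (hi >>> 32), (char) (hi >>> 16), (char) hi,
            (char) (lo >>> 48), (char) (lo >>> 32), (char) (lo >>> 16), (char) lo
        });
    }

    static UUID decodeUuid(String enc) {
        char[] cs = enc.toCharArray();
        return new UUID(
            (long) cs[0] << 48 | (long) cs[1] << 32 | (long) cs[2] << 16 | (long) cs[3],
            (long) cs[4] << 48 | (long) cs[5] << 32 | (long) cs[6] << 16 | (long) cs[7]);
    }

    /** True if the 8-char form comes back intact after a trip through UTF-8 bytes. */
    static boolean survivesUtf8(UUID id) {
        byte[] wire = encodeUuid(id).getBytes(StandardCharsets.UTF_8);
        String back = new String(wire, StandardCharsets.UTF_8);
        return back.length() == 8 && id.equals(decodeUuid(back));
    }

    public static void main(String[] args) {
        // Every chunk is an ordinary character ('A'..'H').
        UUID harmless = new UUID(0x0041004200430044L, 0x0045004600470048L);
        // The first chunk is 0xD800, an unpaired high surrogate.
        UUID broken = new UUID(0xD800000000000000L, 0L);
        System.out.println("harmless survives: " + survivesUtf8(harmless));
        System.out.println("broken survives:   " + survivesUtf8(broken));
    }
}
```

Java's `String.getBytes` substitutes a replacement byte for the unpaired surrogate instead of throwing, so the loss is silent: the second UUID comes back as a different value.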


The only data you can consistently remove from a UUID generated via UUID.randomUUID are the 2 bits used for the variant (always 2) and the 4 bits used for the version (always 4). That still leaves 122 bits, far more than fits in 8 characters. From the UUID class documentation:

There exist different variants of these global identifiers. The methods of this class are for manipulating the Leach-Salz variant, although the constructors allow the creation of any variant of UUID (described below).

The layout of a variant 2 (Leach-Salz) UUID is as follows:

The most significant long consists of the following unsigned fields:
0xFFFFFFFF00000000 time_low
0x00000000FFFF0000 time_mid
0x000000000000F000 version
0x0000000000000FFF time_hi

The least significant long consists of the following unsigned fields:
0xC000000000000000 variant
0x3FFF000000000000 clock_seq
0x0000FFFFFFFFFFFF node

The variant field contains a value which identifies the layout of the UUID. The bit layout described above is valid only for a UUID with a variant value of 2, which indicates the Leach-Salz variant.

The version field holds a value that describes the type of this UUID. There are four different basic types of UUIDs: time-based, DCE security, name-based, and randomly generated UUIDs. These types have a version value of 1, 2, 3 and 4, respectively.
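Using the field masks quoted above, dropping and restoring the fixed version and variant bits of a random (version 4) UUID could be sketched as follows. The class `UuidV4Payload` and its method names are hypothetical, and the code assumes the input really is a version 4, variant 2 UUID. It saves only 6 of the 128 bits.

```java
import java.util.UUID;

final class UuidV4Payload {

    /** Returns { 60 payload bits from the high long, 62 payload bits from the low long }. */
    static long[] strip(UUID id) {
        if (id.version() != 4 || id.variant() != 2) {
            throw new IllegalArgumentException("only version 4, variant 2 UUIDs are supported");
        }
        long msb = id.getMostSignificantBits();
        long lsb = id.getLeastSignificantBits();
        // Keep time_low and time_mid (top 48 bits) and time_hi (low 12 bits); drop the 4 version bits.
        long hi = ((msb >>> 16) << 12) | (msb & 0xFFFL);
        // Drop the 2 variant bits at the top of the low long.
        long lo = lsb & 0x3FFFFFFFFFFFFFFFL;
        return new long[] { hi, lo };
    }

    /** Rebuilds the UUID by re-inserting version 4 and variant 2. */
    static UUID restore(long hi, long lo) {
        long msb = ((hi >>> 12) << 16) | 0x4000L | (hi & 0xFFFL);
        long lsb = lo | 0x8000000000000000L; // variant 2 is the bit pattern 10
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        long[] payload = strip(id);
        System.out.println(id + " restored equal: " + id.equals(restore(payload[0], payload[1])));
    }
}
```

The remaining 122 bits are random and cannot be compressed further, so this does not bring the UUID anywhere near 8 characters.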

obataku
  • But the resulting string might have bogus combinations of surrogates, so you can't count on it being transmitted properly. – Mechanical snail Aug 06 '12 at 03:20
  • @Mechanicalsnail I have posted example code which works. Can you give an example of where this would fail? – obataku Aug 06 '12 at 04:47
  • If any of the 16-bit blocks is in {0xFDD0, ..., 0xFDEF, 0xFFFE, 0xFFFF}, the output contains a non-character. If any is in {0xD800, ..., 0xDFFF}, the output is almost certain to contain an invalid UTF-16 sequence. In either case, in transmission all bets are off. – Mechanical snail Aug 06 '12 at 04:52
  • In particular, I think a compliant UTF-8 encoder will barf on that input. – Mechanical snail Aug 06 '12 at 04:52