2

I have three uint32_t values that, when combined, form a unique key. I have to do this 100M times or more, potentially several times a day, and store the result in a key-value database. I'd like to keep the key as small as possible, in bytes. I'm currently doing it the following way, but I'm curious whether there is a faster way to do this.

char *key = xmalloc(snprintf(NULL, 0, "%" PRIu32 "-%" PRIu32 "-%" PRIu32,num1,num2,num3) + 1);   
sprintf(key, "%" PRIu32 "-%" PRIu32 "-%" PRIu32, num1,num2,num3);
Peter D
  • I'd skip the allocation and use a fixed buffer (since 4294967295 is the maximum value for a uint32_t, you know that the maximum length of `key` is 32 characters, the length of '4294967295-4294967295-4294967295', so 33 bytes including the terminating NUL), but that obviously depends on how `key` is used later on... – AKX Apr 18 '21 at 17:01
  • Thanks for pointing that out. I forgot to mention, I'm trying to store the least amount of bytes once the key has been generated. So if I understand correctly, having a fixed buffer means that `key` would always need to be 32 bytes even for smaller uint32_t, e.g., 1-3-5, correct? – Peter D Apr 18 '21 at 17:24
  • @PeterD: No. The key is going to be as long as the string is. The database cannot know if you allocated 4 bytes or 1 GB for the key. – Yakov Galka Apr 18 '21 at 17:26

2 Answers

5
  • Converting to decimal representation is rather costly. You can get faster conversion if you use hexadecimal:

      sprintf(key, "%" PRIx32 "-%" PRIx32 "-%" PRIx32, num1, num2, num3);
    
  • As @AKX mentioned, use a fixed-size buffer. Since the string is (presumably) copied into the database, you shouldn't worry about it taking more space than necessary in the DB:

      char key[32];
      snprintf(key, sizeof(key), "%" PRIx32 "-%" PRIx32 "-%" PRIx32, num1, num2, num3);
    

    The database engine doesn't know that you over-allocated your buffer. It will allocate its own memory based on the actual length of the string rather than the size of the buffer.

  • Implement your own hexadecimal formatting. snprintf needs to parse its format string and interpret it against the argument list at runtime. This has a non-negligible overhead for tasks like yours. Instead you can write your own int32-to-hex conversion that's specialized for your task. I'd use "abcdefghijklmnop" for the digits instead of the traditional "0123456789abcdef": with those digits, each 4-bit group maps to its character by a single addition ('a' + value), with no branch or table lookup.

  • Does your key-value database require text-encoded keys? If not, you can try a binary encoding for your keys (e.g. take a look at SQLite4 varint encoding for inspiration).
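As a rough illustration of the hand-rolled hexadecimal formatting suggested above, here is a minimal sketch. The helper names `put_hex_ap` and `make_key` are made up for this example, and the code assumes an ASCII-compatible character set (so that 'a' through 'p' are contiguous). Leading zero digits are dropped to keep small values short, which is one possible choice, not something the answer prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes v in base 16 using 'a'..'p' as digits
 * (a = 0, ..., p = 15), without leading zero digits. Assumes ASCII.
 * Returns the position just past the last character written. */
static char *put_hex_ap(char *out, uint32_t v)
{
    int shift = 28;
    /* Skip leading zero nibbles, but always emit at least one digit. */
    while (shift > 0 && ((v >> shift) & 0xFu) == 0)
        shift -= 4;
    for (; shift >= 0; shift -= 4)
        *out++ = (char)('a' + ((v >> shift) & 0xFu));
    return out;
}

/* Hypothetical helper: builds "<n1>-<n2>-<n3>" into key, which must hold
 * at least 27 bytes (3 * 8 digits + 2 separators + NUL). Returns the
 * key length, not counting the terminating NUL. */
static size_t make_key(char *key, uint32_t n1, uint32_t n2, uint32_t n3)
{
    char *p = key;
    p = put_hex_ap(p, n1);
    *p++ = '-';
    p = put_hex_ap(p, n2);
    *p++ = '-';
    p = put_hex_ap(p, n3);
    *p = '\0';
    return (size_t)(p - key);
}
```

With this sketch, `make_key(key, 1, 3, 5)` yields the 5-character key `b-d-f`. Whether it actually beats `snprintf` on a given platform is something to measure rather than assume.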

Yakov Galka
  • Thanks for sharing this! I don't think the keys need to be text-encoded keys but I'd need to check. I like what SQLite 4 has done there. – Peter D Apr 18 '21 at 17:54
  • `Implement your own hexadecimal formatting.` -> Ascii85 is printable and trivial – KamilCuk Apr 19 '21 at 20:18
  • @KamilCuk: that's optimized for size, not for speed. Even base64 is likely to be slower here than hexadecimal. The OP indicated that performance is their primary concern, whereas size comes second. In either case binary encoding will beat any text representation. – Yakov Galka Apr 19 '21 at 20:46
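For the binary encoding mentioned in the answer and in the comment above, the simplest variant is a fixed 12-byte key. The sketch below is only one possibility: the names `KEY_LEN`, `put_be32` and `make_binary_key` are invented here, and it assumes the database accepts arbitrary byte strings (including zero bytes) as keys, which the OP has not confirmed.

```c
#include <stdint.h>

enum { KEY_LEN = 12 };  /* three 32-bit values, 4 bytes each */

/* Hypothetical helper: stores v in big-endian byte order, independent
 * of the host's endianness. */
static void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Hypothetical helper: packs the three numbers into a 12-byte key.
 * The key is NOT NUL-terminated and may contain zero bytes, so it has
 * to be passed to the database together with its length (KEY_LEN). */
static void make_binary_key(unsigned char *key,
                            uint32_t n1, uint32_t n2, uint32_t n3)
{
    put_be32(key, n1);
    put_be32(key + 4, n2);
    put_be32(key + 8, n3);
}
```

Big-endian order is chosen so that a bytewise comparison of two keys agrees with the numeric order of the triples. Note that a fixed 12 bytes is longer than a text key such as `1-3-5`; a variable-length scheme like the SQLite4 varint mentioned in the answer would be needed to shrink small values.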
0

If you prefer text-encoded keys, I would take Yakov's suggestion a step further (well, TWO steps) and use base64 encoding instead of hex. This way you will pack 6 bits into one character instead of only 4.

The implementation would consist of a few bit shifts plus a lookup table. I bet it will be faster than printf.
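A minimal sketch of the shifts-plus-lookup-table idea follows. It is not standard Base64 as defined in RFC 4648: it writes each number separately as radix-64 digits (6 bits per character, leading zero digits dropped) using the usual Base64 alphabet, with '-' as separator because that character is not in the alphabet. The names `DIGITS64`, `put_radix64` and `make_key64` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Lookup table: the standard Base64 alphabet, index = 6-bit value. */
static const char DIGITS64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: writes v as radix-64 digits without leading
 * zero digits. A uint32_t needs at most six digits; the top digit
 * carries only the two highest bits. Returns the position just past
 * the last character written. */
static char *put_radix64(char *out, uint32_t v)
{
    int shift = 30;
    /* Skip leading zero groups, but always emit at least one digit. */
    while (shift > 0 && ((v >> shift) & 0x3Fu) == 0)
        shift -= 6;
    for (; shift >= 0; shift -= 6)
        *out++ = DIGITS64[(v >> shift) & 0x3Fu];
    return out;
}

/* Hypothetical helper: builds "<n1>-<n2>-<n3>" into key, which must hold
 * at least 21 bytes (3 * 6 digits + 2 separators + NUL). Returns the
 * key length, not counting the terminating NUL. */
static size_t make_key64(char *key, uint32_t n1, uint32_t n2, uint32_t n3)
{
    char *p = key;
    p = put_radix64(p, n1);
    *p++ = '-';
    p = put_radix64(p, n2);
    *p++ = '-';
    p = put_radix64(p, n3);
    *p = '\0';
    return (size_t)(p - key);
}
```

The longest key this produces is 20 characters, compared with 26 for hexadecimal and 32 for decimal. Whether it is faster than the hex variant or than printf would need to be benchmarked.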

Vlad Feinstein