Short (ASCII, 7-bit per character) string storage and comparison optimization in C++

Question

In my project I'm using a huge set of short 7-bit ASCII strings and have to process (store, compare, search, etc.) them with maximum performance. Basically, I build an index array of uint64_t in which each element stores the first 9 characters of a word, and I use that element as a numeric key for any string comparison. The current implementation is fast, but maybe it's possible to improve it a bit.

This function packs up to 9 initial characters into a uint64_t value (9 x 7 = 63 bits); comparing two such numbers is equivalent to calling the standard "strcmp" function on those first 9 characters.

#include <cstddef>
#include <cstdint>
#include <iostream>

// Packs str[0..8] as 7-bit fields, with str[0] in the highest bits (56..62).
uint64_t cnv(const char* str, size_t len)
{
    uint64_t res = 0;

    switch (len)
    {
    default: // longer than 9: only the first 9 characters are used; every case falls through on purpose
    case 9: res = str[8];
    case 8: res |= uint64_t(str[7]) << 7;
    case 7: res |= uint64_t(str[6]) << 14;
    case 6: res |= uint64_t(str[5]) << 21;
    case 5: res |= uint64_t(str[4]) << 28;
    case 4: res |= uint64_t(str[3]) << 35;
    case 3: res |= uint64_t(str[2]) << 42;
    case 2: res |= uint64_t(str[1]) << 49;
    case 1: res |= uint64_t(str[0]) << 56;
    case 0: break;
    }

    return res;
}

int main()
{
    uint64_t v0 = cnv("000", 3);
    uint64_t v1 = cnv("0000000", 7);

    std::cout << (v1 < v0);
}
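
As a sanity check on the strcmp claim, here is a minimal sketch that cross-checks packed comparison against std::strncmp limited to 9 characters. The names pack9, cmp_packed and cmp_strncmp9 are hypothetical and not part of the original code; pack9 is a loop form of cnv that assumes 7-bit ASCII input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical loop form of cnv: character k lands in bits [56 - 7k, 62 - 7k];
// positions past the end of the string stay zero.
std::uint64_t pack9(const char* str, std::size_t len)
{
    assert(str != nullptr);
    std::uint64_t res = 0;
    for (std::size_t k = 0; k < len && k < 9; ++k)
        res |= (std::uint64_t(str[k]) & 0x7f) << (56 - 7 * k);
    return res;
}

// Sign (-1, 0, +1) of comparing two packed keys.
int cmp_packed(std::uint64_t a, std::uint64_t b)
{
    return (a > b) - (a < b);
}

// Sign of strncmp over the first 9 characters, used as the reference.
int cmp_strncmp9(const char* a, const char* b)
{
    const int r = std::strncmp(a, b, 9);
    return (r > 0) - (r < 0);
}
```

Note that two strings sharing the same first 9 characters pack to the same value, so longer keys still need a fallback comparison on the remaining characters.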

This (whatever it is supposed to do) would appear to minimise performance. And what's your question? — , Feb 11 '18 at 00:38
That is only a piece of code; in the real system it is part of a numeric index used to find a short string in a data set of a billion items. A strcmp call is too expensive. The question is about the "cnv" function only: how to optimize that 8-bit to 7-bit string transformation when the string is up to 9 bytes long. — Iurii Gordiienko, Feb 11 '18 at 00:50
You said the idea is to find a string in a set of billions? What do you want to return? For example, it would be useless if find(bigarray, "banana") returned the string "banana". Because if you are trying to return the index of a string, then you should build an index. A trie is a pretty good index, so is a hashtable. Basically your code looks like a hash of the first 9 characters of a string and it otherwise feels like it needs the rest of the hashtable implementation. — Wyck, Feb 11 '18 at 02:11
It looks like a hash, but it is not a hash. My value is a numeric "view" of the actual string. For example, we have a constant storage of 100 million short strings, and each string is an (ordered) key of some associated data (like std::map). We need to find the associated value for a given input key. To find a key-value pair we have to call "strcmp" inside a binary search, and strcmp is the bottleneck in my case. But if the first part of the string key is stored as in my example, it is enough to compare strings (operator< etc.) with an arithmetic operation (a few asm instructions). — Iurii Gordiienko, Feb 11 '18 at 02:31
A hash table is not an option in my case: it has lower performance because of cache misses (and there are some limitations for my particular use case). — Iurii Gordiienko, Feb 11 '18 at 02:31
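
The lookup described in these comments could be sketched roughly as follows, assuming the packed keys are kept sorted in one contiguous array and every key fits in 9 characters. PackedIndex and find_key are hypothetical names used only for illustration, and the int payload is a placeholder for the associated data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: keys holds the packed 9-character keys in ascending
// order, and values[k] is the payload that belongs to keys[k].
struct PackedIndex
{
    std::vector<std::uint64_t> keys;
    std::vector<int> values;
};

// Returns the position of key in idx.keys, or -1 if it is absent.
// Each probe of the binary search is a single integer comparison.
std::ptrdiff_t find_key(const PackedIndex& idx, std::uint64_t key)
{
    assert(idx.keys.size() == idx.values.size());
    const auto it = std::lower_bound(idx.keys.begin(), idx.keys.end(), key);
    if (it == idx.keys.end() || *it != key) return -1;
    return it - idx.keys.begin();
}
```

Keeping the keys in their own array, separate from the payload, means the search only touches densely packed 8-byte values.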

AndreyS Scherbakov · Answer 1 · 2018-02-11T02:19:51.150

You can load 8 bytes of the original string at once and then condense them inside the resulting integer (reversing their order if your machine uses a little-endian number representation).

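#include <cstdint>
#include <iostream>
#include <string>

uint64_t ascii2ulong (const char  *s, int len)
{
    // Raw 8-byte load: may be unaligned and may read past a short string (see the note after the test).
    uint64_t i = (*(uint64_t*)s);
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    // Big-endian (GCC/Clang macro): s[0] is already the top byte, so only condense.
    if (len < 8) i &= ~(~0ULL >> (len<<3));   // keep the first len bytes
    i = (i&0x007f007f007f007fULL) | ((i & 0x7f007f007f007f00ULL) >> 1);
    i = (i&0x00003fff00003fffULL) | ((i & 0x3fff00003fff0000ULL) >> 2);
    i = ((i&0x000000000fffffffULL) << 7) | ((i & 0x0fffffff00000000ULL) << (7-4));
    // Note: Previous line: an additional left shift of 7 is applied
    // to make room for s[8] character
#else
    // Little-endian (also taken when the macro is absent, e.g. MSVC): s[0] is the
    // low byte, so each step swaps the two halves while condensing.
    if (len < 8) i &= ((1ULL << (len<<3))-1); // keep the first len bytes
    i = ((i&0x007f007f007f007fULL) << 7)  | ((i & 0x7f007f007f007f00ULL) >> 8);
    i = ((i&0x00003fff00003fffULL) << 14) | ((i & 0x3fff00003fff0000ULL) >> 16);
    i = ((i&0x000000000fffffffULL) << (28+7)) | ((i & 0x0fffffff00000000ULL) >> (32-7));
#endif

    if (len > 8) i |= ((uint64_t)s[8]);
    return i;
}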


//Test
std::string ulong2str(uint64_t compressed) {
    std::string s;
    for (int i = 56; i >= 0; i-=7) 
        if (char nxt=(compressed>>i)&0x7f) s+= nxt;
    return s;
}
int main() {
    std::cout << ulong2str(ascii2ulong("ABCDEFGHI", 9))<<std::endl;
    std::cout << ulong2str(ascii2ulong("ABCDE", 5))<<std::endl;
    std::cout << (ascii2ulong("AB", 2) < ascii2ulong("B", 1))<<std::endl;
    std::cout << (ascii2ulong("AB", 2) < ascii2ulong("A", 1))<<std::endl;
    return 0;
}

But note: done this way, the load formally reads outside the allocated address range whenever the original string has fewer than 8 bytes allocated. If you run the program under a memory checker, it may report a runtime error. To avoid this you can of course use memcpy to copy only as many bytes as you need, in place of uint64_t i = (*(uint64_t*)s);:

uint64_t i = 0;                    // zero-fill so the bytes beyond len are defined
memcpy(&i, s, std::min(len, 8));   // needs <cstring> and <algorithm>

If memcpy is hardware-accelerated on your machine (which is likely), this may not cost much in terms of efficiency.
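
Putting the memcpy idea together, a complete variant might look like the sketch below. It is an assumption-laden illustration rather than a tuned implementation: pack9_memcpy, is_little_endian and byte_reverse are hypothetical names, endianness is probed at run time instead of through a macro, and the input is assumed to be 7-bit ASCII.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when the least significant byte is stored first.
static bool is_little_endian()
{
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Hypothetical helper: reverses the order of the 8 bytes of v.
static std::uint64_t byte_reverse(std::uint64_t v)
{
    v = ((v & 0x00ff00ff00ff00ffULL) << 8)  | ((v >> 8)  & 0x00ff00ff00ff00ffULL);
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    return (v << 32) | (v >> 32);
}

std::uint64_t pack9_memcpy(const char* s, std::size_t len)
{
    assert(s != nullptr);
    std::uint64_t i = 0;                               // unused bytes stay zero
    std::memcpy(&i, s, std::min<std::size_t>(len, 8)); // never reads past the string
    if (is_little_endian()) i = byte_reverse(i);       // s[0] is now the top byte

    // Condense eight 8-bit fields into eight 7-bit fields, keeping their order.
    i = (i & 0x007f007f007f007fULL) | ((i & 0x7f007f007f007f00ULL) >> 1);
    i = (i & 0x00003fff00003fffULL) | ((i & 0x3fff00003fff0000ULL) >> 2);
    i = ((i & 0x000000000fffffffULL) << 7) | ((i & 0x0fffffff00000000ULL) << 3);

    if (len > 8) i |= std::uint64_t(s[8]) & 0x7f;      // ninth character, low 7 bits
    return i;
}
```

Because the buffer is zero-initialized and only min(len, 8) bytes are copied, no length mask is needed afterwards; whether the byte reversal compiles down to a single swap instruction depends on the compiler.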

I get your idea, thank you. VS does not generate very compact code for it, but I will check it anyway... — Iurii Gordiienko, Feb 11 '18 at 02:57
