
I'm wondering how a Unicode code point lookup table is typically implemented. That is, given a character such as a, return U+24B6, or vice versa. Are there any efficient tricks so that it doesn't just boil down to:

a: U+24B6
b: ...
c: ...

That would take up a lot of file space (and memory). Maybe there is a compact way to represent it in a file (not sure if that's what this is doing), which then gets expanded into a larger structure in memory at runtime.

for x in y:
  map[x | something] = U + x + 123

Or maybe there is a way to keep it minimal even at runtime, so it is dynamically computed somehow.

Lance

1 Answer


First, if you want to map one code point to another, there's no need to map to a string like U + x + value. Simply store the code points directly in a map from char to char, where char is a type large enough to hold any Unicode code point (for example std::unordered_map<int32_t, int32_t> in C++):

map['a'] = 0x24B6;
map['x'] = 123;
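
As a minimal sketch of the same idea in Python (the names circled, lookup and reverse, and the three sample entries, are made up for illustration), the table can hold plain integers on both sides, with ord and chr converting only at the edges:

```python
# Hypothetical table: code point -> code point, both stored as plain integers.
circled = {
    ord("a"): 0x24B6,  # a -> U+24B6, as in the question
    ord("b"): 0x24B7,
    ord("c"): 0x24B8,
}

def lookup(ch):
    """Return the mapped character, or the input unchanged if it has no entry."""
    cp = ord(ch)
    return chr(circled.get(cp, cp))

# The "vice versa" direction is the same table inverted.
reverse = {target: source for source, target in circled.items()}

# str.translate accepts a dict keyed by code point, so whole strings
# can be converted without an explicit loop.
converted = "abc!".translate(circled)
```

Storing integers avoids parsing a string such as "24B6" on every lookup; whether that matters in practice depends on how large the table is and how often it is queried.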

It looks like in the iconv-lite repo above, code points are stored as strings like "8140", which is very inefficient.

This is still too broad though, because it really depends on what you want to map. Different mappings call for different ways to hash the input values (unless you use a sorted dictionary, which is more memory efficient but slower). But if you want to map a to Ⓐ, b to Ⓑ, c to Ⓒ... then a linear conversion is enough. Here's an example pseudo function that maps a-z to Ⓐ-Ⓩ (0x24B6-0x24CF) and A-Z to ⓐ-ⓩ (0x24D0-0x24E9) in the Enclosed Alphanumerics block, and 0-9 to 🄁-🄊 (0x1F101-0x1F10A) in the Enclosed Alphanumeric Supplement block:

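func map(char input)
    // a-z -> Ⓐ-Ⓩ (0x24B6-0x24CF)
    if 'a' <= input && input <= 'z':
        return input - 'a' + 'Ⓐ'
    // A-Z -> ⓐ-ⓩ (0x24D0-0x24E9)
    if 'A' <= input && input <= 'Z':
        return input - 'A' + 'ⓐ'
    // 0-9 -> 🄁-🄊 (0x1F101-0x1F10A)
    if '0' <= input && input <= '9':
        return input - '0' + '🄁'
    // anything else has no mapping
    return '\0'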

No lookup table is needed.
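
For something runnable, here is a rough Python sketch of the same offset arithmetic, assuming the three ranges above are the only ones of interest. The names to_enclosed and from_enclosed are invented for this example, and unlike the pseudo function it returns unmapped characters unchanged instead of '\0', which is a choice made here and not part of the original:

```python
def to_enclosed(ch):
    """Map a-z, A-Z and 0-9 into their enclosed ranges by offset arithmetic."""
    cp = ord(ch)
    if ord("a") <= cp <= ord("z"):
        return chr(cp - ord("a") + 0x24B6)   # a-z -> U+24B6..U+24CF
    if ord("A") <= cp <= ord("Z"):
        return chr(cp - ord("A") + 0x24D0)   # A-Z -> U+24D0..U+24E9
    if ord("0") <= cp <= ord("9"):
        return chr(cp - ord("0") + 0x1F101)  # 0-9 -> U+1F101..U+1F10A
    return ch  # assumption: leave everything else untouched

def from_enclosed(ch):
    """Inverse of to_enclosed: subtract the same offsets in the other direction."""
    cp = ord(ch)
    if 0x24B6 <= cp <= 0x24CF:
        return chr(cp - 0x24B6 + ord("a"))
    if 0x24D0 <= cp <= 0x24E9:
        return chr(cp - 0x24D0 + ord("A"))
    if 0x1F101 <= cp <= 0x1F10A:
        return chr(cp - 0x1F101 + ord("0"))
    return ch
```

Both directions cost a few comparisons and one addition, and the only data kept in memory is the six range boundaries.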

phuclv