
My task is to iterate over all the UTF-8 character codes corresponding to a given language (locale). I suppose it's not that easy, and I have to iterate over character blocks (like the whole Cyrillic block for "ru_RU", for example). I can find the character blocks on the wiki page https://en.wikipedia.org/wiki/UTF-8, but I hope there are better ways than reinventing the wheel.

I've had a look at the icu-project, but I can't figure out whether it can do what I need.

What I want to have as result is something like this:

for (unsigned int i = UBLOCK_GREEK_EXTENDED; i < UBLOCK_GREEK_EXTENDED_SIZE; i++) {
    // do stuff
}

icu-project is a very powerful tool, so I hope someone knows how to do this :)

UPDATE: I'm working on localization options for a 3D framework for mobile devices. It rasterizes and encodes TrueType fonts so they can be rendered easily by picking the required images from the rasterized font files. Since I have to care about memory usage, I want to split a rasterized font into different files for different locales (or languages, or character blocks like Cyrillic or Greek), so I don't have to keep the whole UTF-8 font in memory all the time, but can instead load the corresponding file after detecting the locale.

Thanks!

Alexander
    Why do you want to do this, aka what problem are you trying to solve? http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378 – Danack Jun 16 '13 at 14:06
  • I've added update to the post. Thank you for the answer. – Alexander Jun 16 '13 at 14:51
  • So you want a list of which characters are used by given language? – Danack Jun 16 '13 at 14:55
  • That would be the best option. Characters = character codes in UTF-8, but I suppose I can get the codes using ICU. – Alexander Jun 16 '13 at 14:59
  • Why are you (singular, as the implementer of this function) concerned about which locale your caller is using? Are you trying to perform validation at the same time as iteration? Or were you thinking you needed locale info to tell how many bytes long the characters are from their upper bits? – kfsone Jun 16 '13 at 19:12
  • I don't understand what you mean :) I need to be able to load different font resources for different languages. And I have to know which characters have to be kept in those resources, and so on. – Alexander Jun 16 '13 at 19:34
  • I've got an answer using icu library already. Thank you for the interest! – Alexander Jun 16 '13 at 19:34

4 Answers


So, I've finally found the way to do it properly using the icu-project library http://site.icu-project.org.

Here is an example solution. You specify a locale or language and get the Unicode character ranges that contain the symbols belonging to that locale/language. You can then get the start and end of each range.

// Builds a UnicodeSet pattern like "[[:Latn:]]" from the scripts that
// uscript_getCode() reports for the given locale, then lists the ranges.
#include <stdio.h>
#include <unicode/uscript.h>
#include <unicode/unistr.h>
#include <unicode/uniset.h>

using namespace icu;

UErrorCode err = U_ZERO_ERROR;
const int32_t capacity = 10;
UScriptCode script[10] = {USCRIPT_INVALID_CODE};

// Fill 'script' with the script codes used by the locale (here "en").
int32_t num = uscript_getCode("en", script, capacity, &err);

// Build the set pattern. Juxtaposition inside the outer brackets unions
// the per-script sets, e.g. "[[:Hira:][:Kana:]]" for a multi-script locale.
UnicodeString pattern("[", 1, US_INV);
for (int32_t j = 0; j < num; j++) {
    // ISO 15924 short names are always four letters, e.g. "Latn".
    const char* shortname = uscript_getShortName(script[j]);
    pattern.append("[:");
    pattern.append(UnicodeString(shortname, 4, US_INV));
    pattern.append(":]");
}
pattern.append("]");

UnicodeSet cnvSet(pattern, err);
printf("Number of script code associated are : %d \n", num);
printf("Range count: %d\n", cnvSet.getRangeCount());
printf("Set size: %d\n", cnvSet.size());
for (int32_t i = 0; i < cnvSet.getRangeCount(); i++) {
    printf("Range start: %x\n", cnvSet.getRangeStart(i));
    printf("Range end: %x\n", cnvSet.getRangeEnd(i));
}

Results for language "en" from this example:

Number of script code associated are : 1
Range count: 30
Set size: 1272
Range start: 41
Range end: 5a
Range start: 61
Range end: 7a
...
Range start: ff41
Range end: ff5a

These are the character ranges that together make up the Latin script.

Alexander

It isn't exactly clear what you mean. Although there are sections of the Unicode mapping aimed specifically at some languages - e.g. Greek, as you say - many languages have their characters split over a number of different areas. Most European languages, for instance, use the ASCII letters A-Z and also selected characters from the "extended Latin-1" set in the 160-240 area.

So any tool to "iterate over", say, Romanian will have to first decide which characters the Romanian ones are, then identify them in Unicode, then print them.

If you don't mean that at all, but rather want to print out specific groupings of characters, I would suggest you consider using UTF-32 as your base encoding, in which iterating and printing characters will be much easier.

rivimey

The Unicode block ranges are listed here, so you'll be able to split most of the characters out into files of their own.

You'll need to list which characters are available in each rendered font file, and then load the appropriate font files for the characters in each string that is rendered.

However - doing this dynamically may not be a great idea as it could be slow (checking each character) as well as prone to failure when characters slip in that aren't in any character set.

You may be better off doing it the other way round; when someone initialises your engine they list which language blocks you should load, and load the appropriate files. Then when you render strings, just drop any character that isn't currently available.

Danack
  • That is exactly what I want to do. The question is whether I can use some library, like ICU, to get the character codes for a requested language. I could use the document you've mentioned and define over 9000 constants... But I'd prefer to use a library :) How I load fonts and detect which one should be used at runtime is not a problem. – Alexander Jun 16 '13 at 15:10
  • "I can use the document...and define over 9000 constants" Erm, how about defining them as ranges? Cyrillic = U+0400 -> U+04FF – Danack Jun 16 '13 at 15:12
  • I understand what you mean and I agree. Maybe you are right and it's not that much. – Alexander Jun 16 '13 at 15:14
  • Yeah - there's only 220 blocks in total listed there, and most of them are irrelevant to you e.g. domino blocks http://www.fileformat.info/info/unicode/block/domino_tiles/images.htm – Danack Jun 16 '13 at 15:19
  • Now I understand the question better, I would advise against this approach. While you may be able to say "my device is UK so I don't need Cyrillic", for example, there are today many cases of oddball characters being required, unless you totally control the text to print. For example, consider a French user who needs accented letters for their name but lives in, and uses the device in, English. Better, I would suggest, to implement a decent font cache that reads in only the characters that are used. By all means pre-fill the cache with characters for the expected locale... – rivimey Jun 16 '13 at 15:26

The characters actually used in a language can be found in the exemplar sets, defined in CLDR.

Instead of building up a complex UnicodeSet, I would just iterate over U+0000…U+10FFFF and test the script returned by uscript_getScript(UChar32 codepoint, UErrorCode *err) - the UnicodeSet does about the same internally for the sample code you gave as your answer.

Steven R. Loomis