
I have a list of Turkish words. I need to compare their lengths. But since some Turkish characters are non-ASCII, I can't compare their lengths correctly: each non-ASCII Turkish character takes 2 bytes.

For example:

#include <stdio.h>
#include <string.h>

int main()
{
    char s1[] = "ab";
    char s2[] = "çş";

    printf("%d\n", strlen(s1)); // it prints 2
    printf("%d\n", strlen(s2)); // it prints 4

    return 0;
}

My friend said it's possible to do that in Windows with the line of code below:

system("chcp 1254");

He said that it maps the Turkish characters onto the extended ASCII table. However, it doesn't work on Linux.

Is there a way to do that in Linux?

Atreidex
    It all depends on the encoding you are using. If you use UTF-8 (which is the norm on Linux), determining the number of code points encoded in a string is not terribly complicated; [here](https://stackoverflow.com/a/44998716/214671) are the basics (it's C++, but the core of the matter should be clear enough). – Matteo Italia Dec 02 '17 at 11:49
  • 3
    It depends on the encoding of your Turkish characters as to how many bytes they occupy. Ideally, you would be using UTF-8 encoding, which it probably is already, but is **variable** length! cp1254 on the other hand is an 8bit (1byte) character set and is incompatible with UTF-8. (And there's no such thing as "extended" ASCII). – Alastair McCormack Dec 02 '17 at 11:49
  • 2
    Promote it to utf-16, normalize it to the NFC form, then count the 2-byte characters. This will be sufficient for most alphabets. – Dragonthoughts Dec 02 '17 at 11:49
  • 1
    Here is a great article about this topic that you should read : https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ – vmonteco Dec 02 '17 at 12:11
  • @AlastairMcCormack Not only is there such a thing as Extended ASCII there are several such things ... there is no single thing which is perhaps what you meant... – Ben Dec 02 '17 at 12:12
  • 1
    @Ben true - it's a misnomer that means everything and nothing at the same time. What I meant is that there's no standard called "Extended ASCII". – Alastair McCormack Dec 02 '17 at 12:25
  • 3
    `chcp 1254` sets the code page to Turkish on Windows console, telling that the higher characters on *one* byte (~ 0xA0-FF) have to be interpreted as Turkish (like the glyphs you see). Here you're on Linux, and the characters are utf8 encoded, counting them is pretty easy with the method given by @MatteoItalia above. – Déjà vu Dec 02 '17 at 13:00
  • 1
    Curious: Why use `"%d"` with `strlen(s1)`? Other choices include `"%u"`, `"%zu"`, ... – chux - Reinstate Monica Dec 02 '17 at 19:36

2 Answers


It's 2017 and soon 2018. So use UTF-8 everywhere (on recent Linux distributions, UTF-8 is the most common encoding, for most locale(7)-s, and certainly the default on your system); of course, a Unicode character coded in UTF-8 may take one to four bytes, so the number of Unicode characters in some UTF-8 string is not given by strlen. Consider using some UTF-8 library, like libunistring (or others, e.g. in Glib).

The chcp 1254 thing is Windows-specific and irrelevant on UTF-8 systems. So forget about it.

If you code a GUI application, use a widget toolkit like GTK or Qt. Both handle Unicode and can accept (or convert to) UTF-8. Notice that even simply displaying Unicode (e.g. some UTF-8 or UTF-16 string) is non-trivial, because a string could mix e.g. Arabic, Japanese, Cyrillic and English words (which need to be displayed in both left-to-right and right-to-left directions), so better find a library (or other tool, e.g. a UTF-8 capable terminal emulator) to do that.

If you happen to get some file, you need to know the encoding it is using (that is only a convention, which you need to find out and follow). In some cases, the file(1) command might help you guess that encoding. If the file is not UTF-8 encoded, you can convert it (provided you know the source encoding), perhaps with the iconv(1) command.

Basile Starynkevitch

One possibility could be to use wide-character strings to store the words. A wchar_t does not store a character in one byte, but it solves your main problem: there is a set of standard functions (wcslen and friends from &lt;wchar.h&gt;) that work per character instead of per byte. The program would look like the following:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main()
{
    wchar_t s1[] = L"ab";
    wchar_t s2[] = L"çş";

    printf("%zu\n", wcslen(s1)); // it prints 2
    printf("%zu\n", wcslen(s2)); // it prints 2

    return 0;
}
Marian