c-file encoded in utf-16 is not read properly by gcc

Question

While doing some encoding tests, I saved a C file with the encoding 'UTF-16 LE' (using Sublime Text).

The c file contains the following:

#include <stdio.h>

int main(void) {
    const char *letter = "é";
    printf("%s\n", letter);
    return 0;
}

Compiling this file with gcc returns the error:

test.c:1:3: error: invalid preprocessing directive #i; did you mean #if?
    1 | # i n c l u d e   < s t d i o . h >

It's as if gcc inserted a space before each character when reading the c-file.

My question is: can we give gcc C files encoded in something other than UTF-8? And why was gcc not able to detect the encoding of my file and read it properly?

"*It's as if gcc inserted a space before each character when reading the c-file*" - you created a source file encoded in UTF-16, which uses 2-byte character units. gcc read it assuming 1-byte characters. The "spaces" are the high bytes being 0x00 due to characters being in the ASCII range. Don't use UTF-16 for source files (at least without a BOM in front), most compilers can't handle it. "*Why it was not possible for gcc to detect the encoding of my file and read it properly ?*" - without a BOM, do you realize how difficult that really is, given the hundreds of charsets used in the world? — Remy Lebeau, Dec 04 '20 at 01:25
@RemyLebeau The file is utf-16 Little Endian, it contains the corresponding BOM — Ayoub Omari, Dec 04 '20 at 08:23
even with a BOM, that's no guarantee a compiler will support Unicode encoded source files. Check the compiler's documentation. In this case, see [g++ compiling sources in UTF-16 encoding](https://stackoverflow.com/questions/19617203/) and [How should I use g++'s -finput-charset compiler option correctly in order to compile a non-UTF-8 source file?](https://stackoverflow.com/questions/10345802/) — Remy Lebeau, Dec 04 '20 at 15:36
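
To make the first comment above concrete, here is a minimal sketch in C of what a UTF-16 LE file looks like byte by byte. The helper name `ascii_to_utf16le` is made up for this illustration and only handles ASCII input; it says nothing about how gcc or any editor works internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): write the UTF-16 LE encoding of
   an ASCII string into out, preceded by the BOM (FF FE).
   Returns the number of bytes written, or 0 if out is too small. */
size_t ascii_to_utf16le(const char *ascii, unsigned char *out, size_t out_size)
{
    assert(ascii != NULL && out != NULL);

    size_t len = strlen(ascii);
    if (out_size < 2 + 2 * len) {
        return 0;
    }

    size_t n = 0;
    out[n++] = 0xFF;                 /* BOM, low byte first */
    out[n++] = 0xFE;
    for (size_t i = 0; i < len; i++) {
        out[n++] = (unsigned char)ascii[i];  /* low byte: the ASCII code */
        out[n++] = 0x00;                     /* high byte: always zero for ASCII */
    }
    return n;
}
```

Calling it on "#i" yields the six bytes FF FE 23 00 69 00: the BOM, then each ASCII character followed by a zero byte. A reader that assumes one byte per character sees those zero bytes as separate characters, which is what the comment identifies as the "spaces" in the error message.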

score 2 · Accepted Answer · answered Dec 03 '20 at 16:52

Because it is a design choice.

From the GNU CPP manual, section "Character sets":

At present, GNU CPP does not implement conversion from arbitrary file encodings to the source character set. Use of any encoding other than plain ASCII or UTF-8, except in comments, will cause errors. Use of encodings that are not strict supersets of ASCII, such as Shift JIS, may cause errors even if non-ASCII characters appear only in comments. We plan to fix this in the near future.

GCC was created to build GNU, so it comes from the Unix world, where UTF-16 is not an accepted encoding for ordinary text files. GNU also passes source files between different programs (e.g. CPP, the preprocessor, and GCC, the compiler).

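But also, who uses UTF-16 for source files? It is especially awkward for C, where a \0 byte terminates a string and UTF-16 text in the ASCII range is full of them. Note too that the encoding of the source code has nothing to do with the encodings the program uses at run time (the default locale for reading files, printing strings, etc.).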

If this causes problems, just use a pre-preprocessor (which is not so uncommon) to convert your source into something gcc can read. The conversion stays hidden from you, so you can continue editing in UTF-16.
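
As a rough sketch of what such a conversion step could do, the function below turns a UTF-16 LE buffer into UTF-8. The name `utf16le_to_utf8` and its error convention are assumptions made for this example; it is not part of gcc or any library, and a real build would more likely call an existing tool such as iconv before invoking the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): convert UTF-16 LE bytes to UTF-8.
   A leading BOM (FF FE) is skipped. Returns the number of bytes written to
   out, or (size_t)-1 on malformed input or if out is too small. */
size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);

    if (in_len % 2 != 0) {
        return (size_t)-1;           /* UTF-16 needs whole 2-byte units */
    }

    size_t i = 0, n = 0;
    if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE) {
        i = 2;                       /* skip the BOM */
    }

    while (i < in_len) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            if (i >= in_len) {
                return (size_t)-1;                   /* pair is cut off */
            }
            uint32_t lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF) {
                return (size_t)-1;
            }
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                       /* lone low surrogate */
        }

        /* Number of UTF-8 bytes this code point needs. */
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out_size - n < need) {
            return (size_t)-1;
        }

        if (need == 1) {
            out[n++] = (unsigned char)cp;
        } else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

For example, the UTF-16 LE bytes FF FE E9 00 (a BOM followed by "é") come out as the two UTF-8 bytes C3 A9, which is an encoding the quoted manual says GNU CPP accepts.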
