
I ported an application from Windows to Linux and ran into a character-encoding problem: accented letters (e.g. 'é', 'à') are treated as wchar_t (4 bytes with g++), whereas under Visual Studio they take 1 byte (char). My unit tests fail because my code compares characters against accented letters, which are multibyte on Linux.

Is it possible to cast accented letters (like 'û') to the Windows encoding (1 byte) in Linux or should I refactor my code and use std::wstring instead?

dda
Aminos
  • `wchar_t` is 2 bytes (UTF-16) on Windows, 4 bytes (UTF-32) on other systems. If the data really is using 1 byte characters, then it is using `char` instead, and as such is subject to charset codepage handling. To port that to Linux, you should re-encode the data to a UTF encoding (8, 16, or 32) and then use portable Unicode comparisons. – Remy Lebeau Dec 19 '16 at 22:37
  • A french accented letter size is one byte (char) on my Windows system, that's why in my code I was able to use them in character comparisons (e.g. if (strMystring[i] == 'é') ...) Linux text editors messed with the original encoding, and now my unit tests are failing, so I am obliged to find a suitable portable solution. – Aminos Dec 19 '16 at 22:41
  • `if (strMystring[i] == 'é')` would only work if `strMystring` is a latin-encoded 8bit string and the source file itself is encoded in latin to match. That is not a very portable setup. I would definitely not recommend comparing non-ASCII data using narrow strings, since that is locale-specific, unless you are using UTF-8. Use wide strings instead. – Remy Lebeau Dec 19 '16 at 22:43
  • Yes, you need to refactor your code. There's a reason wide characters were invented, and it's because 8 bit characters are really only good for ASCII. Even though other characters might seem to work, they won't on other systems, even different versions of Windows! – Mark Ransom Dec 19 '16 at 22:45
  • As sizeof('é') = 4 under Ubuntu and sizeof('é') = 1 under Windows, how can I write a portable solution to perform character comparisons like if (strMystring[i] == 'é')? – Aminos Dec 19 '16 at 22:46

1 Answer


If 'é' fits in a single char on Windows, your application was probably compiled without UNICODE, and its source and data are most likely in a single-byte code page such as Windows-1252.

With the usual UTF-8 encoding on Linux, 'é' takes 2 bytes (0xC3 0xA9), so it does not fit in a single char. g++ then treats 'é' as a multi-character constant of type int, which is why sizeof('é') is 4 there, and it should emit a warning. If you used the resulting value anyway, each char of the string would hold only one byte of the encoding, so a char-by-char comparison would be meaningless.
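To make this concrete, here is a minimal sketch. It assumes the string data is UTF-8 on Linux and Windows-1252 (or Latin-1) on Windows. The helper names `utf8_e_acute`, `count_byte` and `check_utf8_examples` are made up for illustration, and byte escapes are used so the result does not depend on how the source file itself is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
inline std::string utf8_e_acute() { return "\xC3\xA9"; }

// Hypothetical helper: counts the chars in `s` that equal the single byte `b`.
inline std::size_t count_byte(const std::string& s, unsigned char b) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if (c == b) ++n;
    }
    return n;
}

// Small self-check showing why the byte-wise comparison breaks under UTF-8.
inline void check_utf8_examples() {
    // In UTF-8, "é" occupies two chars, not one.
    assert(utf8_e_acute().size() == 2);
    // "café" in Windows-1252: the single byte 0xE9 is found once.
    assert(count_byte("caf\xE9", 0xE9) == 1);
    // "café" in UTF-8: no single char ever equals 0xE9.
    assert(count_byte("caf\xC3\xA9", 0xE9) == 0);
}
```

The same loop that finds 'é' in Windows-1252 data finds nothing in UTF-8 data, which is consistent with the failing unit tests described in the question.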

If you want to keep your algorithms, which work on individual characters of a string, you'd better work with wchar_t and wstring (or, even more portable, char32_t and u32string).
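As a rough sketch of the char32_t route, assuming the text has already been converted to UTF-32 and uses the precomposed form of 'é' (U+00E9, not 'e' followed by a combining accent), a per-character comparison could look like this. The names `count_code_point` and `check_utf32_examples` are hypothetical, and the `\u00E9` escapes keep the literals independent of the source file encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts occurrences of the code point `cp` in UTF-32 text.
inline std::size_t count_code_point(const std::u32string& text, char32_t cp) {
    std::size_t n = 0;
    for (char32_t c : text) {
        if (c == cp) ++n;  // one element per code point, so this is a whole-character test
    }
    return n;
}

// Small self-check: U'\u00E9' is a single char32_t on every platform.
inline void check_utf32_examples() {
    assert(count_code_point(U"\u00E9t\u00E9", U'\u00E9') == 2);  // "été"
    assert(count_code_point(U"cafe", U'\u00E9') == 0);
}
```

Because char32_t has the same width on Windows and Linux, a comparison such as `text[i] == U'\u00E9'` behaves the same on both, whereas wchar_t is 2 bytes on Windows and 4 bytes elsewhere.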

If you want to know more about character encoding and Unicode in C++, I warmly recommend the excellent video tutorial on Unicode with C++ by James McNellis.

Christophe