
I am using Delphi 6.

I want to decode a Portuguese UTF-8 encoded string to a WideString, but I found that it isn't decoding correctly.

The original text is "ANÁLISE8". After using UTF8Decode(), the result is "ANALISE8". The symbol on top of the "A" disappears.

Here is the code:

var
  f : textfile;
  s : UTF8String;
  w, test : WideString;    
begin
  while not eof(f) do
  begin
    readln(f,s);
    w := UTF8Decode(s);

How can I decode the Portuguese UTF-8 string to WideString correctly?

Dalija Prasnikar
John Ken
  • Use MultiByteToWideChar – David Heffernan Oct 04 '17 at 09:31
  • Chances are your file isn't written in UTF-8. Files written in UTF-8 typically have the 3-byte [byte-order-mark sequence](https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8) up front, and if a file doesn't it's safe to assume it's using the system's default Ansi codepage. In that case storing the data in a `UTF8String` doesn't _make_ it UTF-8... – Stijn Sanders Oct 04 '17 at 16:43
  • How did you determine that your code doesn't work. I bet that you converted the WideString to ANSI. – David Heffernan Oct 04 '17 at 20:25
  • @StijnSanders: "*Files written in UTF-8 typically have the 3-byte byte-order-mark sequence up front*" - actually, they don't, because the Unicode and UTF-8 specs *discourage* people from using a BOM with UTF-8 encoded files, for backwards compatibility with ASCII text files and legacy apps that don't know how to handle BOMs. *CAN* a BOM exist in a UTF-8 file? Yes. *DOES* a BOM exist in a UTF-8 file? Usually not, most of the time. – Remy Lebeau Oct 04 '17 at 20:51
  • Strange. **Every** file I have come across that is in UTF-8 had a BOM, and attempts of mine to do without caused trouble down the road... But I guess anyone's mileage may vary. – Stijn Sanders Oct 05 '17 at 05:39

1 Answer


Note that the implementation of UTF8Decode() in Delphi 6 is incomplete. Specifically, it does not support 4-byte encoded sequences, which are needed for Unicode codepoints above U+FFFF. This means UTF8Decode() can only decode codepoints in the UCS-2 range, not the full Unicode repertoire, which makes it basically useless in Delphi 6 (and all the way up to Delphi 2007; it was finally fixed in Delphi 2009).

Try using the Win32 MultiByteToWideChar() function instead, eg:

uses
  ..., Windows;

function MyUTF8Decode(const s: UTF8String): WideString;
var
  Len: Integer;
begin
  // First call: ask how many UTF-16 code units the decoded text needs.
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), nil, 0);
  SetLength(Result, Len);
  // Second call: decode into the buffer that was just allocated.
  if Len > 0 then
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), PWideChar(Result), Len);
end;

var
  f : textfile;
  s : UTF8String;
  w, test : WideString;
begin
  while not eof(f) do
  begin
    readln(f,s);
    w := MyUTF8Decode(s);

That being said, your ANÁLISE8 string falls within the UCS-2 range, so I tested UTF8Decode() in Delphi 6 and it decoded the UTF-8 encoded form of ANÁLISE8 just fine. I would conclude that either:

  • your UTF8String variable DOES NOT contain the UTF-8 encoded form of ANÁLISE8 to begin with (byte sequence 41 4E C3 81 4C 49 53 45 38), but contains the ASCII string ANALISE8 instead (byte sequence 41 4E 41 4C 49 53 45 38), which would decode as-is since ASCII is a subset of UTF-8. Double check your file, and the output of Readln().

  • your WideString contains ANÁLISE8 correctly as expected, but the way you are outputting/debugging it (which you did not show) is converting it to ANSI, losing the Á during the conversion.
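
Both possibilities can be checked in code by dumping the raw values before and after decoding. The following is a minimal sketch, not code from the question: the program name `HexCheck` and the helpers `BytesToHex` and `WideToHex` are hypothetical, and the input bytes are hard-coded so that the result does not depend on the file or on the source file's encoding.

```pascal
program HexCheck;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // UTF-8 encoded form of ANÁLISE8 (byte sequence quoted above).
  Utf8Bytes: array[0..8] of Byte =
    ($41, $4E, $C3, $81, $4C, $49, $53, $45, $38);

// Hypothetical helper: hex dump of the raw bytes held in a UTF8String.
function BytesToHex(const s: UTF8String): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

// Hypothetical helper: hex dump of the UTF-16 code units in a WideString.
function WideToHex(const w: WideString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(w) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(w[i]), 4);
  end;
end;

var
  s: UTF8String;
  w: WideString;
begin
  // Copy the raw bytes into the string without any codepage conversion.
  SetString(s, PAnsiChar(@Utf8Bytes[0]), SizeOf(Utf8Bytes));
  w := UTF8Decode(s);

  Writeln(BytesToHex(s)); // should print 41 4E C3 81 4C 49 53 45 38
  Writeln(WideToHex(w));  // should print 0041 004E 00C1 004C 0049 0053 0045 0038
end.
```

In real code, call `BytesToHex(s)` right after `Readln(f, s)` and `WideToHex(w)` right after decoding. If both dumps match the sequences above, the decoding is fine and the Á is being lost later. In Delphi 6, `string` is an AnsiString, so assigning the WideString to a `string`, or passing it to a routine that takes one (such as `ShowMessage()`), converts it to the system ANSI codepage. A wide API such as `MessageBoxW()`, which takes a `PWideChar`, avoids that conversion.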

Remy Lebeau
  • Hi, thanks very much. I converted the UTF-8 data to hex and see the byte sequence 41 4E C3 81 4C 49 53 45 38, so I think the file must be in UTF-8 format ... Thank you for the function MyUTF8Decode. I used it instead of UTF8Decode and the result is the same: the Portuguese char turns into an English one – John Ken Oct 06 '17 at 08:43
  • @JohnKen then the problem must be related to whatever you are doing with the data after it has been decoded. Whatever you are doing with it is converting it to ANSI/ASCII. – Remy Lebeau Oct 06 '17 at 08:51
  • Then how can I convert it to a UTF8String :P ..... or store it as a UTF-8 WideString? – John Ken Oct 06 '17 at 08:53
  • @JohnKen you verified the **file** contains the correct UTF-8 bytes. Did you verify (with the debugger) that the `UTF8String` contains the same bytes? If so, then `Readln()` is not converting the data, so did you verify (with the debugger) that the `WideString` contains the correct decoded chars? `0041 004E 00C1 004C 0049 0053 0045 0038`. If so, then the `WideString` is fine (`MultiByteToWideChar()` is not broken). What you do with the `WideString` **after decoding it** is what is converting the data to ANSI/ASCII. You have yet to show any of THAT code. – Remy Lebeau Oct 06 '17 at 17:14