Storing UTF-8 string in a UnicodeString

Question

In Delphi 2007 you can store a UTF-8 string in a WideString and then pass it on to a Win32 function, e.g.:

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

Delphi 2007 does not interfere with the contents of UTF8Str, i.e. it is left as a UTF-8 encoded string stored in a WideString.

But in Delphi 2010 I'm struggling to find a way to do the same thing, i.e. store a UTF-8 encoded string in a WideString without it being automatically converted from UTF-8. I cannot pass a pointer to a UTF-8 string (or RawByteString), e.g. the following will obviously not work:

var
  UnicodeStr: WideString;
  UTF8Str: UTF8String;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

FYI, the original 2007 code DOES interfere with the UTF-8 data. In 2007, `UTF8Encode()` returned a UTF-8 encoded `AnsiString`. In every version, assigning an `AnsiString` to a `WideString` performs an Ansi->UTF16 conversion using the OS default Ansi codepage. The final `WideString` DOES NOT contain UTF-8 data. It contains UTF-16 data. The conversion has no concept that UTF-8 was present, and thus will likely corrupt the data if the original input uses any non-ASCII characters. – Remy Lebeau, Jul 07 '12 at 08:39

score 13 · Accepted Answer · answered Apr 23 '10 at 13:13

Your original Delphi 2007 code was converting the UTF-8 string to a WideString using the ANSI codepage. To do the same thing in Delphi 2010, use SetCodePage with the Convert parameter set to False:

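var
  UnicodeStr: UnicodeString;
  UTF8Str: RawByteString;
begin
  UTF8Str := UTF8Encode('some unicode text');
  // Relabel the bytes as the default ANSI codepage (0) without converting them
  SetCodePage(UTF8Str, 0, False);
  // This assignment now performs the same ANSI -> UTF-16 conversion as Delphi 2007
  UnicodeStr := UTF8Str;
  Windows.SomeFunction(PWideChar(UnicodeStr), ...)
end;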
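
To see what this relabel-then-widen sequence does to the bytes, here is a minimal sketch for Delphi 2009 or later. It is not part of the original answer: the helper name `WidenUtf8BytesAs` is hypothetical, and it passes code page 1252 explicitly (instead of 0) only so that the result does not depend on the machine's default ANSI code page.

```pascal
program ReinterpretUtf8;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: encode S as UTF-8, relabel the bytes as CodePage
// without converting them, then let the compiler widen them to UTF-16.
function WidenUtf8BytesAs(const S: UnicodeString; CodePage: Word): UnicodeString;
var
  Raw: RawByteString;
begin
  Raw := UTF8Encode(S);
  SetCodePage(Raw, CodePage, False); // changes the label only, not the bytes
  Result := UnicodeString(Raw);      // code page -> UTF-16 conversion happens here
end;

var
  W: UnicodeString;
begin
  // U+00FC ("ü") is C3 BC in UTF-8; read as code page 1252 those bytes are two characters.
  W := WidenUtf8BytesAs(#$00FC, 1252);
  Writeln(Length(W));                                            // 2
  Writeln(IntToHex(Ord(W[1]), 4), ' ', IntToHex(Ord(W[2]), 4));  // 00C3 00BC
end.
```

Each UTF-8 byte ends up as its own UTF-16 character, which is the behaviour the old settings files apparently rely on. For pure ASCII input the result is identical to the original text.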

answered Apr 23 '10 at 13:13 by Zoë Peterson

Nice. Didn't know about that :) – Runner Apr 23 '10 at 13:23

score 3 · Answer 2 · edited Jan 27 '16 at 16:21

Hmm, why are you doing that? Why are you encoding a WideString to UTF-8 just to store it back in a WideString? You are obviously using a Unicode version of the Windows API, so there is no need to use a UTF-8-encoded string. Or am I missing something?

Windows API functions are either Unicode (two bytes per character) or ANSI (one byte per character). UTF-8 would be the wrong choice here, because it uses one byte for ASCII characters but two or more bytes for characters above the ASCII range.

Otherwise the equivalent for your old code in Unicode Delphi would be:

var
  UnicodeStr: string;
  UTF8Str: string;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

WideString and string (UnicodeString) are similar, but the new UnicodeString is faster because it is reference-counted and WideString is not.

Your code was not correct because a UTF-8 string has a variable number of bytes per character. "A" is stored as one byte, just its ASCII code. "ü", on the other hand, is stored as two bytes. And because you are then using PWideChar, the function always expects two bytes per character.
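
The byte counts mentioned above can be checked with a small sketch like the following, assuming Delphi 2009 or later, where `UTF8Encode` returns a `RawByteString` whose `Length` is a byte count. The helper name `Utf8ByteLength` is made up for illustration.

```pascal
program Utf8ByteCount;
{$APPTYPE CONSOLE}

// Hypothetical helper: number of bytes the UTF-8 encoding of S occupies.
function Utf8ByteLength(const S: UnicodeString): Integer;
begin
  // Length of a RawByteString counts bytes, not characters.
  Result := Length(UTF8Encode(S));
end;

begin
  Writeln(Utf8ByteLength('A'));     // 1: plain ASCII
  Writeln(Utf8ByteLength(#$00FC));  // 2: "ü" (U+00FC) needs two bytes
end.
```

The character is written as `#$00FC` rather than as a literal so the sketch does not depend on the encoding of the source file.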

There is another difference. In older Delphi versions (ANSI) Utf8String was just an AnsiString. In Unicode versions of Delphi Utf8String is a string with a UTF-8 code page behind it. So it behaves differently.

The old code would still work correctly:

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

It would act the same as it did in Delphi 2007. So maybe you have a problem elsewhere.

Mick, you are correct. The compiler does some extra work behind the scenes. To avoid this you can do something like this:

var
  UTF8Str: AnsiString;
  UnicodeStr: WideString;
  TempString: RawByteString;
  ResultString: WideString;
begin
  UnicodeStr := 'some unicode text';
  TempString := UTF8Encode(UnicodeStr);
  SetLength(UTF8Str, Length(TempString));
  Move(TempString[1], UTF8Str[1], Length(UTF8Str));
  ResultString := UTF8Str;
end;

I checked, and it works just the same. Because I move the bytes directly in memory, no codepage conversion is done in the background. I am sure it can be done with greater elegance, but the point is that this is the way to achieve what you want.

Yes, there is a codepage conversion done, on the very last line when assigning the temp `AnsiString` to the final `WideString`. The same was true in the original D2007 code. But on a side note, you can avoid the temp `AnsiString` by using `SetCodePage()` on the `RawByteString`, then you can assign the `RawByteString` to the `WideString`. — Remy Lebeau, Jul 07 '12 at 22:37

score 0 · Answer 3 · edited Jan 27 '16 at 16:16

Which Windows API call wants you to pass a UTF-8 string? It is either an ANSI string or a WideString (A or W functions). WideStrings have two bytes per character, and UTF-8 strings have one (or more if you go beyond the first 128 ASCII characters).

UTF-8 in a WideString just doesn't make sense. If there really is a Windows function that wants a pointer to a UTF-8 string, you probably have to cast it to a PAnsiChar.
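
As an illustration of that last point, `MultiByteToWideChar` called with `CP_UTF8` is one Windows function that does take a pointer to UTF-8 bytes, and it takes it as a `PAnsiChar`. The following is only a sketch for Delphi 2009 or later on Windows. The helper name `DecodeUtf8` is hypothetical and error handling is omitted.

```pascal
program Utf8ToApi;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: decode UTF-8 bytes with MultiByteToWideChar,
// passing the UTF-8 data as a PAnsiChar.
function DecodeUtf8(const Bytes: UTF8String): UnicodeString;
var
  Len: Integer;
begin
  Result := '';
  if Bytes = '' then
    Exit;
  // First call: ask how many UTF-16 code units are needed.
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(Bytes), Length(Bytes), nil, 0);
  SetLength(Result, Len);
  // Second call: perform the conversion into the allocated buffer.
  MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(Bytes), Length(Bytes),
    PWideChar(Result), Len);
end;

var
  Source: UnicodeString;
  Utf8: UTF8String;
begin
  Source := #$00FC;             // "ü"
  Utf8 := UTF8Encode(Source);
  Writeln(Length(Utf8));              // 2 bytes of UTF-8
  Writeln(Length(DecodeUtf8(Utf8)));  // 1 UTF-16 character
end.
```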

edited Jan 27 '16 at 16:16 by Peter Mortensen
answered Apr 23 '10 at 11:12 by The_Fox

It's some (broken) legacy code using INI files. So the section, for example, is being passed as a UTF8 string. I know this is wrong, but I need to keep it like that to import old settings files. If I pass Unicode for the section name then it won't match. I cannot use the ANSI versions because the filename is Unicode. – Mick Apr 23 '10 at 11:16
