This unit test runs successfully with Free Pascal 3.0 in Delphi mode:

procedure TFreePascalTests.TestUTF8Decode;
var
  Raw: RawByteString;
  Actual: string;
begin
  Raw := UTF8Encode('关于汉语');

  Actual := string( UTF8Decode(Raw) ); // <--- cast from UnicodeString

  CheckEquals('关于汉语', Actual);

  // check Windows ANSI code page 
  CheckEquals(1252, GetACP);
  // check Free Pascal value (determines how CP_ACP is interpreted)
  CheckEquals(65001, DefaultSystemCodePage); 
end; 

UTF8Decode returns a UnicodeString. Without the hard type cast to string, the compiler warns about an unsafe conversion:

Warning: Implicit string type conversion with potential data loss from "UnicodeString" to "AnsiString"

(tested with Lazarus 1.6 / FPCUnit GUITestrunner)

According to http://wiki.freepascal.org/Character_and_string_types#String, the string type is an alias for AnsiString when the {$H+} switch is set, and for ShortString otherwise.

It looks like Free Pascal stores the Unicode string in the AnsiString variable without loss (the test succeeds even without the cast).

Question: as the test succeeds, can I assume that it is safe to use the cast (to suppress the warning) without risking data loss?


1 Answer

The cast is not safe in general: you are still converting the UnicodeString to an AnsiString, and the encoding of an AnsiString is not known at compile time. The warning goes away only because the conversion is now explicit and the compiler assumes you know what you are doing.

Whether the cast works depends on the encoding setting of your system. Either it is UTF-8, in which case Actual holds the UTF-8 encoded string and everything works, or the particular code page of your system happens to support the characters you are using. If you run this code on a system with e. g. CP1250, it will fail. The governing variable is DefaultSystemCodePage. On startup the FPC RTL initializes it from the encoding of the system. However, some frameworks (like the LCL) override this and set it to e. g. UTF-8.
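The dependency on DefaultSystemCodePage can be made visible with a small console program. This is only a sketch, assuming FPC 3.0 or later: `SurvivesRoundTrip` is a hypothetical helper introduced for illustration, the code page is switched with `SetMultiByteConversionCodePage` from the System unit, and the test string is written as code points so that the source file encoding plays no role. On Unix-like systems the `cwstring` unit is assumed to be needed for the conversions.

```pascal
program CodePageRoundTrip;
{$mode delphi}

{$ifdef unix}
uses
  cwstring; // installs a widestring manager on Unix-like systems
{$endif}

// Hypothetical helper: converts U to an AnsiString under the given
// DefaultSystemCodePage and reports whether converting back restores U.
function SurvivesRoundTrip(const U: UnicodeString;
  CodePage: TSystemCodePage): Boolean;
var
  A: AnsiString;
begin
  SetMultiByteConversionCodePage(CodePage); // changes DefaultSystemCodePage
  A := AnsiString(U);                       // same kind of cast as in the question
  Result := UnicodeString(A) = U;
end;

var
  U: UnicodeString;
begin
  // The four characters from the question, given as UTF-16 code units
  U := #$5173#$4E8E#$6C49#$8BED;
  WriteLn('UTF-8:  ', SurvivesRoundTrip(U, CP_UTF8));
  WriteLn('CP1252: ', SurvivesRoundTrip(U, 1252));
end.
```

The text is expected to survive the round trip under UTF-8 but not under CP1252, because CP1252 has no mapping for these characters. The exact replacement behaviour depends on the platform's conversion routines.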

Use {$modeswitch unicodestrings} in addition to {$mode delphi}. The string type is then an alias for UnicodeString, so the encoding is independent of the locale.
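A minimal sketch of that setup, assuming FPC 3.0 or later (the program name and the variable names are made up for illustration). With the mode switch active, the result of UTF8Decode can be assigned to a string variable without a cast, since both sides are UnicodeString:

```pascal
program UnicodeStringsSketch;
{$mode delphi}
{$modeswitch unicodestrings} // string = UnicodeString, Char = WideChar

var
  U: UnicodeString;
  Raw: RawByteString;
  Actual: string;
begin
  // The four characters from the question, given as UTF-16 code units
  U := #$5173#$4E8E#$6C49#$8BED;
  Raw := UTF8Encode(U);

  // UnicodeString assigned to UnicodeString: no cast, no code page involved
  Actual := UTF8Decode(Raw);

  WriteLn(SizeOf(Char)); // size of one string element in bytes
  WriteLn(Actual = U);   // whether the text survived the UTF-8 round trip
end.
```

No AnsiString takes part in the assignment here, so DefaultSystemCodePage should have no influence on the result.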

  • `If you run this code on a system with e. g. CP1250, it will fail` - the code does not fail with CP1252 in my case (see my edit with additional environment checks in the unit test). It seems like Free Pascal 3 uses UTF-8 (code page 65001) with mode delphi. – mjn Mar 13 '16 at 09:58
  • This `{$mode delphi} begin writeln(DefaultSystemCodePage); end.` writes 1252 here. Are you sure you did not include additional units? – FPK Mar 13 '16 at 11:40
  • Yes, very likely one of the units which are required by FPCUnit (I use the GUI test runner) is responsible for this. I will check this later. My #1 suspect is the LCL unit 'Interfaces'. (This would also mean that the tested code only works as expected within FPCUnit as it is...) +1 – mjn Mar 13 '16 at 11:50
  • Ok, it was not clear from your code example (or I missed it) that you are using Lazarus units. `DefaultSystemCodePage` is set to UTF-8 by programs using the LCL (lazarus/components/lazutils/FPCAdds.pas line 71). – FPK Mar 13 '16 at 11:59