The conversion to/from WideString on OSX doesn't work for me.

In a large program, all strings are assumed to be UTF-8 and everything works fine. Recently, a library using WideString was added; it works nicely on Linux but fails on OSX. Every non-ASCII character gets replaced by a question mark (exactly U+003F, not some garbage character), regardless of the conversion direction. I was able to reduce the problem to the following snippet:

VAR s: String; ws: WideString;
...


s := 'Maß'; // A string with some non-ASCII character

ws := LazUTF8.UTF8ToUTF16(s); // Works fine.
// ws := s; // *** Uncomment this and fail.
s := LazUTF8.UTF16ToUTF8(ws); // Works fine.
// s := ws; // *** Uncomment this and fail.
IF Pos('?', s) > 0 THEN RAISE Exception.Create('Blown!');

Uncommenting either or both of the lines marked with *** leads to the exception. It looks like some default ASCII-only conversion gets used for the implicit assignment, but I have no idea why, as OSX is said to always use UTF-8.

Including cwstring changes nothing at all. There's /usr/lib/libiconv.2.4.0.dylib installed there. Adding it via -liconv explicitly changes nothing either.

It's Lazarus 1.2.6 and FPC 2.6.4 on OSX 10.8. Is there any way to make the automatic conversion work?

maaartinus
  • Two things: [1] UTF8ToUTF16 returns a Unicode string, not a WideString. WideStrings are not Unicode strings: they are just like AnsiStrings but with 2 bytes per char (they are supposed to be already decoded). A Unicode string is physically just like an AnsiString, but a char is not necessarily a code point, so the string may contain fewer Unicode characters than chars. [2] Endianness? – Abstract type Mar 09 '15 at 15:26
  • @BBaz Thanks. I know that not everything fits into two bytes, but I don't care about such exotics. The difference between a Unicode string and a WideString is not exactly clear to me. Endianness can't explain why question marks get generated by both conversions. It was hard to debug, but I've found the culprit. – maaartinus Mar 10 '15 at 01:10

1 Answer


The culprit is this pair of lines

iconv_wide2ansi:=iconv_open(nl_langinfo(CODESET), unicode_encoding2);
iconv_ansi2wide:=iconv_open(unicode_encoding2, nl_langinfo(CODESET));

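in cwstring.pp, combined with nl_langinfo(CODESET) returning US-ASCII. With that codeset, iconv converts between UTF-16 and plain ASCII, so every non-ASCII character ends up as a question mark. I have no idea why US-ASCII gets reported, but my solution was to make my own version of cwstring.pp with UTF-8 hardcoded in place of nl_langinfo(CODESET). This is surely not nice, but it's correct (the program always assumes Strings are UTF-8, and I really see no reason why program internals should be OS dependent) and it works fine.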
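To check what the C library reports on a given machine, a small diagnostic along these lines may help. It is only a sketch: the program name is made up, the numeric values of CODESET and LC_ALL are assumptions taken from the usual Darwin and glibc headers (verify them against langinfo.h and locale.h on your system), and the two functions are bound directly to libc rather than going through cwstring. A process running in the plain "C" locale, for instance because LANG is not set in its environment, would typically report an ASCII codeset here, which may be what happens in the failing case.

```pascal
program CheckCodeset;

{$mode objfpc}{$H+}

uses
  SysUtils, ctypes;

const
  // ASSUMPTION: values copied from the platform C headers.
  // Check <langinfo.h> and <locale.h> before trusting them.
  {$ifdef darwin}
  CODESET = 0;
  LC_ALL  = 0;
  {$else}
    {$ifdef linux}
  CODESET = 14;   // glibc
  LC_ALL  = 6;    // glibc
    {$else}
      {$fatal Unknown platform: look up CODESET and LC_ALL yourself}
    {$endif}
  {$endif}

// Direct bindings to the C library, declared here for illustration only.
function setlocale(category: cint; locale: PChar): PChar; cdecl;
  external 'c' name 'setlocale';
function nl_langinfo(item: cint): PChar; cdecl;
  external 'c' name 'nl_langinfo';

begin
  // Pick up the locale from the environment before querying the codeset.
  setlocale(LC_ALL, '');

  WriteLn('LANG    = "', GetEnvironmentVariable('LANG'), '"');
  WriteLn('LC_ALL  = "', GetEnvironmentVariable('LC_ALL'), '"');
  WriteLn('CODESET = ', nl_langinfo(CODESET));
end.
```

If CODESET prints UTF-8 when the program is started from a terminal but US-ASCII when it is started some other way, the environment of the process is the likely difference.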

In the meantime the file in FPC has changed, so this might have been fixed already.
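If patching cwstring.pp is not an option, the implicit assignments can be avoided altogether. The sketch below wraps the RTL functions UTF8Decode and UTF8Encode, which convert between UTF-8 and UTF-16 directly and, as far as I can tell, do not consult nl_langinfo(CODESET). The helper names Utf8ToWide and WideToUtf8 are hypothetical, and the test string is spelled out as raw bytes so the result does not depend on the source file encoding.

```pascal
program ExplicitConv;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: UTF-8 bytes in an AnsiString -> UTF-16 WideString.
function Utf8ToWide(const s: AnsiString): WideString;
begin
  Result := UTF8Decode(s);
end;

// Hypothetical helper: UTF-16 WideString -> UTF-8 bytes in an AnsiString.
function WideToUtf8(const ws: WideString): AnsiString;
begin
  Result := UTF8Encode(ws);
end;

var
  s, original: AnsiString;
  ws: WideString;
begin
  // 'Maß' as UTF-8: 'M', 'a', then $C3 $9F for the sharp s.
  original := #$4D#$61#$C3#$9F;

  ws := Utf8ToWide(original);   // explicit, instead of ws := s
  s  := WideToUtf8(ws);         // explicit, instead of s := ws

  if (Pos('?', s) > 0) or (s <> original) then
    raise Exception.Create('Blown!');

  WriteLn('Round trip OK');
end.
```

This does not make the automatic conversion work; it only sidesteps it, so every place where a String meets a WideString has to call the helpers explicitly.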

maaartinus