
I am an XE5 user. I have a client/server app written in D7 that I have upgraded to XE5. Because D7 was not Unicode, I used the following type:

TRap = array[0..254] of AnsiChar;

and I am sending this data to the server over TCP/IP. The server has the same definition. Now I need to upgrade to Unicode, but the size must stay the same, because I am using the following record:

PMovieData = ^TMovieData;
TMovieData = packed record
   rRap: TRap;
   rKey: string[7];
   iID: integer;
end;

I have tried changing TRap to this:

TRap2 = array[0..127] of WideChar;

However, the sizes are not equal: TRap is 255 bytes but TRap2 is 256. I cannot change the size, as it must still work with the existing version. Do you have any recommendations?

blacksun
  • Getting the sizes to match is the least of your concerns. The fact that your text encodings don't match is the real problem. Perhaps UTF-8 could be your saviour here. But really you are reaping the result of a bad design choice. You should have serialized using JSON or perhaps some compact binary variant like BSON. Then you'd be immune to this. – David Heffernan Sep 30 '14 at 11:00
  • And I bet that you send this data over the wire with the integer in little endian layout!! – David Heffernan Sep 30 '14 at 11:07

3 Answers


Well, since wide characters are 2 bytes wide, an array of wide characters has even size. And 255 is not even. So you cannot overlay an array of wide characters over the original array.

I suppose you could have an array of 127 wide characters and a padding byte:

TMovieData = packed record
  rRap: array [0..126] of WideChar;
  _reserved: Byte;
  rKey: string[7];
  iID: integer;
end;

I cannot imagine this would help much, since an old component would interpret the wide character data as 8-bit data, which would give unexpected results: the text would essentially be garbled.

You might consider some other options:

  1. Continue with an array of 8-bit characters but encode as UTF-8. This would be compatible over the ASCII range, but any text outside that range would be garbled on a peer that has not been upgraded. Of course, at the moment your existing solution can only work if both client and server use the same 8-bit character set anyway.
  2. Switch all components to use a new record format. In fact, if you did this you could abandon your rigid record format entirely and serialize the record using JSON, encode that as UTF-8, and transmit those bytes (a minimal sketch follows this list). This would also allow you to fix the problem that your current implementation does not respect network byte order.
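
For the second option, here is a minimal sketch of the sending side using the Data.DBXJSON unit that ships with XE5 (later releases moved it to System.JSON). The field names and the MovieDataToUTF8 helper are illustrative assumptions, not part of the question's code:

uses
  System.SysUtils, Data.DBXJSON;

function MovieDataToUTF8(const Rap, Key: string; ID: Integer): TBytes;
var
  Obj: TJSONObject;
begin
  Obj := TJSONObject.Create;
  try
    // field names are illustrative; both peers just need to agree on them
    Obj.AddPair('rap', Rap);
    Obj.AddPair('key', Key);
    Obj.AddPair('id', TJSONNumber.Create(ID));
    // ToString yields the JSON text; UTF-8 gives a well-defined wire encoding
    Result := TEncoding.UTF8.GetBytes(Obj.ToString);
  finally
    Obj.Free; // frees the owned pairs as well
  end;
end;

On the receiving side, TJSONObject.ParseJSONValue can turn the received bytes back into an object, and because the payload is text, host byte order ceases to matter.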
David Heffernan

The code you showed works exactly the same way in XE5 as it did in D7. AnsiChar and short strings are still available and still 8-bit ANSI, and thus are the same size they have always been. You don't have to change your definitions at all...

... UNLESS you want to compile for mobile, where AnsiChar and short strings are no longer available (you can install an RTL patch to get them back). In that case you can change the code to the following to maintain compatibility with the server:

TRap = array[0..254] of Byte;

PMovieData = ^TMovieData;
TMovieData = packed record
   rRap: TRap;
   rKey: array[0..7] of Byte; // put the string length in rKey[0]
   iID: integer;
end;
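
To move text in and out of those Byte fields you still transcode explicitly. A minimal sketch, assuming the server expects the system ANSI codepage (StringToRap is a hypothetical helper name):

uses
  System.SysUtils;

// Fill rRap from a UTF-16 string, transcoding to ANSI bytes;
// anything beyond 255 bytes is truncated and the remainder zero-filled.
procedure StringToRap(const S: string; var Rap: TRap);
var
  Bytes: TBytes;
  Len: Integer;
begin
  Bytes := TEncoding.ANSI.GetBytes(S);
  FillChar(Rap, SizeOf(Rap), 0);
  Len := Length(Bytes);
  if Len > SizeOf(Rap) then
    Len := SizeOf(Rap);
  if Len > 0 then
    Move(Bytes[0], Rap[0], Len);
end;

The reverse direction is TEncoding.ANSI.GetString over the received bytes, stopping at the first zero byte.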
Remy Lebeau

You have mentioned in the question that you are sending the data held in this array data type to a server.

If that server implementation is under your control and is also being upgraded to Unicode, then you simply need to ensure that both sides of the data exchange agree on the content of the data. If you need an array to hold a "string" of length up to 255 characters (which is what your ANSI implementation supported), then your Unicode version also needs to support 255 characters. What changes is the amount of data required for each character: 2 bytes, as opposed to 1. (It's not quite as simple as that, but I am presuming that your ANSI implementation didn't deal with MBCS issues, and that your Unicode implementation similarly won't be concerned with surrogate pairs in terms of affecting the number of effective "characters".)

i.e. your TRap array should be:

TRap = array[0..254] of WIDEChar;

However, if the server implementation is NOT under your control (which is hinted at by your observation that the new code must continue to work with the old version), then no matter what changes you make in the client application, the server will continue to expect 255 ANSI chars. In that case your TRap type must simply remain identical to before (an array of ANSIChar), and you must instead ensure that you convert your WIDE string characters to ANSI (and vice versa) as they pass in and out of the array (see the sketch below).
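
A minimal sketch of that conversion, assuming the default ANSI codepage on both sides (the helper names are illustrative):

uses
  System.SysUtils;

// UTF-16 string -> fixed ANSI array, zero-padded and truncated to fit
procedure StringToRap(const S: string; var Rap: TRap);
var
  A: AnsiString;
  Len: Integer;
begin
  A := AnsiString(S);            // transcode via the default codepage
  FillChar(Rap, SizeOf(Rap), 0);
  Len := Length(A);
  if Len > SizeOf(Rap) then
    Len := SizeOf(Rap);
  if Len > 0 then
    Move(A[1], Rap[0], Len);
end;

// fixed ANSI array -> UTF-16 string, stopping at the first #0
function RapToString(const Rap: TRap): string;
var
  A: AnsiString;
  Len: Integer;
begin
  Len := 0;
  while (Len < Length(Rap)) and (Rap[Len] <> #0) do
    Inc(Len);
  SetString(A, PAnsiChar(@Rap[0]), Len);
  Result := string(A);
end;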

NOTE: There is no point putting UTF-8 into that array, or resorting to any other contrivance simply to make the same number of chars "fit", unless the old version of the server that the new code must work with already accommodates receiving UTF-8 encoded characters (which it almost certainly does not, based on the information to hand).

Returning to the case where the server implementation is under your control, you could potentially maintain support for an ANSI TRap type as well as introducing a new WIDE char implementation by incorporating a "magic cookie" in the new WIDE char version of the data structure to allow the server to identify what type of data structure is being passed and adapt accordingly. Something along these lines:

TANSIRap = array[0..254] of ANSIChar;
TWIDERap = array[0..254] of WIDEChar;


// ANSIMovieData corresponds *exactly* to the old MovieData structure

PANSIMovieData = ^TANSIMovieData;
TANSIMovieData = packed record
   rRap: TANSIRap;
   rKey: string[7];
   iID: integer;
end;


// WIDEMovieData adds a magic cookie "header" identifying the new structure type

PWIDEMovieData = ^TWIDEMovieData;
TWIDEMovieData = packed record
   rCookie: Word;
   rRap: TWIDERap;
   rKey: string[7];
   iID: integer;
end;

When sending a TWIDEMovieData, set the rCookie field to some constant value that cannot occur at the start of a valid TANSIRap array. e.g. if an "empty" TRap array (i.e. the ANSI version) is all 0's, then a cookie whose in-memory bytes are a #0 character immediately followed by an #$FF character could be suitable, since no valid ANSI movie data structure could start with a #0 terminator immediately followed by a non-zero byte. Note that on a little-endian machine those two bytes correspond to the Word value $FF00:

const 
  MOVIEDATA_WIDECOOKIE = $FF00; // stored in memory as the bytes #0, #$FF



// new client pseudo-code (SendToServer and ServerIsANSI are placeholders):

var
  data: TANSIMovieData;
  wdata: TWIDEMovieData;
begin
  if ServerIsANSI then
  begin
    data...             // init ANSI movie data (TANSIRap ANSI chars converted from WIDE)
    SendToServer(data); // etc
  end
  else // WIDE server
  begin
    wdata.rCookie := MOVIEDATA_WIDECOOKIE;
    wdata...             // init WIDE movie data
    SendToServer(wdata); // etc
  end;
end;



// server pseudo-code (the receive buffer must be big enough for either layout):

var
  buffer: array[0..SizeOf(TWIDEMovieData) - 1] of Byte;
  data: PANSIMovieData;
  wdata: PWIDEMovieData;
begin
  ReceiveDataIntoBuffer(buffer);

  data  := PANSIMovieData(@buffer);
  wdata := PWIDEMovieData(@buffer);

  if wdata.rCookie = MOVIEDATA_WIDECOOKIE then
    HandleWideData(wdata)
  else
    ExistingANSIDataHandler(data);
end;

The only complication then is how a new client can determine whether the server it is communicating with is capable of supporting WIDE movie data. But this is only an issue if the server implementation is under your control (if not, then you can only continue to use ANSI anyway), so you should be able to contrive some mechanism for identifying server capabilities in a way that allows older servers to be reliably identified by new clients.

Absolute worst case you may need a client configuration setting (and notice that if you configure a client to use ANSI even for a new server it will still continue to work, just without Unicode support).

New client / New server : client and server both use WIDE (but will also work with ANSI)
New client / Old server : client uses ANSI
Old client / New server : server detects ANSI
Old client / Old server : no change

Depending on the server implementation you may need to read the data in stages: read the first 2 bytes to obtain either the first two TANSIRap chars or the cookie, depending on the data structure passed, and then read the remaining bytes in the packet according to whether you detected the cookie or not. The principle is essentially the same (see the sketch below).
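
A minimal sketch of that staged read, assuming a hypothetical blocking ReadBytes(buffer, count) helper:

var
  prefix: Word;
  data: TANSIMovieData;
  wdata: TWIDEMovieData;
begin
  // the first two bytes are either the cookie or the first two ANSI chars
  ReadBytes(@prefix, SizeOf(prefix));

  if prefix = MOVIEDATA_WIDECOOKIE then
  begin
    wdata.rCookie := prefix;
    // read the remainder of the WIDE structure
    ReadBytes(PByte(@wdata) + SizeOf(prefix), SizeOf(wdata) - SizeOf(prefix));
    HandleWideData(@wdata);
  end
  else
  begin
    // the two bytes were actually the start of the ANSI rRap field
    Move(prefix, data.rRap[0], SizeOf(prefix));
    ReadBytes(PByte(@data) + SizeOf(prefix), SizeOf(data) - SizeOf(prefix));
    ExistingANSIDataHandler(@data);
  end;
end;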

Deltics
  • UTF-8 could be a useful bridge if it doesn't matter that non-ASCII characters may get garbled. It would allow you to support Unicode if both sides are upgraded, but have reduced functionality with a partial upgrade. If both sides are to be changed then UTF-8 is generally better than UTF-16 because encoded text is usually smaller in size. And if we are going to start from scratch then packed records with short strings and host byte order values should go too! – David Heffernan Sep 30 '14 at 22:35
  • If you are going to transcode from UTF-16 to some other codepage, why deliberately choose one that you know the (old) server does not support? The (old) server expects ANSI, and transcoding to ANSI is no more difficult than transcoding to UTF-8. Certainly ANSI is not ideal. It immediately raises the question of which codepage to use, but the answer to that is clear from the problem: the aim is to reproduce the old behaviour, and my money is on the old code simply using the default codepage. Easily preserved behaviour. Either way, choosing UTF-8 in these circumstances simply makes no sense. – Deltics Oct 01 '14 at 03:24
  • But as stated in the question, another goal is to support international text. Which cannot be done with ANSI. So we cannot do that and retain full backwards compat. – David Heffernan Oct 01 '14 at 06:05
  • It is entirely wrong to say that ANSI does not support international characters. ANSI <> ASCII. Though they may not know it, the OP's code already supports (some) international characters, but does so with an ANSI codepage. A "partial upgrade" scenario (client to UTF-8) could potentially result in a significant DOWNgrade (users would say "bugs") w.r.t. existing data retrieved from an ANSI server. I explain how to retain 100% backwards compatibility (i.e. ANSI) AND achieve Unicode support in new versions if the server code is able to be modified. – Deltics Oct 01 '14 at 19:27
  • OK, there's little point in this if you think ANSI is international or can meet the requirement to support Unicode as stated in the question. – David Heffernan Oct 01 '14 at 19:30