How to convert widestring to string of unicode bytes?

Question

When I create a file in Notepad containing, for example, the string 1d and save it as a Unicode file, I get a 6-byte file containing the bytes #255#254#49#0#100#0.

OK. Now I need a Delphi 6 function that takes, for example, the WideString 1d as input and returns the string containing #255#254#49#0#100#0 (and vice versa).

How? Thanks. D

Looks like you need to hire a programmer. Have you made any effort to do this yourself? We're not a code-writing service. — Ken White, Dec 01 '16 at 17:29
Possible duplicate http://stackoverflow.com/questions/12337123/widestring-to-string-conversion-in-delphi-7 — Ilyes, Dec 01 '16 at 17:41
#255#254 is [the BOM for the file](https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16) (specifically UTF-16LE) - it has nothing to do with the string itself. If you're trying to manipulate Unicode it's probably worth learning how it works first. — J..., Dec 01 '16 at 18:59

Remy Lebeau · Answer 1 · 2016-12-05T21:38:02.077

It is easier to read bytes if you use hex. #255#254#49#0#100#0 is represented in hex as

FF FE 31 00 64 00

Where

FF FE is the UTF-16LE BOM, which identifies the following bytes as being encoded as UTF-16 using values in Little Endian.

31 00 is the character '1' (U+0031), stored as a 16-bit value with the low byte first

64 00 is the character 'd' (U+0064), stored the same way.
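
If you want to check this byte layout yourself, a small helper along these lines should do. This is only a sketch: WideStringToHex is a made-up name, not an RTL function, and it assumes the in-memory layout of a WideString on Windows (UTF-16LE).

```pascal
uses
  SysUtils;

// Hypothetical helper (not part of the RTL): returns the raw bytes of a
// WideString's character data as space-separated hex pairs.
function WideStringToHex(const W: WideString): string;
var
  P: PAnsiChar;
  I: Integer;
begin
  Result := '';
  // View the 16-bit character data as a plain sequence of bytes
  P := PAnsiChar(PWideChar(W));
  for I := 0 to (Length(W) * SizeOf(WideChar)) - 1 do
  begin
    if I > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(P[I]), 2);
  end;
end;
```

Calling WideStringToHex(WideChar($FEFF) + '1d') returns 'FF FE 31 00 64 00', the same six bytes that Notepad writes.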

To create a WideString containing these bytes is very easy:

var
  W: WideString;
  S: String;
begin
  S := '1d';
  W := WideChar($FEFF) + S;
end;

When an AnsiString (which is Delphi 6's default string type) is assigned to a WideString, the RTL automatically converts the AnsiString data from 8-bit to UTF-16LE using the local machine's default Ansi charset for the conversion.

Going the other way is just as easy:

var
  W: WideString;
  S: String;
begin
  W := WideChar($FEFF) + '1d';
  S := Copy(W, 2, MaxInt);
end;

When you assign a WideString to an AnsiString, the RTL automatically converts the WideString data from UTF-16LE to 8-bit using the default Ansi charset.

If the default Ansi charset is not suitable for your needs (say the 8-bit data needs to be encoded in a different charset), you will have to call the Win32 API MultiByteToWideChar() and WideCharToMultiByte() functions directly (or use a 3rd party library with equivalent functionality) so you can specify the desired charset/codepage.
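
A minimal sketch of how those two API calls are typically wrapped is shown here. AnsiToWide and WideToAnsi are hypothetical helper names, and error handling is reduced to returning an empty string, so treat this as a starting point rather than a finished implementation.

```pascal
uses
  Windows;

// Hypothetical helper: decodes 8-bit data in the given codepage to UTF-16.
function AnsiToWide(const S: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then Exit;
  // First call: ask how many WideChars the converted text needs
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  // Second call: do the actual conversion into the allocated buffer
  MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;

// Hypothetical helper: encodes UTF-16 text as 8-bit data in the given codepage.
function WideToAnsi(const W: WideString; CodePage: UINT): AnsiString;
var
  Len: Integer;
begin
  Result := '';
  if W = '' then Exit;
  // First call: ask how many bytes the converted text needs
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(W), Length(W),
    nil, 0, nil, nil);
  if Len <= 0 then Exit;
  SetLength(Result, Len);
  // Second call: do the actual conversion into the allocated buffer
  WideCharToMultiByte(CodePage, 0, PWideChar(W), Length(W),
    PAnsiChar(Result), Len, nil, nil);
end;
```

For example, AnsiToWide(S, 1251) would treat S as Windows-1251 (Cyrillic) data regardless of the machine's default charset, and WideToAnsi(W, 1251) goes back the other way. Characters that do not exist in the target codepage are replaced by the system default character.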

Now then, Delphi 6 does not offer any useful helpers to read Unicode files (Delphi 2009 and later do), so you will have to do it yourself manually, for example:

function ReadUnicodeFile(const FileName: string): WideString;
const
  cBOM_UTF8: array[0..2] of Byte = ($EF, $BB, $BF);
  cBOM_UTF16BE: array[0..1] of Byte = ($FE, $FF);
  cBOM_UTF16LE: array[0..1] of Byte = ($FF, $FE); 
  cBOM_UTF32BE: array[0..3] of Byte = ($00, $00, $FE, $FF);
  cBOM_UTF32LE: array[0..3] of Byte = ($FF, $FE, $00, $00);
var
  FS: TFileStream;
  BOM: array[0..3] of Byte;
  NumRead: Integer;
  U8: UTF8String;
  U32: UCS4String;
  I: Integer;
begin
  Result := '';
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    NumRead := FS.Read(BOM, 4);

    // UTF-8
    if (NumRead >= 3) and CompareMem(@BOM, @cBOM_UTF8, 3) then
    begin
      if NumRead > 3 then
        FS.Seek(-(NumRead-3), soCurrent);
      SetLength(U8, FS.Size - FS.Position);
      if Length(U8) > 0 then
      begin
        FS.ReadBuffer(PAnsiChar(U8)^, Length(U8));
        Result := UTF8Decode(U8);
      end;
    end

    // the UTF-16LE and UTF-32LE BOMs are ambiguous! Check for UTF-32 first...

    // UTF-32
    else if (NumRead = 4) and (CompareMem(@BOM, @cBOM_UTF32LE, 4) or CompareMem(@BOM, @cBOM_UTF32BE, 4)) then
    begin
      // UCS4String is not a true string type, it is a dynamic array, so
      // it must include room for a null terminator...
      SetLength(U32, ((FS.Size - FS.Position) div SizeOf(UCS4Char)) + 1);
      if Length(U32) > 1 then
      begin
        FS.ReadBuffer(PUCS4Chars(U32)^, (Length(U32) - 1) * SizeOf(UCS4Char));
        if CompareMem(@BOM, @cBOM_UTF32BE, 4) then
        begin
          for I := Low(U32) to High(U32) do
          begin
            U32[I] := ((U32[I] and $000000FF) shl 24) or
                      ((U32[I] and $0000FF00) shl 8) or
                      ((U32[I] and $00FF0000) shr 8) or
                      ((U32[I] and $FF000000) shr 24);
          end;
        end;
        U32[High(U32)] := 0;
        // Note: UCS4StringToWideString() does not actually support UTF-16,
        // only UCS-2! If you need to handle UTF-16 surrogates, you will
        // have to convert from UTF-32 to UTF-16 manually, there is no RTL
        // or Win32 function that will do it for you...
        Result := UCS4StringToWideString(U32);
      end;
    end

    // UTF-16
    else if (NumRead >= 2) and (CompareMem(@BOM, @cBOM_UTF16LE, 2) or CompareMem(@BOM, @cBOM_UTF16BE, 2)) then
    begin
      if NumRead > 2 then
        FS.Seek(-(NumRead-2), soCurrent);
      SetLength(Result, (FS.Size - FS.Position) div SizeOf(WideChar));
      if Length(Result) > 0 then
      begin
        FS.ReadBuffer(PWideChar(Result)^, Length(Result) * SizeOf(WideChar));
        if CompareMem(@BOM, @cBOM_UTF16BE, 2) then
        begin
          // big-endian data: swap the two bytes of every 16-bit code unit
          for I := 1 to Length(Result) do
          begin
            Result[I] := WideChar(
                           ((Word(Result[I]) and $00FF) shl 8) or
                           ((Word(Result[I]) and $FF00) shr 8)
                         );
          end;
        end;
      end;
    end

    // something else, assuming UTF-8
    else
    begin
      if NumRead > 0 then
        FS.Seek(-NumRead, soCurrent);
      SetLength(U8, FS.Size - FS.Position);
      if Length(U8) > 0 then
      begin
        FS.ReadBuffer(PAnsiChar(U8)^, Length(U8));
        Result := UTF8Decode(U8);
      end;
    end;
  finally
    FS.Free;
  end;
end;
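
Writing goes the same way in reverse. A rough sketch, assuming you want the same layout Notepad produces (UTF-16LE with a leading BOM); WriteUnicodeFile is a hypothetical name chosen to mirror the reader:

```pascal
uses
  Classes, SysUtils;

// Hypothetical counterpart to the reader: writes the text as UTF-16LE,
// preceded by the FF FE byte order mark.
procedure WriteUnicodeFile(const FileName: string; const Text: WideString);
const
  cBOM_UTF16LE: array[0..1] of Byte = ($FF, $FE);
var
  FS: TFileStream;
begin
  FS := TFileStream.Create(FileName, fmCreate);
  try
    // BOM first, so readers can detect the encoding
    FS.WriteBuffer(cBOM_UTF16LE, SizeOf(cBOM_UTF16LE));
    // A WideString is already UTF-16LE in memory, so its character
    // data can be written out as-is
    if Length(Text) > 0 then
      FS.WriteBuffer(PWideChar(Text)^, Length(Text) * SizeOf(WideChar));
  finally
    FS.Free;
  end;
end;
```

Here the BOM is written separately, so Text itself should not start with WideChar($FEFF). Writing '1d' this way gives the 6-byte file from the question.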

Update: if you want to store UTF-16LE encoded bytes inside an AnsiString variable (why?), you can Move() the raw bytes of a WideString's character data into the memory block of an AnsiString, e.g.:

function WideStringAsAnsi(const AValue: WideString): AnsiString;
begin
  SetLength(Result, Length(AValue) * SizeOf(WideChar));
  Move(PWideChar(AValue)^, PAnsiChar(Result)^, Length(Result));
end;

var
  W: WideString;
  S: AnsiString;
begin
  W := WideChar($FEFF) + '1d';
  S := WideStringAsAnsi(W);
end;

I would not suggest misusing AnsiString like this, though. If you need bytes, operate on bytes, e.g.:

type
  TBytes = array of Byte;

function WideStringAsBytes(const AValue: WideString): TBytes;
begin
  SetLength(Result, Length(AValue) * SizeOf(WideChar));
  Move(PWideChar(AValue)^, PByte(Result)^, Length(Result));
end;

var
  W: WideString;
  B: TBytes;
begin
  W := WideChar($FEFF) + '1d';
  B := WideStringAsBytes(W);
end;
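
For the "vice versa" direction, the same Move() approach works in reverse. This is a sketch only: BytesAsWideString is a hypothetical name, it assumes the bytes really are UTF-16LE, and it does not strip a BOM, so a leading FF FE comes back as WideChar($FEFF) at the start of the string.

```pascal
type
  TBytes = array of Byte;

// Hypothetical inverse of WideStringAsBytes: reinterprets raw UTF-16LE
// bytes as WideString character data. A trailing odd byte is ignored.
function BytesAsWideString(const AValue: TBytes): WideString;
begin
  SetLength(Result, Length(AValue) div SizeOf(WideChar));
  if Length(Result) > 0 then
    Move(AValue[0], PWideChar(Result)^, Length(Result) * SizeOf(WideChar));
end;
```

With this, BytesAsWideString(WideStringAsBytes(W)) gives back the original W, BOM included if W had one.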

Thank you. Can you give an example using MultiBytetowidechar() and viceversa API for this problem (other charset)? — dan matei, Dec 04 '16 at 10:25
sorry, in your second example length(s) returns 2, while i expected 4! (#49#0#100#0). — dan matei, Dec 04 '16 at 13:16
@danmatei there are plenty of examples of `MultiByteToWideChar` and `WideCharToMultiByte` if you look around and read the documentation. `Length()` returns the number of elements, not the number of bytes. `WideString` uses 16bit elements, `AnsiString` uses 8bit elements. `W` is a `WideString`, `S` is an `AnsiString`. `W` contains 3 elements (BOM, `1`, `d`). The converted `AnsiString` contains 2 elements (`1`, `d`), not 4 or 6. — Remy Lebeau, Dec 04 '16 at 17:15
the ANSIstring #255#254#49#0#100#0 contains 6 elements. this is what i want my function to return. how? — dan matei, Dec 05 '16 at 19:01

score 1 · Answer 2 · answered Dec 01 '16 at 19:41

A WideString is already a string of Unicode bytes. Specifically, in UTF16-LE encoding.

The two extra bytes you see in the Unicode file saved by Notepad are called a BOM - Byte Order Mark. This is a special character in Unicode that is used to indicate the order of bytes in the data that follows, to ensure that the string is decoded correctly.

Adding a BOM to a string (which is what you are asking for) is simply a matter of prefixing the string with that special BOM character. The BOM character is U+FEFF (Unicode notation for the character whose code point is hex FEFF).

So, the function you need is very simple:

function WideStringWithBOM(aString: WideString): WideString;
const
  BOM = WideChar($FEFF);
begin
  result := BOM + aString;
end;

However, although the function is very simple, this possibly isn't the end of the matter.

The string that is returned from this function will include the BOM and as far as any Delphi code is concerned that BOM will be treated as part of the string.

Typically you would only add a BOM to a string when passing that string to some external recipient (via a file or web service response, for example) if there is no other mechanism for indicating the encoding you have used.

Likewise, when reading strings from some received data which may be Unicode you should check the first two bytes:

If you find #255#254 ($FFFE) then you know that the bytes in the U+FEFF BOM have been switched (U+FFFE is not a valid Unicode character), i.e. the string that follows is UTF16-LE. Therefore, for a Delphi WideString you can discard those first two bytes and load the remaining bytes directly into a suitable WideString variable.
If you find #254#255 then the bytes in the U+FEFF BOM have not been switched around, i.e. you know that the string that follows is UTF16-BE. In that case you again need to discard the first two bytes, but when loading the remaining bytes into the WideString you must switch each pair of bytes around to convert from the UTF16-BE bytes to the UTF16-LE encoding of a WideString.
If the first 2 bytes are neither #255#254 nor #254#255 then you are either dealing with UTF16-LE without a BOM or possibly some other encoding entirely.
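
Those three cases could be sketched in code roughly as follows. TUtf16Order, DetectUtf16BOM and DecodeUtf16 are hypothetical names, and treating data without a BOM as UTF16-LE is an assumption made purely for illustration; real data without a BOM may be something else entirely.

```pascal
type
  TUtf16Order = (boUnknown, boLittleEndian, boBigEndian);

// Hypothetical helper: inspects the first two bytes for a UTF-16 BOM.
function DetectUtf16BOM(const Data: AnsiString): TUtf16Order;
begin
  Result := boUnknown;
  if Length(Data) >= 2 then
  begin
    if (Data[1] = #255) and (Data[2] = #254) then
      Result := boLittleEndian
    else if (Data[1] = #254) and (Data[2] = #255) then
      Result := boBigEndian;
  end;
end;

// Hypothetical helper: drops the BOM (if any) and builds a WideString,
// swapping each byte pair when the data is big-endian.
function DecodeUtf16(const Data: AnsiString): WideString;
var
  Order: TUtf16Order;
  Start, I, P: Integer;
begin
  Order := DetectUtf16BOM(Data);
  if Order = boUnknown then
    Start := 1   // no BOM: ASSUMPTION, treat the data as UTF16-LE
  else
    Start := 3;  // skip the 2-byte BOM
  SetLength(Result, (Length(Data) - Start + 1) div 2);
  for I := 1 to Length(Result) do
  begin
    P := Start + (I - 1) * 2;  // index of the first byte of code unit I
    if Order = boBigEndian then
      Result[I] := WideChar((Ord(Data[P]) shl 8) or Ord(Data[P + 1]))
    else
      Result[I] := WideChar(Ord(Data[P]) or (Ord(Data[P + 1]) shl 8));
  end;
end;
```

Given the bytes #255#254#49#0#100#0 this returns the WideString '1d', and given #254#255#0#49#0#100 it returns the same thing.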

Good luck. :)
