
I'm trying to call TEncoding.UTF8.GetString on part of a static byte array without copying its contents to a dynamic array. If the array is dynamic, I have no problem using:

stringvar := TEncoding.UTF8.GetString(dynbytearray, offset, length);

However, when I have a static array of predefined length, that syntax doesn't work, and all I could come up with is to declare a new dynamic array, set its length, and copy the bytes. I don't like the needless copying, since I suspect I'm just missing a syntax trick. My attempts, such as "SetLength(newdynarr, whatever); newdynarr := @staticarr[optional offset]", have failed so far. Thanks.

miodrag

2 Answers


The public TEncoding.GetString() method only supports dynamic arrays, but you can use the protected PByte overloads of TEncoding.GetCharCount() and TEncoding.GetChars() instead, e.g.:

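type
  // Descendant declared only to reach TEncoding's protected PByte overloads.
  TEncodingHelper = class(TEncoding)
  public
    function GetString(Bytes: PByte; ByteCount: Integer): String;
  end;

function TEncodingHelper.GetString(Bytes: PByte; ByteCount: Integer): String;
begin
  // Size the result first, then decode directly into its buffer.
  SetLength(Result, GetCharCount(Bytes, ByteCount));
  GetChars(Bytes, ByteCount, PChar(Result), Length(Result));
end;

var
  S: string;
begin
  // The cast creates nothing; it only exposes the protected methods on the existing instance.
  S := TEncodingHelper(TEncoding.UTF8).GetString(PByte(@arr[index]), ByteCount);
end;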

Or:

type
  // A class helper can reach the protected overloads without any cast.
  TEncodingHelper = class helper for TEncoding
  public
    function GetString(Bytes: PByte; ByteCount: Integer): String;
  end;

function TEncodingHelper.GetString(Bytes: PByte; ByteCount: Integer): String;
begin
  // Size the result first, then decode directly into its buffer.
  SetLength(Result, Self.GetCharCount(Bytes, ByteCount));
  Self.GetChars(Bytes, ByteCount, PChar(Result), Length(Result));
end;

var
  S: string;
begin
  S := TEncoding.UTF8.GetString(PByte(@arr[index]), ByteCount);
end;
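
For completeness, here is a minimal console sketch of the class-helper variant applied to a static array. The buffer name Buf, its size, and the sample bytes are made up for illustration. The helper is repeated so the program compiles on its own, and it assumes a Unicode Delphi (2009 or later); newer versions may need the scoped unit name System.SysUtils.

```pascal
program StaticArrayToString;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Same helper as above, repeated so this sketch is self-contained.
  TEncodingHelper = class helper for TEncoding
  public
    function GetString(Bytes: PByte; ByteCount: Integer): string;
  end;

function TEncodingHelper.GetString(Bytes: PByte; ByteCount: Integer): string;
begin
  SetLength(Result, Self.GetCharCount(Bytes, ByteCount));
  Self.GetChars(Bytes, ByteCount, PChar(Result), Length(Result));
end;

var
  // Hypothetical static buffer; only bytes 2..7 hold the text.
  Buf: array[0..15] of Byte;
  S: string;
begin
  FillChar(Buf, SizeOf(Buf), 0);
  // "hello" with an e-acute, encoded as UTF-8: 68 C3 A9 6C 6C 6F
  // (6 bytes, 5 characters).
  Buf[2] := $68; Buf[3] := $C3; Buf[4] := $A9;
  Buf[5] := $6C; Buf[6] := $6C; Buf[7] := $6F;

  // Decode 6 bytes starting at offset 2, straight from the static array.
  S := TEncoding.UTF8.GetString(PByte(@Buf[2]), 6);
  Writeln(Length(S)); // 5 UTF-16 code units
end.
```

No dynamic array is allocated here; the only allocation is the result string itself.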
David Heffernan
Remy Lebeau
  • Thx, works for now (tested with plain English only). Will comment later if I encounter problems with the messy Unicode I need to examine. – miodrag Jan 18 '17 at 22:53

You can use System.UnicodeFromLocaleChars. For instance like this:

uses
  SysUtils, SysConst, Windows;

function Utf8BytesToString(Bytes: PByte; ByteCount: Integer): string;
var
  Len: Integer;
begin
  Len := UnicodeFromLocaleChars(CP_UTF8, MB_ERR_INVALID_CHARS, Pointer(Bytes),
    ByteCount, nil, 0);
  if (ByteCount>0) and (Len=0) then begin
    raise EEncodingError.CreateRes(@SNoMappingForUnicodeCharacter);
  end;
  SetLength(Result, Len);
  UnicodeFromLocaleChars(CP_UTF8, MB_ERR_INVALID_CHARS, Pointer(Bytes),
    ByteCount, Pointer(Result), Len);
end;

The System.UnicodeFromLocaleChars function wraps MultiByteToWideChar on Windows and UnicodeFromLocaleChars on POSIX systems. The TEncoding class makes use of System.UnicodeFromLocaleChars to perform its conversions. Should you wish to travel in the opposite direction there is System.LocaleCharsFromUnicode.
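
As a rough sketch of that opposite direction, the program below encodes a string as UTF-8 straight into a caller-supplied buffer, so a static array can serve as the destination too. The name StringToUtf8Buffer, its parameters, and the error message are hypothetical; like the code above, it assumes XE or later on Windows, with CP_UTF8 taken from the Windows unit.

```pascal
program Utf8IntoStaticBuffer;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

// Hypothetical helper: encodes S as UTF-8 into Dest and returns the number
// of bytes written. Raises EEncodingError if the conversion fails, for
// example because Dest is too small.
function StringToUtf8Buffer(const S: string; Dest: PByte;
  DestSize: Integer): Integer;
begin
  Result := 0;
  if S = '' then
    Exit;
  // For CP_UTF8 the two default-character arguments must be nil.
  Result := LocaleCharsFromUnicode(CP_UTF8, 0, PWideChar(S), Length(S),
    Pointer(Dest), DestSize, nil, nil);
  if Result = 0 then
    raise EEncodingError.Create('UTF-8 conversion failed');
end;

var
  Buf: array[0..63] of Byte; // hypothetical static destination
  N: Integer;
begin
  // "hello" with an e-acute; #$00E9 avoids depending on the source encoding.
  N := StringToUtf8Buffer('h'#$00E9'llo', @Buf[0], SizeOf(Buf));
  Writeln(N); // 6 bytes: 68 C3 A9 6C 6C 6F
end.
```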

David Heffernan
  • Undeclared identifier: 'UnicodeFromLocaleChars'. Perhaps the version I use is too old (D2010). – miodrag Jan 18 '17 at 22:51
  • @miodrag: `UnicodeFromLocaleChars()` and `LocaleCharsFromUnicode()` were introduced in XE. – Remy Lebeau Jan 18 '17 at 23:04
  • @DavidHeffernan: you mean `MultiByteToWideChar()` instead, since the OP is converting *from* UTF-8 *to* UTF-16. – Remy Lebeau Jan 18 '17 at 23:05
  • Actually, my data is very messy and I examine it manually first to determine whether it's ANSI, UTF-8, or Unicode with or without a BOM (and in which byte order), so I need to set the encoding myself. – miodrag Jan 18 '17 at 23:11
  • As I'm sure you know it's not possible to detect encoding perfectly. Even so you can perfectly well pass the encoding to MultiByteToWideChar once you have detected it. – David Heffernan Jan 20 '17 at 08:14
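
Following up on the last two comments, here is a sketch that should also compile on Delphi 2010, since it calls MultiByteToWideChar from the Windows unit directly and takes the detected code page as a parameter. The name BytesToString and the sample arrays are hypothetical. Flags are passed as 0, which as far as I know makes current Windows versions substitute invalid sequences rather than fail; pass MB_ERR_INVALID_CHARS instead if an error is preferable. UTF-16 input is not covered and would need separate handling.

```pascal
program DecodeStaticBuffer;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

// Hypothetical helper: decodes ByteCount bytes at Bytes using CodePage.
function BytesToString(CodePage: Cardinal; Bytes: PByte;
  ByteCount: Integer): string;
var
  Len: Integer;
begin
  Result := '';
  if ByteCount <= 0 then
    Exit;
  // First call: ask how many UTF-16 code units are needed.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(Bytes), ByteCount, nil, 0);
  if Len = 0 then
    RaiseLastOSError;
  SetLength(Result, Len);
  // Second call: decode into the string's own buffer.
  MultiByteToWideChar(CodePage, 0, PAnsiChar(Bytes), ByteCount,
    PWideChar(Result), Len);
end;

const
  // The same text ("hello" with an e-acute) in two encodings.
  Utf8Sample: array[0..5] of Byte = ($68, $C3, $A9, $6C, $6C, $6F);
  AnsiSample: array[0..4] of Byte = ($68, $E9, $6C, $6C, $6F);
var
  A, B: string;
begin
  A := BytesToString(CP_UTF8, @Utf8Sample[0], Length(Utf8Sample));
  B := BytesToString(1252, @AnsiSample[0], Length(AnsiSample));
  Writeln(Length(A)); // 5
  Writeln(A = B);     // TRUE
end.
```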