5

What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a friendly WideString "Fürst"?

I am aware of the fact that this is not always possible for all combinations, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Characters unit, but I don't get it.

RRUZ
  • As far as I know, that's not any standard string format, so you'll have to decode it yourself. Which part are you having trouble with, decoding the escaped characters, or finding the corresponding combined character? – Rob Kennedy Nov 18 '10 at 14:54
  • Decoding the escaped characters is trivial; finding the corresponding combined character is the problem. But it looks like the WinAPI call NormalizeString, as suggested by Roddy, pointed me in the right direction. – Erwin Jurschitza Nov 18 '10 at 14:59

5 Answers

4

I think you need to perform Unicode Normalization on your string.

I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:

NormalizationKC

Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature ﬁ becomes f + i; similarly, A + ¨ + ﬁ + n becomes Ä + f + i + n.
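To make the difference between the composition forms concrete, here is a minimal sketch that runs the same input through form C and form KC. The constant values come from the NORM_FORM enumeration in the Windows SDK; the helper name NormalizeWith is hypothetical, and the code assumes a Windows version where Normaliz.dll is available (Vista or later).

```pascal
program NormFormDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

const
  // Values from the NORM_FORM enumeration in the Windows SDK
  NormalizationC  = 1;  // canonical composition
  NormalizationKC = 5;  // compatibility composition

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
  lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper: normalizes s with the given form
function NormalizeWith(NormForm: Integer; const s: string): string;
var
  len: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  // A zero-length destination asks the API for an estimated buffer size
  len := NormalizeString(NormForm, PChar(s), Length(s), nil, 0);
  if len <= 0 then
    RaiseLastOSError;
  SetLength(Result, len);
  len := NormalizeString(NormForm, PChar(s), Length(s), PChar(Result), Length(Result));
  if len <= 0 then
    RaiseLastOSError;
  // Trim to the number of characters actually written
  SetLength(Result, len);
end;

var
  src: string;
begin
  // 'A' + combining diaeresis (U+0308) + the "fi" ligature (U+FB01) + 'n'
  src := 'A'#$0308#$FB01'n';
  // Form C composes the umlaut but keeps the ligature: prints 3
  Writeln(Length(NormalizeWith(NormalizationC, src)));
  // Form KC also splits the ligature into 'f' + 'i': prints 4
  Writeln(Length(NormalizeWith(NormalizationKC, src)));
end.
```

If you only want combining marks folded into precomposed letters and would rather leave ligatures and other compatibility characters alone, form C is probably the safer choice.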

Roddy
2

Here is the complete code that solved my problem:

function Unescape(const s: AnsiString): string;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1;
  j := 1;
  while i <= Length(s) do begin
    if s[i] = '\' then begin
      if i < Length(s) then begin
        // escaped backslash?
        if s[i + 1] = '\' then begin
          Result[j] := '\';
          inc(i, 2);
        end
        // convert hex number to WideChar
        else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s)) 
                and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then begin
          inc(i, 6);
          Result[j] := WideChar(c);
        end else begin
          raise Exception.CreateFmt('Invalid code at position %d', [i]);
        end;
      end else begin
        raise Exception.Create('Unexpected end of string');
      end;
    end else begin
      Result[j] := WideChar(s[i]);
      inc(i);
    end;
    inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j - 1);
end;

const
  NormalizationC = 1;

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
 lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

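function Normalize(const s: string): string;
var
  newLength: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  // NFC output is usually no longer than the input, but a few characters expand,
  // so ask the API for an estimated buffer size instead of assuming Length(s)
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), nil, 0);
  if newLength <= 0 then
    RaiseLastOSError;
  SetLength(Result, newLength);
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
  if newLength <= 0 then
    RaiseLastOSError;
  SetLength(Result, newLength);
end;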

function UnescapeAndNormalize(const s: AnsiString): string;
begin
  Result := Normalize(Unescape(s));
end;

Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)

1

Are they always escaped like this? Always in a number of 4 digits?

How is the \ character itself escaped?

Assuming the \ character is escaped by \xxxx where xxxx is the code for the \ character, you can easily loop through the string:

function Unescape(s: AnsiString): WideString;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1; j := 1;
  while i <= Length(s) do
  begin
     // If a '\' is found, typecast the following 4 digit integer to widechar
     if s[i] = '\' then
     begin
       if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
         raise Exception.CreateFmt('Invalid code at position %d', [i]);

       Inc(i, 6);
       Result[j] := WideChar(c);
     end
     else
     begin
       Result[j] := WideChar(s[i]);
       Inc(i);
     end;
     Inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j-1);
end;

Use like this

  MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);

This code is tested in Delphi 2007, but should work in XE as well due to the explicit use of AnsiString and WideString.

[edit] Code is ok. Highlighter fails.

GolezTrol
  • Yes, except he wants to convert 'u\u0308berhaupt' to 'überhaupt'. – Roddy Nov 18 '10 at 12:55
  • That's true, I didn't read the question well. This code merely converts the C-like notation to 'real' characters. After this, you should still use NormalizeString to, well, normalize the string. That way you can achieve the desired conversion. – GolezTrol Nov 18 '10 at 13:12
  • The "\" character is escaped as "\\", easy to handle. Thank you, combining your parser with NormalizeString should solve the problem. – Erwin Jurschitza Nov 18 '10 at 15:04
  • Small changes made to the parser in my answer: Don't go beyond the string length with s[i+1], convert the 4 chars as a hex string and unescape the backslash itself. – Erwin Jurschitza Nov 18 '10 at 16:16
0

GolezTrol, you forget '$'

if (s[i+1] <> 'u') or not TryStrToInt('$'+Copy(s, i+2, 4), c) then
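The reason the prefix matters: without '$', TryStrToInt reads the four digits as a decimal number and rejects anything containing A to F. That is also why the '\u0252' in the usage example above only yields ü when the digits are read as decimal 252. A small sketch to show the difference (a standalone console program, not part of the parser):

```pascal
program HexPrefixDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  c: Integer;
begin
  // Without '$' the digits are read as decimal
  if TryStrToInt('0308', c) then
    Writeln(c);                      // 308
  // With '$' they are read as hexadecimal
  if TryStrToInt('$0308', c) then
    Writeln(c);                      // 776, i.e. U+0308
  // Hex digits A-F are rejected without the prefix
  Writeln(TryStrToInt('00FC', c));   // FALSE
  Writeln(TryStrToInt('$00FC', c));  // TRUE, c is now 252 (U+00FC)
end.
```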
good
0

If I'm not mistaken, Delphi XE now supports regular expressions. I don't use them often, but they seem like a good way to parse the string and replace all escaped values. Maybe someone has a good example of how to do this in Delphi with regular expressions?
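For what it's worth, here is a rough sketch of how that could look, assuming the RegularExpressions unit that ships with Delphi XE. The names TEscapeReplacer and UnescapeRegex are made up for illustration. Note that, unlike the hand-written parsers in the other answers, this version silently leaves malformed escapes in place instead of raising an exception, and the result still has to go through NormalizeString afterwards.

```pascal
program RegexUnescapeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, RegularExpressions;

type
  // TMatchEvaluator is declared "of object", so the callback must be a method
  TEscapeReplacer = class
  public
    function Evaluate(const Match: TMatch): string;
  end;

function TEscapeReplacer.Evaluate(const Match: TMatch): string;
begin
  if Match.Value = '\\' then
    Result := '\'  // escaped backslash
  else
    // Skip the leading '\u' and read the four hex digits
    Result := WideChar(StrToInt('$' + Copy(Match.Value, 3, 4)));
end;

// Hypothetical helper: replaces \\ and \uXXXX escapes in s
function UnescapeRegex(const s: string): string;
var
  replacer: TEscapeReplacer;
begin
  replacer := TEscapeReplacer.Create;
  try
    // Match either an escaped backslash or \u followed by four hex digits
    Result := TRegEx.Replace(s, '\\\\|\\u[0-9A-Fa-f]{4}', replacer.Evaluate);
  finally
    replacer.Free;
  end;
end;

begin
  // Prints 6: F, u, U+0308, r, s, t (unescaped, but not yet normalized)
  Writeln(Length(UnescapeRegex('Fu\u0308rst')));
end.
```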

Wim ten Brink