Delphi XE AnsiStrings with escaped combining diacritical marks

Question

What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a friendly WideString "Fürst"?

I am aware that this is not possible for every combination, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Character unit, but I can't figure it out.

As far as I know, that's not any standard string format, so you'll have to decode it yourself. Which part are you having trouble with: decoding the escaped characters, or finding the corresponding combined character? – Rob Kennedy, Nov 18 '10 at 14:54
Decoding the escaped characters is trivial; finding the corresponding combined character is the problem. But it looks like the WinAPI call NormalizeString, as suggested by Roddy, pointed me in the right direction. – Erwin Jurschitza, Nov 18 '10 at 14:59

Roddy · Accepted Answer · 2010-11-18T12:52:17.130

4

I think you need to perform Unicode normalization on your string.

I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:

NormalizationKC

Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature ﬁ becomes f + i; similarly, A + ¨ + ﬁ + n becomes Ä + f + i + n.
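The difference between canonical composition (form C) and compatibility composition (form KC) can be sketched with a small console program. This is only a sketch under a few assumptions: NormalizeString is imported by hand from Normaliz.dll, the constants use the values of the NORM_FORM enumeration in the Windows SDK, and NormalizeWith is a hypothetical helper name that is not part of the RTL. The first call passes a zero-length destination to ask the API for an estimated buffer size; since that value is only an estimate, production code may want to retry when the second call reports an insufficient buffer.

```delphi
program NormFormsDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  // Values of the NORM_FORM enumeration in the Windows SDK
  NormalizationC  = $1;  // canonical composition
  NormalizationKC = $5;  // compatibility composition

function NormalizeString(NormForm: Integer; lpSrcString: PWideChar; cwSrcLength: Integer;
  lpDstString: PWideChar; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper: normalizes S with the given form and raises on failure.
function NormalizeWith(NormForm: Integer; const S: UnicodeString): UnicodeString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // A zero-length destination asks the API for an estimate of the required size
  Len := NormalizeString(NormForm, PWideChar(S), Length(S), nil, 0);
  if Len <= 0 then
    RaiseLastOSError;
  SetLength(Result, Len);
  Len := NormalizeString(NormForm, PWideChar(S), Length(S), PWideChar(Result), Length(Result));
  if Len <= 0 then
    RaiseLastOSError;
  SetLength(Result, Len);  // trim to the length actually written
end;

const
  Decomposed = 'Fu'#$0308'rst';  // u followed by COMBINING DIAERESIS
  Ligature   = #$FB01'n';        // LATIN SMALL LIGATURE FI followed by n

begin
  // Lengths are printed instead of the text, because the console may not show Unicode
  Writeln(Length(NormalizeWith(NormalizationC, Decomposed)));
  Writeln(Length(NormalizeWith(NormalizationC, Ligature)));
  Writeln(Length(NormalizeWith(NormalizationKC, Ligature)));
end.
```

On a system where Normaliz.dll is available, this should print 5, 2 and 3: form C merges u and the combining diaeresis into ü but leaves the ligature alone, while form KC additionally replaces the ligature with f + i. For the question as asked, form C is therefore enough unless compatibility characters should be folded as well.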

edited Nov 18 '10 at 12:52

answered Nov 18 '10 at 12:35

Roddy


Thank you very much, I will take a look at the NormalizeString function. – Erwin Jurschitza Nov 18 '10 at 15:00

Erwin Jurschitza · Answer 2 · 2010-11-19T18:11:08.483

Here is the complete code that solved my problem:

function Unescape(const s: AnsiString): string;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1;
  j := 1;
  while i <= Length(s) do begin
    if s[i] = '\' then begin
      if i < Length(s) then begin
        // escaped backslash?
        if s[i + 1] = '\' then begin
          Result[j] := '\';
          inc(i, 2);
        end
        // convert hex number to WideChar
        else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s)) 
                and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then begin
          inc(i, 6);
          Result[j] := WideChar(c);
        end else begin
          raise Exception.CreateFmt('Invalid code at position %d', [i]);
        end;
      end else begin
        raise Exception.Create('Unexpected end of string');
      end;
    end else begin
      Result[j] := WideChar(s[i]);
      inc(i);
    end;
    inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j - 1);
end;

const
  NormalizationC = 1;

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
 lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

function Normalize(const s: string): string;
var
  newLength: integer;
begin
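  Result := '';
  if s = '' then
    Exit;
  // For typical Latin text the NormalizationC result is no longer than the input, so
  // the input length is used as the buffer size. Rare characters can expand under
  // this form; in that case NormalizeString fails and returns a value <= 0.
  SetLength(Result, Length(s));
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
  if newLength <= 0 then
    RaiseLastOSError; // declared in SysUtils
  SetLength(Result, newLength);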
end;

function UnescapeAndNormalize(const s: AnsiString): string;
begin
  Result := Normalize(Unescape(s));
end;
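Because a console or message box may render the decomposed and the precomposed form identically, it can help to inspect the result as UTF-16 code units. The following is a minimal sketch; CodeUnits is a hypothetical helper name, not an RTL function.

```delphi
uses
  SysUtils;

// Hypothetical helper: lists the UTF-16 code units of S as 'U+XXXX' tokens.
function CodeUnits(const S: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    // Ord gives the numeric value of the code unit; IntToHex pads it to 4 digits
    Result := Result + 'U+' + IntToHex(Ord(S[i]), 4);
  end;
end;
```

For the decomposed literal 'Fu'#$0308'rst' this yields 'U+0046 U+0075 U+0308 U+0072 U+0073 U+0074'. If the functions above work as intended, CodeUnits(UnescapeAndNormalize('Fu\u0308rst')) should show U+00FC in place of the U+0075 U+0308 pair.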

Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)

GolezTrol · Answer 3 · 2010-11-18T14:16:29.040

1

Are they always escaped like this, always with exactly 4 digits?

How is the \ character itself escaped?

Assuming the \ character is itself escaped as \uxxxx, where xxxx is the code for the \ character, you can easily loop through the string:

function Unescape(s: AnsiString): WideString;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1; j := 1;
  while i <= Length(s) do
  begin
     // If a '\' is found, typecast the following 4 digit integer to widechar
     if s[i] = '\' then
     begin
       if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
         raise Exception.CreateFmt('Invalid code at position %d', [i]);

       Inc(i, 6);
       Result[j] := WideChar(c);
     end
     else
     begin
       Result[j] := WideChar(s[i]);
       Inc(i);
     end;
     Inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j-1);
end;

Use it like this:

  MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);

This code was tested in Delphi 2007, but it should work in XE as well, thanks to the explicit use of AnsiString and WideString.

[edit] Code is ok. Highlighter fails.

edited Nov 18 '10 at 14:16

answered Nov 18 '10 at 12:45

GolezTrol


Yes, except he wants to convert 'u\u0308berhaupt' to 'überhaupt'. – Roddy Nov 18 '10 at 12:55
That's true, I didn't read the question well. This code merely converts the C-like notation to 'real' characters. After this, you should still use NormalizeString to, well, normalize the string. That way you can achieve the desired conversion. – GolezTrol Nov 18 '10 at 13:12
The "\" character is escaped as "\\", easy to handle. Thank you, combining your parser with NormalizeString should solve the problem. – Erwin Jurschitza Nov 18 '10 at 15:04
Small changes made to the parser in my answer: Don't go beyond the string length with s[i+1], convert the 4 chars as a hex string and unescape the backslash itself. – Erwin Jurschitza Nov 18 '10 at 16:16

score 0 · Answer 4 · answered Jul 08 '13 at 19:06

0

GolezTrol, you forgot the '$' prefix, which makes TryStrToInt parse the four digits as hexadecimal:

if (s[i+1] <> 'u') or not TryStrToInt('$'+Copy(s, i+2, 4), c) then

answered Jul 08 '13 at 19:06

good


2

This should have been posted as a comment on @GolezTrol's answer, not as an answer of its own. – Zoë Peterson Jul 08 '13 at 20:53

score 0 · Answer 5 · answered Nov 18 '10 at 15:54

If I'm not mistaken, Delphi XE now supports regular expressions. I don't use them often, but they seem like a good way to parse the string and replace all escaped values. Maybe someone has a good example of how to do this in Delphi with regular expressions?
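One possible shape for such a regular-expression approach, as a sketch only: it assumes the TRegEx record from the RegularExpressions unit that ships with Delphi XE and its Replace overload that takes a TMatchEvaluator. TEscapeDecoder and UnescapeRegEx are hypothetical names. Unlike the hand-written parsers in the other answers, this version silently leaves malformed escapes untouched instead of raising an exception, and it takes a string, so an AnsiString input would need a string(...) conversion first.

```delphi
program RegExUnescapeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, RegularExpressions;

type
  // Hypothetical helper class: TMatchEvaluator is a method pointer, so the
  // callback has to be a method rather than a plain function.
  TEscapeDecoder = class
  public
    function Decode(const Match: TMatch): string;
  end;

function TEscapeDecoder.Decode(const Match: TMatch): string;
begin
  if Match.Value = '\\' then
    Result := '\'  // escaped backslash
  else
    // Skip the leading '\u' and read the four hex digits that follow
    Result := WideChar(StrToInt('$' + Copy(Match.Value, 3, 4)));
end;

function UnescapeRegEx(const S: string): string;
var
  Decoder: TEscapeDecoder;
begin
  Decoder := TEscapeDecoder.Create;
  try
    // Matches either an escaped backslash or \u followed by four hex digits
    Result := TRegEx.Replace(S, '\\\\|\\u[0-9A-Fa-f]{4}', Decoder.Decode);
  finally
    Decoder.Free;
  end;
end;

begin
  // Prints the number of UTF-16 code units in the unescaped string
  Writeln(Length(UnescapeRegEx('Fu\u0308rst')));
end.
```

This should print 6, because the escape sequence becomes a single combining character that still follows the u. Normalization is a separate step, so the result would still have to be passed to NormalizeString to obtain the precomposed ü.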
