I want to develop a hex-dump-view and have problems with characters which are not printable in the current active ANSI codepage (CP_ACP). How do I detect them and print a dot instead?

My function currently looks like this:

```delphi
function HexChar(j: byte): AnsiChar;
begin
  if j < $20 then result := '.'

  // Dirty workaround which only supports the undefined characters of Windows-1252
  else if (GetACP=1252) and ((j=$81) or (j=$8D) or (j=$8F) or (j=$90) or (j=$9D)) then result := '.'

  else result := AnsiChar(j);
end;
```

Using Delphi XE4 and the font Courier New, the characters $81, $8D, $8F, $90 and $9D are invisible. GetACP returns 1252, so I am using Windows-1252. According to Wikipedia, these code points are not defined in Windows-1252. How can I check whether the character with ordinal value j is defined in the currently active code page?

Daniel Marschall
  • You are going to need to define your character set. Right off the bat your code commits serious abuses. `Char` is a two byte UTF-16 character. Which is not what you want. For a hex editor you want to use ASCII or perhaps one of the ANSI code pages. You need to make some decisions in that regard. Two byte `Char` won't help at all. – David Heffernan Jun 24 '14 at 11:48
  • I want an ANSI dump. I thought `Char` was OK, because the ANSI character would be automatically mapped to Unicode. `HexChar` will be called by the `HexDump` function, which will build the human-readable column on the right using `s := s + HexChar(x)`. – Daniel Marschall Jun 24 '14 at 11:52
  • There are many ANSI code pages. Which one do you want? And why would you store 8 bit data in a 16 bit type. Do be aware that `Chr(j)` does not convert from ANSI to Unicode as you think. It yields a UTF-16 character element with ordinal value `j`. – David Heffernan Jun 24 '14 at 11:57
  • I would like to use the ANSI charset which is active on the user's system (CP_ACP), so he/she will see the output exactly as known from the majority of hex editors. I will use `AnsiChar(j)` now. – Daniel Marschall Jun 24 '14 at 12:00
  • `Memo1.Text := AnsiChar($88)` yields a caret sign on my machine, with the memo's font face set to Courier New. How about you give us an SSCCE. – David Heffernan Jun 24 '14 at 12:04
  • Sorry, I had made an incorrect change in the code. Using `AnsiChar` does solve the first problem that the wrong codepage was used. I have edited the OP. Now the only problem is that $81, $8D, $8F, $90 and $9D are not defined in CP1252, and I would like to dynamically detect if the ordinal value `j` is defined in the user's ACP. – Daniel Marschall Jun 24 '14 at 12:13
  • You can use [`isprint`](http://msdn.microsoft.com/en-us/library/ewx8s4kw.aspx) function (or one from that family) imported in `System.Win.Crtl`. They are exactly intended for this purpose. If you would like to "dot" also spaces, use [`isgraph`](http://msdn.microsoft.com/en-us/library/wsfaff19.aspx) which is used to determine whether the char can be seen when you render it. – TLama Jun 24 '14 at 13:00

2 Answers

Call the GetStringTypeW function, which supports detailed character classification.

It's also possible to use GetStringTypeEx or the deprecated GetStringTypeA function, but according to MSDN both just call GetStringTypeW. Also, GetStringTypeEx hides the difference between the ANSI and Unicode versions and is recommended by MSDN for character type retrieval.
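A minimal sketch of how this could be applied to a single byte of the hex dump. The helper name `IsDefinedInAcp` is hypothetical, and the sketch assumes that `Winapi.Windows` declares `GetStringTypeExW` with a `PWideChar` source and an untyped `var` output; if your Delphi version declares it differently, adjust the call. The byte is first converted with the active ANSI code page, then classified with `CT_CTYPE1`.

```delphi
uses
  Winapi.Windows;

// Hypothetical helper, not part of the original answer.
// Returns True if byte j converts to a classified, non-control
// character in the active ANSI code page (CP_ACP).
function IsDefinedInAcp(j: Byte): Boolean;
var
  A: AnsiChar;
  W: WideChar;
  CharType: Word;
begin
  Result := False;
  A := AnsiChar(j);
  // Convert one byte to UTF-16. A lone DBCS lead byte cannot be
  // converted on its own and is therefore reported as not printable.
  if MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, @A, 1, @W, 1) <> 1 then
    Exit;
  CharType := 0;
  // Assumed declaration: GetStringTypeExW(Locale, dwInfoType, PWideChar, Integer, var lpCharType)
  if not GetStringTypeExW(LOCALE_USER_DEFAULT, CT_CTYPE1, @W, 1, CharType) then
    Exit;
  // A type of 0 means "no classification"; C1_CNTRL marks control characters.
  Result := (CharType <> 0) and ((CharType and C1_CNTRL) = 0);
end;
```

`HexChar` could then return `'.'` whenever `IsDefinedInAcp(j)` is False, instead of hard-coding the Windows-1252 gaps.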

Another possibility is to use the TCharacter.GetUnicodeCategory() method from character.pas.
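A rough sketch of the category-based check, assuming the byte has already been converted to a `WideChar` (for example with `MultiByteToWideChar` and `CP_ACP`). The function name `IsPrintableCategory` and the choice of which categories count as non-printable are illustrative assumptions, not part of the original answer. The enumeration in `System.Character` is scoped, so the values are qualified.

```delphi
uses
  System.Character;

// Hypothetical helper: treats control, format, unassigned, surrogate,
// private-use and line/paragraph separator code points as non-printable.
function IsPrintableCategory(W: WideChar): Boolean;
begin
  case TCharacter.GetUnicodeCategory(W) of
    TUnicodeCategory.ucControl,
    TUnicodeCategory.ucFormat,
    TUnicodeCategory.ucUnassigned,
    TUnicodeCategory.ucSurrogate,
    TUnicodeCategory.ucPrivateUse,
    TUnicodeCategory.ucLineSeparator,
    TUnicodeCategory.ucParagraphSeparator:
      Result := False;
  else
    Result := True;
  end;
end;
```

Newer Delphi versions may report `TCharacter` as deprecated in favor of the `Char` record helper; the category values stay the same.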

ThinkJet
  • Alas, `IsCharAlphaNumeric` does not print things like `^` or `~` either, since they are neither numeric nor alpha. – Daniel Marschall Jun 24 '14 at 12:54
  • Thanks for this hint. `GetStringType` seems to be very reliable. Here is my [code using `GetStringTypeW`](http://pastebin.com/2SpjQadb) as well as a version with the obsolete [`GetStringTypeA`](http://pastebin.com/SR25banY). I still have one problem left. On my computer, $98 maps to Unicode [$02DC](http://www.fileformat.info/info/unicode/char/2dc/index.htm). This tilde automatically merges with its neighbor, so the hex dump does not look good. Can this composition info also be queried using `GetStringTypeW`/`CT_CTYPE3`? – Daniel Marschall Jun 24 '14 at 14:47
  • Which function do you use to display the result string? I think there shouldn't be any problem if you output the original ANSI char used as input to `IsAnsiPrintable()` with an appropriate function ... – ThinkJet Jun 24 '14 at 15:37
  • I concatenate every character to a WideString which I display in a TMemo. – Daniel Marschall Jun 24 '14 at 15:49
  • Use `AnsiString` to concatenate the original ANSI characters and then assign it to the TMemo strings. – ThinkJet Jun 24 '14 at 16:02
  • I think there is no error in the code, but a problem with "Courier New". Only with this font (not with Courier, not with Arial), the following looks wrong: `a˜a` . It is enough to paste this in a memo with font "Courier New". The code which generated this was: `memo1.Text := string(AnsiString('a' + ansichar($98) + 'a'))` using CP1252 or directly `memo1.Text := 'a' + char($2dc) + 'a'` . Do I need to use another font (i.e. Courier instead of Courier New) or is there a workaround? – Daniel Marschall Jun 24 '14 at 17:56
  • (Actually, what I want to do is: find out if the resulting WideChar is a composition symbol, and if yes, pre-compose it with a space, so that the resulting WideChar is precomposed and therefore looks the same with each font, even if the font does not support complex Unicode, like Courier.) – Daniel Marschall Jun 24 '14 at 18:05
  • As part of the Unicode standard, all [combining characters](https://en.wikipedia.org/wiki/Combining_character) belong to several fixed groups. So you can test whether the original symbol is translated to one of these groups. I can't figure out the exact combination of result flags for calling `GetStringTypeW` which can detect combining characters. Maybe `C3_VOWELMARK` and `C3_DIACRITIC`, but I can't find any confirmation. It's a complex problem; just check [this Unicode FAQ](http://www.unicode.org/faq/char_combmark.html) – ThinkJet Jun 24 '14 at 19:44
  • According to `GetStringTypeW`, the codepoint `chr($2dc)` has `C3_DIACRITIC` . I already tried to use `MultiByteToWideChar` with `MB_PRECOMPOSE` to precompose the character, but it didn't work. Then I tried to check for `C3_DIACRITIC` and, if it was set, appended a whitespace after the character. But then many other characters which are not "zero width" also get a whitespace after them. Another question is: Why does the character $2dc have zero width in "Courier New" but full width in "Courier" and "Arial"? It seems to me like a "bug" in the font. Did someone forget to set the width of that character? – Daniel Marschall Jun 24 '14 at 19:54
  • `Courier New` is a Unicode font while `Courier` and `Arial` are ANSI. Note that `Arial` and `Arial Unicode` are different fonts. – ThinkJet Jun 24 '14 at 20:23
  • @ThinkJet I don't agree with your revert. `GetStringTypeW` is no good because the user has ANSI characters. How is he going to get a UTF-16 character? I removed my upvote. – David Heffernan Jun 24 '14 at 22:07
  • @rinntech FWIW, I think this has the potential to be a good answer, but you want `GetStringTypeA`, or perhaps `GetStringTypeExA`. – David Heffernan Jun 24 '14 at 22:08
  • @DavidHeffernan The MSDN writes "This function converts the source string to Unicode and calls the corresponding GetStringTypeW function. Thus the words in the output buffer correspond not to the original ANSI string but to its Unicode equivalent." Isn't this the same what I am doing with `s := WideString(AnsiChar(j));` before calling `GetStringTypeW` ? – Daniel Marschall Jun 25 '14 at 08:01
  • Same, but the conversion is done behind the scenes with the locale which Delphi uses by default, instead of specifying a concrete locale in the call to `GetStringTypeA`. More about Unicode support in modern Delphi versions can be found in [this document on the Embarcadero site](http://edn.embarcadero.com/print/images/38980/Delphi_and_Unicode.pdf). – ThinkJet Jun 25 '14 at 08:44

Use GetGlyphIndices with GGI_MARK_NONEXISTING_GLYPHS in order to check if a particular character exists in a font.

Here's an example:

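```delphi
// Requires Winapi.Windows, System.SysUtils (PWordArray) and Vcl.Graphics.
procedure ReplaceNonPrintableChars(var S: string);
var
  GlyphIndicesA: PWordArray;
  Len: Integer;
  I: Integer;
  Cnt: DWORD;
  DC: HDC;
  C: TCanvas;
begin
  Len := Length(S);
  if Len = 0 then
    Exit; // nothing to check
  DC := GetDC(0);
  try
    C := TCanvas.Create;
    try
      C.Handle := DC;
      C.Font.Name := 'Arial'; // use the font that will display the text
      GetMem(GlyphIndicesA, SizeOf(Word) * Len);
      try
        Cnt := GetGlyphIndices(C.Handle, PChar(S), Len, PWord(GlyphIndicesA), GGI_MARK_NONEXISTING_GLYPHS);
        if Cnt <> GDI_ERROR then
          for I := 0 to Integer(Cnt) - 1 do
            // $FFFF marks a character that has no glyph in the selected font
            if GlyphIndicesA^[I] = $FFFF then
              S[I + 1] := '.';
      finally
        FreeMem(GlyphIndicesA); // memory from GetMem must be released with FreeMem, not Dispose
      end;
    finally
      C.Free;
    end;
  finally
    ReleaseDC(0, DC);
  end;
end;
```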
Sebastian Z
  • Hm... how exactly do I use it? `var x: word; dc: hdc; begin dc := GetDc(Memo1.Handle); GetGlyphIndices(dc, PChar(Char(AnsiChar(j))), 1, pword(@x), GGI_MARK_NONEXISTING_GLYPHS)` always returns `GDI_ERROR` . – Daniel Marschall Jun 24 '14 at 13:22
  • I've added an example. – Sebastian Z Jun 24 '14 at 13:56
  • Your code detects whether a character is defined in a FONT. But the question was whether the character is defined in a CODEPAGE. – Elmue Jul 29 '14 at 00:49