I have a Delphi 7 application where I deal with ANSI strings and I need to count their number of characters (as opposed to the number of bytes). I always know the Charset (and thus the code page) associated with the string.

So, knowing the Charset (code page), I'm currently using MultiByteToWideChar to get the number of characters. It's useful when the Charset is one of the Chinese, Korean, or Japanese charsets where most of the characters are 2 bytes in length and simply using the Length function won't give me what I want.
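For reference, the counting approach described above might look roughly like the sketch below. The function name `AnsiCharCount` is hypothetical; the call passes a nil output buffer with size 0, so `MultiByteToWideChar` only reports how many WideChar values the conversion would need.

```pascal
uses
  Windows;

// Hypothetical helper: number of WideChar values needed to represent S
// when S is interpreted in the given ANSI code page.
function AnsiCharCount(const S: AnsiString; CodePage: UINT): Integer;
begin
  if S = '' then
    Result := 0  // a zero-length input is rejected by the API, so handle it here
  else
    // dwFlags = 0 leaves the default (precomposed) behaviour in place;
    // an output size of 0 asks only for the required length.
    Result := MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
end;
```

Because the source length is passed explicitly, the result does not include a terminating null.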

However, it still counts composite characters as two characters, and I need them counted as one. Some composite characters have precomposed versions in Unicode, and those are counted correctly as one character since MB_PRECOMPOSED is used by default. But many characters simply don't exist in precomposed form, for example characters in Hebrew, Arabic, Thai, etc., and those are counted as two.

So the question really is: how can I count composite characters as single characters? I don't mind converting the ANSI strings to wide strings to count the characters, since I'm already doing that with MultiByteToWideChar anyway.
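A minimal sketch of that conversion step, assuming the usual two-call pattern (first ask for the required length, then convert into a buffer of that size). The name `AnsiToWide` is hypothetical and is used here only for illustration.

```pascal
uses
  Windows;

// Hypothetical helper: convert an ANSI string in the given code page
// to a WideString (UTF-16).
function AnsiToWide(const S: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // First call: query the required number of WideChar values.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then
    Exit;  // conversion failed; return an empty string in this sketch
  SetLength(Result, Len);
  // Second call: perform the conversion into the allocated buffer.
  MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;
```

Returning an empty string on failure is a simplification; real code would probably want to inspect `GetLastError`.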

jedivader

1 Answer

You can count the Unicode code points like this:

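function CodePointCount(P: PWideChar): Integer;
var
  Count: Integer;
begin
  // Count is kept in half code points: each half of a surrogate pair adds 1,
  // every other code unit adds 2, so dividing by 2 yields whole code points.
  // This assumes well-formed UTF-16 (no unpaired surrogates).
  Count := 0;
  while Word(P^)<>0 do
  begin
    if (Word(P^)>=$D800) and (Word(P^)<=$DFFF) then
      // part of surrogate pair
      inc(Count)
    else
      inc(Count, 2);
    inc(P);
  end;
  Result := Count div 2;
end;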

This covers an issue that you did not mention, namely that UTF-16 is a variable-width encoding.

However, this will not tell you the number of glyphs represented by a UTF-16 string, because some code points are combining characters. These combine with their neighbours to form a single equivalent character, so multiple code points can produce a single glyph. More information can be found here: http://en.wikipedia.org/wiki/Unicode_equivalence

This is the harder issue. To solve it, your code needs to fully understand the meaning of each Unicode code point: is it a combining character, and how does it combine? Really you need a dedicated Unicode library, such as ICU.
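Where a full library is not an option, one much cruder Windows-only approximation is to ask `GetStringTypeW` for the `CT_CTYPE3` classification of each WideChar and skip those flagged `C3_NONSPACING`. This is only a sketch of that heuristic, not proper grapheme segmentation: it relies on the character tables of the Windows version it runs on, it counts each half of a surrogate pair separately, and it ignores spacing combining marks and other clustering rules. The names `GetStringTypeW_` and `ApproxVisibleCharCount` are hypothetical. The API is imported locally with plain pointer parameters so the sketch does not depend on how a particular `Windows.pas` declares it, and `CT_CTYPE3` and `C3_NONSPACING` are assumed to be declared there (their Win32 values are 4 and 1).

```pascal
uses
  Windows;

// Local import of the Win32 GetStringTypeW under a hypothetical name.
function GetStringTypeW_(dwInfoType: DWORD; lpSrcStr: PWideChar;
  cchSrc: Integer; lpCharType: PWord): BOOL; stdcall;
  external 'kernel32.dll' name 'GetStringTypeW';

// Hypothetical helper: counts the WideChars in S that Windows does not
// classify as nonspacing marks. An approximation only.
function ApproxVisibleCharCount(const S: WideString): Integer;
var
  Types: array of Word;
  I: Integer;
begin
  Result := 0;
  if S = '' then
    Exit;
  SetLength(Types, Length(S));
  if not GetStringTypeW_(CT_CTYPE3, PWideChar(S), Length(S), @Types[0]) then
  begin
    // Classification failed: fall back to the plain WideChar count.
    Result := Length(S);
    Exit;
  end;
  for I := 0 to High(Types) do
    // Nonspacing marks attach to the preceding base character,
    // so they are not counted on their own.
    if (Types[I] and C3_NONSPACING) = 0 then
      Inc(Result);
end;
```

The input is assumed to have been converted to a WideString already. For text that stays within the Basic Multilingual Plane, the surrogate limitation does not matter.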

The other suggestion I have for you is to give up using ANSI code pages. If you really care about internationalisation then you need to use Unicode.

David Heffernan
  • `MultiByteToWideChar` already covers the fact that UTF-16 is a variable-length encoding and returns the same result as your function. I was hoping there might be another api function that returns the actual number of glyphs, taking into consideration combining characters. Yeah, I know I have to migrate to Unicode, but that will take a lot of time, so I need temporary solutions until then. The question really becomes: **Is there a lightweight Unicode library that is compatible with Delphi 7 and has a straightforward function for getting the number of glyphs?** Perhaps the Soft Gems one? – jedivader Feb 22 '14 at 16:22
  • Use ICU. And no, MultiByteToWideChar returns the number of code points. – David Heffernan Feb 22 '14 at 16:24
  • Then I'm a little bit confused about `MultiByteToWideChar`. The [documentation](http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx) says "If this value is 0, the function returns the required buffer size, in **characters**". And if I give it Chinese characters, they are counted correctly. What am I missing here? – jedivader Feb 22 '14 at 20:17
  • I misspoke before. When I said code points I should have said character elements. That is the number of WideChar values. That's what MSDN means when it says characters. – David Heffernan Feb 22 '14 at 20:46
  • Okay, I got it. The reason why it seems to me that your function doesn't return anything different than `MultiByteToWideChar` (and in my situation it really doesn't) is that I always feed it with ANSI strings (that's what the question is about), and they always fall in the [Basic Multilingual Plane](http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane) which always uses a single WideChar (Word, 2-bytes), never surrogate pairs. So I'm essentially using UCS-2, not UTF-16. – jedivader Feb 23 '14 at 10:42