
On Windows, if you insert a UTF-16 sequence containing surrogates into a RichEdit control, the control handles it well: each surrogate pair is displayed as a single character.

The difficulty I'm facing is that when I query the selection, I get the position in the UTF-16 stream (a code unit offset), not the character position counted in visible characters. I have a slow way to find the actual position, but it requires retrieving the text up to the selection as UTF-16 and then counting the actual characters myself.

Did I miss something? Is there anything more efficient than that?

Thanks,

Manu

PS: To query the selection I'm using the EM_EXGETSEL message to fill a CHARRANGE structure.

Emmanuel Stapf
  • Please show the code you are having trouble with. – Remy Lebeau Apr 09 '14 at 16:49
  • The code is quite simple: on a RICHEDIT control, you want to show the cursor position to the user as the n-th character position. Unfortunately, Windows returns the n-th code unit position in the UTF-16 stream. The question is whether there is an API/message for RICHEDIT that would give me this information without my having to calculate it. – Emmanuel Stapf Apr 10 '14 at 15:18
  • I would still like to see your code, including your "slow solution". Maybe there is a way to speed it up, if not replace it. Also, which OS version are you seeing this on? I have never seen `EM_(EX)GETSEL` return UTF-16 codeunit offsets before, but visible character offsets instead, just like it is documented to do. I will try to reproduce it. – Remy Lebeau Apr 10 '14 at 15:46
  • One thing that is annoying when you have to count this yourself is that you are forced to copy the text back and forth between the control and your application making it slower. It would be easier if one could avoid this by going through the control's own buffer. But maybe there is no other solution. – Emmanuel Stapf Apr 14 '14 at 21:53
  • I can confirm that `EM_EXGETSEL` is indeed retrieving UTF-16 offsets, not visual character offsets. Which does make a little bit of sense if you assume it is returning the selected characters from the underlying text, which is UTF-16. – Remy Lebeau Apr 14 '14 at 22:46
  • Even round-tripping through `EM_POSFROMCHAR` and `EM_CHARFROMPOS` still returns UTF-16 positions. – Remy Lebeau Apr 14 '14 at 22:54
  • Thanks for spending the time looking into the problem and confirming my analysis. If anyone knows a solution where we do not have to count offsets manually, feel free to share. – Emmanuel Stapf Apr 16 '14 at 00:08

1 Answer


The problem is real, and it is only going to become more frequent. A single UTF-16 code unit can address only 64K code points, and Unicode already defines well over that number.

What you will see is a pair of 16-bit code units (a surrogate pair) that displays as a single character. Under the current standard, a code point never takes more than two code units.

In .NET there are dedicated functions that do this work for you. I am not aware of any in the Windows API. You can process the text yourself with functions built on the macros IS_HIGH_SURROGATE, IS_LOW_SURROGATE, and IS_SURROGATE_PAIR. I see no reason they should be any slower than built-in functions, but you have to write them (unless you can find some source code somewhere).
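A minimal sketch of such a helper, assuming the text up to the selection has already been copied out of the control into a UTF-16 buffer. The function names `utf16_offset_to_char_index` and `char_index_to_utf16_offset` are hypothetical, and the two small predicates are portable stand-ins for the `IS_HIGH_SURROGATE` / `IS_LOW_SURROGATE` macros so that the code compiles without the Windows headers. It counts code points only; it does not try to account for combining marks or for however the control treats line breaks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-ins for IS_HIGH_SURROGATE / IS_LOW_SURROGATE. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Size in code units (1 or 2) of the code point starting at text[i].
 * An unpaired surrogate is treated as a single character. */
static size_t unit_count_at(const uint16_t *text, size_t len, size_t i)
{
    if (is_high_surrogate(text[i]) && i + 1 < len && is_low_surrogate(text[i + 1]))
        return 2;
    return 1;
}

/* Hypothetical helper: converts a UTF-16 code unit offset (such as the
 * cpMin of a CHARRANGE) into a count of code points before that offset.
 * An offset that falls in the middle of a surrogate pair counts the pair. */
size_t utf16_offset_to_char_index(const uint16_t *text, size_t len, size_t offset)
{
    size_t i = 0, chars = 0;
    assert(text != NULL || len == 0);
    if (offset > len)
        offset = len;
    while (i < offset) {
        i += unit_count_at(text, len, i);
        chars++;
    }
    return chars;
}

/* Hypothetical inverse: converts a code point index back into a UTF-16
 * code unit offset. Indices past the end are clamped to len. */
size_t char_index_to_utf16_offset(const uint16_t *text, size_t len, size_t index)
{
    size_t i = 0;
    assert(text != NULL || len == 0);
    while (index > 0 && i < len) {
        i += unit_count_at(text, len, i);
        index--;
    }
    return i;
}
```

Both functions make a single linear pass over the buffer, so the cost is dominated by copying the text out of the control rather than by the counting itself.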

This article may be helpful: "Are UTF16 (as used by for example wide-winapi functions) characters always 2 byte long?"

david.pfx
  • -1 This does not address Emmanuel's question about mapping between UTF-16 offsets and RichEdit character positions. – Remy Lebeau Apr 09 '14 at 16:48