Anything odd about Chinese unicode characters 稍 and 稊 that would affect KDiff3?

Question

I have reported a bug and entered a support request at the KDiff3 site (https://sourceforge.net/p/kdiff3/bugs/198/), but I wonder whether anyone can quickly shed some light on a behavior I'm seeing and help me understand why such a bug might exist, for instance whether there is anything unusual about these Unicode characters.

When I merge two identical files containing the character 稍 using KDiff3 version 0.9.98, it reads the character as 稊 and shows that character in all the panes of the merge. The output then contains that character instead of 稍.

I've observed this behavior with UCS-2 Little Endian encoding in version 0.9.98 of KDiff3, but not with UTF-8 encoding, and not with the version of KDiff3 that comes with TortoiseHg. I can also reproduce the problem in 0.9.96 and 0.9.97; TortoiseHg's KDiff3 reports that it is version 0.9.96a, yet it does not exhibit the problem.

Edit: I vaguely suspect the source of the problem to be somewhere in the Qt library. So any information about what Qt does especially in regard to handling international text might be useful.

I find it a curious coincidence that the two characters end in `0d` and `0a`, which are the ASCII return and linefeed codes. Their UTF-8 representations also end in `8d` and `8a`, which are those same codes with the high bit set. This leads me to believe the error has something to do with line ending conversion. — Mark Ransom, Jan 07 '15 at 20:29
I did also notice that KDiff3 reports an odd error about inconsistent line endings when attempting to perform this test merge despite the fact that there are no line endings. — BlueMonkMN, Jan 07 '15 at 20:42
@MarkRansom, good observation! You should put that as the answer. — Mark Tolonen, Jan 08 '15 at 06:07
@MarkTolonen thanks for the suggestion. I did that and expanded on the explanation. — Mark Ransom, Jan 08 '15 at 17:40

score 1 · Accepted Answer · answered Jan 08 '15 at 17:39

Utilities that process text files need to break the text into characters to operate effectively. The simplest possible process is to treat each 8-bit byte as a single character. Unfortunately this doesn't work well with UTF-16 or UCS-2 input, since each byte is only half of the character.

The character you're having problems with is 稍 (U+7a0d), which is being converted to 稊 (U+7a0a). When you break those down into little-endian bytes, you get 0x0d, 0x7a and 0x0a, 0x7a respectively. The 8-bit character 0x0d is the ASCII code for Return, and 0x0a is the code for Linefeed. It seems that KDiff3 is interpreting these bytes as line endings, and substituting a Linefeed when it encounters a Return. This is consistent with your report of an error message indicating inconsistent line endings in the file.
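The byte breakdown above can be reproduced with a short Python sketch. The byte-wise replacement below is only a guess at the kind of mistake involved, written for illustration; it is not KDiff3's actual code, and the variable names are made up.

```python
# Hypothetical illustration of the suspected failure mode; not KDiff3's code.
original = "\u7a0d"  # 稍

# UCS-2 / UTF-16 little-endian puts the low byte first.
raw = original.encode("utf-16-le")
print(raw.hex(" "))  # 0d 7a

# A naive line-ending normalisation that works on single bytes
# treats the low byte 0x0d as a Return and turns it into a Linefeed.
damaged = raw.replace(b"\r", b"\n")
print(damaged.hex(" "))  # 0a 7a

# Decoding the damaged bytes yields a different character.
result = damaged.decode("utf-16-le")
print(result, f"U+{ord(result):04X}")  # 稊 U+7A0A
```

A tool that decodes the bytes to characters first and only then looks for line endings would not hit this, because U+7a0d as a whole is not a Return.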

When working with Unicode it is often better to use UTF-8 encoding. The characters above U+007f will still take up more than one byte, but each of those bytes will have a value of 0x80 or greater and cannot accidentally be mistaken for one of the ASCII characters. For example 稍 becomes 0xe7, 0xa8, 0x8d.
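As a quick check of that claim, the following Python sketch encodes 稍 as UTF-8 and applies the same kind of naive byte-wise Return-to-Linefeed replacement. The replacement step is again an assumed stand-in for the bug, not taken from KDiff3.

```python
# Hypothetical check; the byte-wise replace stands in for the suspected bug.
text = "\u7a0d"  # 稍

utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))  # e7 a8 8d

# Every byte of a multi-byte UTF-8 sequence is 0x80 or greater,
# so none of them can collide with ASCII Return (0x0d) or Linefeed (0x0a).
all_high = all(b >= 0x80 for b in utf8_bytes)
print(all_high)  # True

# The naive replacement therefore leaves the character untouched.
round_trip = utf8_bytes.replace(b"\r", b"\n").decode("utf-8")
print(round_trip == text)  # True
```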

I have confirmation from the KDiff3 developer (visible at the link provided) that it is indeed a bug with handling of line endings, which he intends to fix soon. I generally prefer UTF-8 too, but in this case, we are dealing with a specifically Chinese file that is kind of large, and under these conditions, as I understand it, a 16-bit Unicode encoding such as UCS-2 is generally a bit more efficient due to the smaller file size and simpler character boundaries. Performance isn't a huge concern, but I also have to consider other code using this file and whether it can handle UTF-8. For now I'll use KDiff3 0.9.95. — BlueMonkMN, Jan 08 '15 at 19:58
