How to work with substrings / delete character when content is mixed single and double byte?

Question

I have an application, unfortunately written in Delphi 6, that lets the user extract substrings and delete characters from input. The operator sees the text in a TLabel; this is deliberate, to prevent direct and incorrect input.

This works fine for English text. But sometimes the text is mixed single- and double-byte content (ANSI-encoded). I set the charset, and the display works fine without modification.

One trivial example could be:

月123中

The string in hex is 8C 8E | 31 | 32 | 33 | 92 86 (split on characters for clarity)

But once modification starts, things go wrong. I know why: the standard StrLeft / StrRight / SubString functions work on single bytes, and once I remove one byte of a double-byte character, the text is rendered as garbage.

Do I have to go through a WideString conversion and back every time I modify something? Or are there native libraries / functions in older Delphi versions to do what I need?
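
For illustration, the "convert, modify, convert back" option can be sketched as follows. This is a minimal sketch in Python, used only to show the logic. In Delphi 6 the same round trip would presumably go through WideString and the Win32 MultiByteToWideChar / WideCharToMultiByte functions. Code page 932 (Shift-JIS) is an assumption inferred from the sample bytes, and the helper names are hypothetical.

```python
# Assumption: the sample bytes are Shift-JIS (Windows code page 932).
CODEPAGE = "cp932"


def left_chars(raw: bytes, count: int) -> bytes:
    """Keep the first `count` characters (hypothetical helper)."""
    text = raw.decode(CODEPAGE)            # bytes -> Unicode text
    return text[:max(0, count)].encode(CODEPAGE)  # slice by character, then re-encode


def delete_chars_right(raw: bytes, count: int) -> bytes:
    """Drop the last `count` characters (hypothetical helper)."""
    text = raw.decode(CODEPAGE)
    keep = max(0, len(text) - max(0, count))
    return text[:keep].encode(CODEPAGE)


# The sample string from the question: 8C 8E | 31 | 32 | 33 | 92 86
content = bytes([0x8C, 0x8E, 0x31, 0x32, 0x33, 0x92, 0x86])

print(len(content.decode(CODEPAGE)))        # number of visible characters
print(left_chars(content, 2).hex(" "))      # bytes of the first two characters
print(delete_chars_right(content, 1).hex(" "))
```

The cost of this approach is one decode and one encode per modification, which is the overhead the question is trying to avoid.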

Something like this, in pseudocode:

mixedStringLength(content)             = 5        // 5 characters visible
mixedStringLeft(content,2)             = "月1"    // actually the bytes 8C  8E  31
mixedStringDeleteCharsRight(content,1) = "月123"  // actually deletes 2 bytes
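
The three functions above could also work directly on the bytes, without any conversion, by first locating the character boundaries. The following is a minimal sketch in Python, used only to show the logic. It assumes the data is Shift-JIS (code page 932), where lead bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC. The function names mirror the pseudocode and are hypothetical. In a Delphi port, the hard-coded range check could be replaced by the Win32 IsDBCSLeadByteEx function for the chosen code page.

```python
from typing import List


def is_lead_byte(b: int) -> bool:
    # Assumption: Shift-JIS / code page 932 lead-byte ranges.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def char_starts(raw: bytes) -> List[int]:
    """Return the byte offset at which each character begins."""
    starts = []
    i = 0
    while i < len(raw):
        starts.append(i)
        # A lead byte followed by another byte forms one 2-byte character;
        # a dangling lead byte at the end is counted as a single byte.
        i += 2 if is_lead_byte(raw[i]) and i + 1 < len(raw) else 1
    return starts


def mixed_string_length(raw: bytes) -> int:
    return len(char_starts(raw))


def mixed_string_left(raw: bytes, count: int) -> bytes:
    starts = char_starts(raw)
    if count <= 0:
        return b""
    if count >= len(starts):
        return raw
    return raw[:starts[count]]  # cut at the start of the next character


def mixed_string_delete_chars_right(raw: bytes, count: int) -> bytes:
    starts = char_starts(raw)
    if count <= 0:
        return raw
    if count >= len(starts):
        return b""
    return raw[:starts[len(starts) - count]]


content = bytes([0x8C, 0x8E, 0x31, 0x32, 0x33, 0x92, 0x86])
print(mixed_string_length(content))
print(mixed_string_left(content, 2).hex(" "))
print(mixed_string_delete_chars_right(content, 1).hex(" "))
```

Scanning always has to start from the beginning of the string, because a byte in the middle cannot be classified as a lead or trail byte without knowing what came before it.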

As the string in question is used in many other places, and the project is large, it's not as trivial as just using TNT libraries, or switching over to a newer Delphi version.

UTF8 or UTF16 or UTF32 make this much simpler. The world has moved on over a decade ago. You should too. — David Heffernan, Apr 13 '22 at 07:49
@DavidHeffernan this is a valid point. But as you may know, if the project is large enough and legacy, the effort of moving to UTF16 is bigger than trying to find a solution that works now. There are plans to move on. Your statement is correct, but telling someone that running is better than crawling does not help them learn how to crawl. — MyICQ, Apr 13 '22 at 10:56
UTF-16 is also not guaranteed to be 2 bytes per character. No, D6 has no native MBCS support - that's why entire libraries exist for single codepages. You either have to re-invent dealing with the chosen **codepage** or recode everything to UTF-32 so you can safely deal with always 4 bytes per character. — AmigoJack, Apr 13 '22 at 19:49
@MyICQ as you discovered, you can't break up a multi-byte string on arbitrary bytes. You have to parse the bytes in relation to the charset they are encoded in to discover logical character boundaries where you can make your breaks. There are plenty of Unicode libraries that can handle this, but nothing built in to Delphi itself. At the very least, look at the Win32 [`CharNextExA()`](https://docs.microsoft.com/en-us/windows/win32/api/winuser/nf-winuser-charnextexa) function, which allows you to iterate through a multi-byte string char-by-char based on a codepage, rather than byte-by-byte. — Remy Lebeau, Apr 13 '22 at 20:23

0 Answers