I have a UTF-16 encoded string, theUFT16string. It contains double-byte characters. I would like to iterate through it one Unicode character at a time. As I understand it, chunk expressions work on single-byte characters. Is that right?

An example

We have the following string:

   abcαβɣ

We want to iterate through it and put each character on a line of its own in another container.

z--
  • By "Unicode character", are you referring to the encoded UTF-16 **codeunits** or the decoded Unicode **codepoints** that the codeunits represent? It makes a big difference. And no, UTF-16 does **not** use single-byte codeunits. UTF-8 does. – Remy Lebeau Apr 20 '13 at 14:41
  • Actually, UTF8 uses anything between 1 and 4 byte code units. UTF16 has exceptions too and that's why LiveCode is (rarely) incompatible with UTF16. – Mark Apr 21 '13 at 09:29
  • It would be nice if you accepted my answer or told me why it doesn't help you. – Mark Jun 15 '13 at 11:27
  • The reason I have not clicked the 'accept' button yet is that it does not work properly yet, and I have not figured out what the problem actually is. I have added an example to the question. – z-- Jun 17 '13 at 11:07

1 Answer

In LiveCode, there are two ways to get a character from a UTF16 string. If the string is displayed in a field, you can do

select char 3 of fld 1

and if you have Russian or Polish text in the field, it will correctly select one character. However, this feature isn't very well developed in LiveCode and fails for many languages, including Chinese, Japanese and Arabic. Therefore, it is better to use bytes for now. The third two-byte character occupies bytes 5 and 6:

select byte 5 to 6 of fld 1

The latter will also be compatible with future versions of LiveCode, while the former may not be.

Anyway, you have your string in a variable, which means you have to handle the string as bytes (you could use chars, but bytes and chars are treated the same way in this case, because the data is in a variable). You can iterate through the variable in steps of two bytes, i.e. one two-byte character at a time. This assumes every character takes exactly two bytes, which does not hold for characters that UTF-16 encodes as surrogate pairs:

repeat with x = 1 to the number of bytes of theUFT16String step 2
  put byte x to x+1 of theUFT16String into myChar
  // do something with myChar here, e.g. reverse the bytes?
  put byte 2 of myChar & byte 1 of myChar after myNewString
end repeat
// myNewString now contains the entire theUFT16String with the two bytes of each character swapped.

(You could do this in 3 lines instead of 4, but for the purpose of the example I have added a line that stores the bytes in var myChar).
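
For the example in the question, a minimal sketch along the same lines could put each two-byte character on a line of its own. It assumes LiveCode's older Unicode handling (the uniEncode function and the unicodeText field property), a hypothetical field named "Output" and a hypothetical variable tLines, and a string without surrogate pairs:

```
put empty into tLines
repeat with x = 1 to the number of bytes of theUFT16string step 2
  -- take one two-byte character from the UTF-16 data
  put byte x to x + 1 of theUFT16string into myChar
  -- append it followed by a two-byte (UTF-16) line break
  put myChar & uniEncode(return) after tLines
end repeat
-- "Output" is a hypothetical field name; unicodeText expects UTF-16 data
set the unicodeText of field "Output" to tLines
```

For the string abcαβɣ this should leave one character on each line of the field. uniEncode(return) is used rather than a plain return so that the line break is also two bytes wide and in the same byte order as the rest of the data.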

Mark