string.sub issue with non-English characters

Question

I need to get the first char of a text variable. I achieve this with one of the following simple methods:

string.sub(someText,1,1)

or

someText:sub(1,1)

If I do the following, I expect to get 'ñ' as the first letter. However, the result of either of the sub methods is 'Ã'.

local someText = 'ñññññññ'
print('Test whole: '..someText) 
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))

Here are the results from the console:

2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã

It seems like the string.sub() function is encoding the returned value in UTF-8. Just for kicks I tried using the utf8_decode() function that's provided by Corona SDK. It was not successful. The simulator indicated that the function expected a number but got nil instead.

I also searched the web to see if anyone else had run into this issue. I found a fair amount of discussion on Lua, Corona, Unicode and UTF-8, but I did not come across anything that addresses this specific problem.

"`string.sub` function is encoding the returned value in UTF-8"—that would only be the case if your source data was encoded as UTF-8. No standard Lua library changes encodings. Regardless, you absolutely must know the character set and encoding of all string data that you process (though often it is sufficient to know it is the system default). — Tom Blodget, Mar 02 '14 at 16:30

score 5 · Accepted Answer · edited Oct 12 '15 at 20:51

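Lua strings are 8-bit clean, which means a string in Lua is a sequence of bytes, and the string functions index bytes, not characters. In UTF-8 the character ñ (U+00F1) is encoded as two bytes, 0xC3 0xB1, so someText:sub(1,1) returns only the first byte, 0xC3. A console that interprets that lone byte as Latin-1 displays it as 'Ã'.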
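
A minimal sketch of what happens at the byte level, assuming the string is UTF-8 encoded. The bytes are written as decimal escapes so the result does not depend on how the source file is saved:

```lua
-- "ñ" (U+00F1) encoded as UTF-8 is the two bytes 195 (0xC3) and 177 (0xB1).
local s = "\195\177"

print(#s)             -- 2: the length operator counts bytes, not characters
print(s:byte(1, -1))  -- 195     177

-- sub(1, 1) returns only the lead byte, which is not valid UTF-8 by itself.
print(s:sub(1, 1) == "\195")  -- true
-- Both bytes are needed to get the whole character back.
print(s:sub(1, 2) == s)       -- true
```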

In UTF-8, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. Every other code point is encoded as a sequence of bytes in which the first byte is in the range 194-244 and the continuation bytes are in the range 128-191.

Because of this, you can use the pattern ".[\128-\191]*" to match a single UTF-8 code point (not a grapheme):

for c in "ñññññññ":gmatch(".[\128-\191]*") do -- pretend the first string is in NFC
    print(c)
end

Output:

ñ
ñ
ñ
ñ
ñ
ñ
ñ
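
The same pattern can be wrapped in small helpers to get the first character or to split a whole string. This is only a sketch: utf8_first and utf8_split are hypothetical names, not part of Lua or Corona, and both assume the input is valid UTF-8.

```lua
-- Pattern from the answer: one byte followed by any continuation bytes.
local UTF8_CODEPOINT = ".[\128-\191]*"

-- Hypothetical helper: returns the first code point of s, or nil if s is empty.
local function utf8_first(s)
    return s:match(UTF8_CODEPOINT)
end

-- Hypothetical helper: returns an array with one entry per code point.
local function utf8_split(s)
    local chars = {}
    for c in s:gmatch(UTF8_CODEPOINT) do
        chars[#chars + 1] = c
    end
    return chars
end

local someText = "\195\177a\195\177"  -- "ñañ" written as explicit UTF-8 bytes
local first = utf8_first(someText)
local chars = utf8_split(someText)

print(first == "\195\177")  -- true
print(#chars)               -- 3
```

Returning a table rather than printing inside the loop lets the caller index, count, or re-join the pieces with table.concat.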

edited Oct 12 '15 at 20:51 by Deduplicator
answered Mar 02 '14 at 15:41 by Yu Hao

Good answer but I think "ASCII" should be de-emphasized: "characters in the C0 Controls and Basic Latin block have the same representation (\000-\127) in many character sets and encodings currently in use." I'd argue that ASCII is not currently in use. – Tom Blodget Mar 02 '14 at 17:00
Thanks for your reply. That's very good information. One more thing would help me a lot: any ideas how I would split the rest of the text? This pattern matching finds the first letter. I want to be able to split off the rest of the letters and display them properly as well. – C. Ulker Mar 02 '14 at 17:34
@C.Ulker You can use `string.gmatch()` to split, see the update. – Yu Hao Mar 02 '14 at 17:50
@TomBlodget Really? I always thought ASCII is so well known that every programmer would know what it is. Anyway, I added the Wikipedia link just in case. – Yu Hao Mar 02 '14 at 17:51
My point was that _too many_ people know about ASCII. They think they are using it but they aren't and thereby run into problems like the question asker's. – Tom Blodget Mar 02 '14 at 18:10
@cpburnz: Are you aware of the difference between codepoint and grapheme? Both are legitimate definitions of character, as is byte btw... – Deduplicator Jul 28 '14 at 12:28
@Deduplicator I'm aware of the difference. When I reviewed your edit, I thought the change in wording added unnecessary complexity to the answer without further explaining the difference between character, code point, and grapheme. – Uyghur Lives Matter Jul 28 '14 at 14:32
@cpburnz: Well, I'll remember that right equals unnecessarily complex. Trouble is many are not aware of it, and pretending it does not exist perpetuates that... – Deduplicator Jul 28 '14 at 14:36
@Deduplicator Perhaps it would be best to ask about this on meta. My review is just my opinion after all. – Uyghur Lives Matter Jul 28 '14 at 16:37

score 0 · Answer 2 · answered Mar 02 '14 at 18:29

Regarding the character set in use: just know which requirements you bake into your own code, and make sure they are actually satisfied. Typical requirements include:

ASCII-compatible (aka any byte < 128 represents an ASCII character and all ASCII characters are represented as themselves)
Fixed-Size vs Variable-Width (maybe self-synchronizing) Character Set
No embedded 0-bytes

Write your code so that it depends on as few of those requirements as possible, and document the ones it does depend on.

Regarding "match a single UTF-8 character": be sure what you mean by a UTF-8 character. Is it a glyph or a code point? AFAIK you need full Unicode tables for glyph matching. Do you actually have to get to this level at all?
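
To make the distinction concrete, here is a small sketch that counts code points with the pattern from the accepted answer. It assumes valid UTF-8 input; count_codepoints is a hypothetical helper name. The same visible "ñ" can be stored as one precomposed code point or as "n" followed by a combining tilde, and a code point pattern treats the two differently.

```lua
local UTF8_CODEPOINT = ".[\128-\191]*"

-- Hypothetical helper: counts UTF-8 code points, not displayed characters.
local function count_codepoints(s)
    local n = 0
    for _ in s:gmatch(UTF8_CODEPOINT) do
        n = n + 1
    end
    return n
end

local precomposed = "\195\177"   -- U+00F1, "ñ" as a single code point (NFC form)
local decomposed  = "n\204\131"  -- "n" + U+0303 COMBINING TILDE (NFD form)

print(count_codepoints(precomposed))  -- 1
print(count_codepoints(decomposed))   -- 2
print(precomposed == decomposed)      -- false: the byte sequences differ
```

Splitting the decomposed form by code point separates the tilde from its base letter, which is why the loop in the accepted answer carries the "pretend the string is in NFC" caveat.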

answered Mar 02 '14 at 18:29 by Deduplicator

A 0-byte is handled in Lua strings just like any other byte value. It isn't something to be avoided. Data is data. Lua lets it be. – Tom Blodget Mar 02 '14 at 22:43
Actually, not all Lua library functions are fully 0-clean at the moment (5.2). Compare io.lines / file:lines / io.read / file.read with format *l or *L (std format). That's the only reason I listed point 3 for character sets without the disclaimer 'does not apply to Lua'. Also, I only added this as a different solution because I could not comment above and the first solution is deceptively wrong. – Deduplicator Mar 03 '14 at 16:15
Yeah, my knowledge about Unicode is rather limited, I saw what you suggested edit on my answer, including the ones that aren't approved. I'd love to see your solution to this question, I mean, write some code and let us see what you mean. – Yu Hao Mar 04 '14 at 07:16
And since you can't comment on my answer right now, you can leave comments here and @me, I'll update my answer once I see what's wrong about it. – Yu Hao Mar 04 '14 at 07:22
Ok, the short of it is this: Unicode describes CodePoints and how they are combined / compared / collated, as well as multiple encodings for those CodePoints. Using the Unicode rules, a Character can be composed of multiple CodePoints. Your pattern does match a single CodePoint encoded as UTF-8, and if you get lucky (or unlucky?) this CodePoint might be a complete Character in your test data. No guarantees though. At the moment http://en.wikipedia.org/wiki/Combining_character lists four separate Unicode ranges containing combining CodePoints. – Deduplicator Mar 04 '14 at 11:01
...continued: That is not the whole story btw, it just gets messier when you add Indian, Chinese, Hebrew and other scripts. Then you really have to split on grapheme clusters instead of graphemes, because that's what most resembles Characters. I won't get into that mess, because I do not understand it. So, pick a library which does, if you really have to, or study Unicode for real. Have fun. – Deduplicator Mar 04 '14 at 11:34
