Lua gmatch odd characters (Slovak alphabet)

Question

I am trying to extract the characters from a string of a word in Slovak. For example, the word for "TURTLE" is "KORYTNAČKA". However, it skips over the "Č" character when I try to extract it from the string:

local str = "KORYTNAČKA"
for c in str:gmatch("%a") do print(c) end
--result: K,O,R,Y,T,N,A,K,A

I am reading this page and I have also tried just pasting in the string itself as a set, but it comes up with something weird:

local str = "KORYTNAČKA"
for c in str:gmatch("["..str.."]") do print(c) end
--result: K,O,R,Y,T,N,A,Ä,Œ,K,A

Anyone know how to solve this?

Use texts encoded with 1-byte Slovak codepage instead of UTF-8 — Egor Skriptunoff, Apr 09 '14 at 06:32

score 5 · Accepted Answer · edited May 23 '17 at 12:30

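Lua is 8-bit clean: a string can hold any byte values, but the string library treats every byte as one character. The pattern "%a" matches a single byte that counts as a letter in the current locale, so the bytes of the UTF-8 encoded "Č" are skipped and the result is not what you expected.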

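The pattern "["..str.."]" appears to work because a UTF-8 encoded character may occupy more than one byte. The set built from the string contains each of those bytes individually, so every byte of "Č" matches, but each one as a separate single-byte match. That is why two odd characters show up in your output instead of one "Č".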
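To make the byte-level behaviour concrete, here is a small sketch that is not part of the original answer. It writes "Č" with the decimal escapes "\196\140" (its two UTF-8 bytes, 0xC4 0x8C) so the result does not depend on how the source file is saved; the variable names are only illustrative.

```lua
-- "Č" (U+010C) is encoded in UTF-8 as the two bytes 0xC4 0x8C (196, 140).
local c_caron = "\196\140"
local str = "KORYTNA" .. c_caron .. "KA"

-- The length operator counts bytes, not characters.
print(#c_caron)  -- 2
print(#str)      -- 11

-- A set built from the string holds every byte separately, so each
-- byte of "Č" produces its own one-byte match.
local matches = {}
for c in str:gmatch("[" .. str .. "]") do
    matches[#matches + 1] = c
end
print(#matches)  -- 11
```

The word has 10 letters but yields 11 matches, because "Č" is matched once per byte.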

If UTF-8 is used, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence in Lua 5.2, like this:

local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do 
    print(c) 
end

In Lua 5.1 (which is the version Corona SDK is using), use this:

local str = "KORYTNAČKA"
for c in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do 
    print(c) 
end

For details about this pattern, see Equivalent pattern to “[\0-\x7F\xC2-\xF4][\x80-\xBF]*” in Lua 5.1.
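If you need the characters in a table rather than printed one by one, the loop can be wrapped in a small helper. This is only a sketch: `utf8_chars` and `UTF8_CHAR` are hypothetical names, the input is assumed to be valid UTF-8, and "Č" is again written as "\196\140" so the example does not depend on the file encoding.

```lua
-- Lua 5.1 form of the pattern from the answer: one lead byte
-- followed by any number of continuation bytes (0x80-0xBF).
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

-- Hypothetical helper: split a UTF-8 string into a table of characters.
-- Assumes the input is valid UTF-8; stray bytes are not reported.
local function utf8_chars(s)
    local chars = {}
    for ch in s:gmatch(UTF8_CHAR) do
        chars[#chars + 1] = ch
    end
    return chars
end

local chars = utf8_chars("KORYTNA\196\140KA")
print(#chars)                  -- 10
print(chars[8] == "\196\140")  -- true
```

With the characters in a table, indexing and counting work per character instead of per byte.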

edited May 23 '17 at 12:30

Community

answered Apr 09 '14 at 06:11

Yu Hao

Adding a note here: the upcoming Lua 5.3 will add a basic `utf8` library. – Yu Hao Apr 09 '14 at 06:19
When I try this, it gives me error: "malformed pattern (missing ']')". I am programming in Lua using Corona SDK. – Omid Ahourai Apr 09 '14 at 07:21
@ArdentKid You got me, I think it's because Corona SDK is using Lua 5.1, but I haven't got it working yet. I asked [a question](http://stackoverflow.com/q/22956136/1009479) about this, and will update this answer once somebody answers mine. – Yu Hao Apr 09 '14 at 07:54
It looks like he's found a solution, however it's not really doing what I would like (parsing the individual characters) and doesn't include the odd character. – Omid Ahourai Apr 09 '14 at 08:17
@ArdentKid See the update, `"[%z\1-\127\194-\244][\128-\191]*"` should work in Lua 5.1 – Yu Hao Apr 09 '14 at 12:41

Petr Abdulin · Answer 2 · 2014-04-09T08:20:56.227

1

Lua has no built-in treatment for Unicode strings. You can see that Ä,Œ are the 2 bytes (0xC4 0x8C) of the UTF-8 encoding of the character Č.

Yu Hao already provided a sample solution, but for more details here is a good source.

I've tested this solution and found that it works properly in Lua 5.1 (reserve link). You could extract individual characters using the utf8sub function; see the sample.

edited Apr 09 '14 at 08:20

answered Apr 09 '14 at 06:16

Petr Abdulin

score 0 · Answer 3 · edited Jul 13 '18 at 12:26

local str = "KORYTNAČKA"
for c in string.gmatch(str, "[%z\1-\127\192-\253][\128-\191]*") do print(c) end

edited Jul 13 '18 at 12:26

tuomastik

answered Jul 13 '18 at 09:34

mengyue

score 0 · Answer 4 · answered Jul 14 '18 at 08:40

Use the utf8 plugin, then replace string.gmatch with utf8.gmatch.

Example (tested on Win7, it works for me)

yourfilename.lua

local utf8 = require( "plugin.utf8" )

for c in utf8.gmatch( "KORYTNAČKA", "%a" ) do print(c) end

and

build.settings

settings =
{
    plugins =
    {
        ["plugin.utf8"] =
        {
            publisherId = "com.coronalabs"
        },
    },      
}
